Forager API Reference
The Forager module is responsible for web scraping and data extraction in MacScrape.
Class: ForagerModel
Initialization
Methods
async sniff_data(urls: List[str], max_depth: int = 2) -> Dict
Crawls the given URLs and extracts data.
Parameters:
- urls: List of URLs to crawl
- max_depth: Maximum depth for crawling (default: 2)
Returns:
- A dictionary containing the extracted data for each URL
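A minimal usage sketch; the import path macscrape.forager is an assumption, and the example assumes the returned dictionary is keyed by URL and that sniff_data is awaited inside a standard asyncio event loop:

import asyncio

from macscrape.forager import ForagerModel  # hypothetical import path

async def main():
    forager = ForagerModel()
    results = await forager.sniff_data(
        ["https://example.com", "https://example.org"],
        max_depth=2,
    )
    for url, page_data in results.items():
        print(url, page_data)

asyncio.run(main())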
Internal Methods
async _crawl(url: str, max_depth: int, current_depth: int = 0) -> Dict
Recursively crawls a single URL and follows its links up to max_depth.
_extract_headers(soup: BeautifulSoup) -> Dict
Extracts header information from a BeautifulSoup object.
_extract_links(soup: BeautifulSoup, base_url: str) -> List[Dict]
Extracts links from a BeautifulSoup object.
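The bodies of these helpers are not reproduced here; an illustrative sketch of what _extract_links could look like, assuming it resolves relative hrefs against base_url and returns one dictionary per anchor:

from typing import Dict, List
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def _extract_links(soup: BeautifulSoup, base_url: str) -> List[Dict]:
    """Collect every anchor's absolute URL and link text."""
    links = []
    for anchor in soup.find_all("a", href=True):
        links.append({
            "url": urljoin(base_url, anchor["href"]),  # resolve relative links
            "text": anchor.get_text(strip=True),
        })
    return links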
Crawling Process
graph TD
A[Start] --> B[Fetch URL]
B --> C{Max Depth Reached?}
C -->|Yes| D[Extract Data]
C -->|No| E[Parse HTML]
E --> F[Extract Links]
F --> G{More Links?}
G -->|Yes| H[Select Next Link]
H --> B
G -->|No| D
D --> I[Return Data]
I --> J[End]
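A simplified sketch of how _crawl could implement the loop in the diagram, assuming aiohttp as the HTTP client; the actual client and the exact shape of the returned dictionary are not documented here, and this method would live inside ForagerModel:

import aiohttp
from bs4 import BeautifulSoup

async def _crawl(self, url, max_depth, current_depth=0):
    # Fetch URL
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()

    # Parse HTML and extract data
    soup = BeautifulSoup(html, "html.parser")
    data = {"headers": self._extract_headers(soup)}

    # Follow links only while the maximum depth has not been reached
    if current_depth < max_depth:
        data["links"] = {}
        for link in self._extract_links(soup, url):
            data["links"][link["url"]] = await self._crawl(
                link["url"], max_depth, current_depth + 1
            )

    # Return data
    return data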
Configuration Options
| Option | Description | Default |
|---|---|---|
| user_agent | User agent string for requests | "MacScrape/1.0" |
| timeout | Request timeout in seconds | 30 |
| respect_robots_txt | Whether to respect robots.txt | True |
| max_retries | Maximum number of retries for failed requests | 3 |
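How these options are supplied is not documented in this section; a small sketch, assuming they are accepted as keyword arguments when the forager is created:

forager = ForagerModel(
    user_agent="MacScrape/1.0",
    timeout=30,
    respect_robots_txt=True,
    max_retries=3,
)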
Error Handling
- Network errors are retried with exponential backoff (see the sketch after this list)
- Malformed HTML is handled gracefully, extracting whatever data is available
- Timeouts and access denied errors are logged and skipped
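A generic sketch of the retry-with-exponential-backoff pattern described above; the delay values and the use of aiohttp are illustrative assumptions, not the module's documented behaviour:

import asyncio

import aiohttp

async def fetch_with_retries(session, url, max_retries=3, base_delay=1.0):
    """Retry transient failures, doubling the delay after each attempt."""
    for attempt in range(max_retries + 1):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries:
                raise  # out of retries; the caller logs and skips the URL
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff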
Best Practices
- Respect website terms of service and robots.txt
- Implement rate limiting to avoid overloading servers (a minimal sketch follows this list)
- Use caching to reduce unnecessary requests
- Regularly update the user agent string
- Handle different content types (HTML, JSON, XML) appropriately
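A minimal asyncio rate limiter that the crawler could await before each request; the one-second interval is an illustrative assumption:

import asyncio
import time

class RateLimiter:
    """Allow at most one request per min_interval seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request = time.monotonic()

Calling await limiter.wait() immediately before each fetch keeps request spacing at or above the configured interval.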
Example: Custom Crawling Logic
class CustomForager(ForagerModel):
    async def _extract_custom_data(self, data):
        # Implement custom data extraction logic on the crawled data
        return {}

    async def _crawl(self, url, max_depth, current_depth=0):
        data = await super()._crawl(url, max_depth, current_depth)
        # Augment the base crawl result with custom fields
        data['custom_data'] = await self._extract_custom_data(data)
        return data
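Using the subclass is then the same as using ForagerModel directly; the call below is assumed to run inside an asyncio event loop:

forager = CustomForager()
results = await forager.sniff_data(["https://example.com"], max_depth=1)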
Next Steps
Learn how to process and analyze the scraped data with the AI Regenerator API.