
Forager API Reference

The Forager module is responsible for web scraping and data extraction in MacScrape.

Class: ForagerModel

Initialization

from models.forager import ForagerModel

forager = ForagerModel()

Methods

async sniff_data(urls: List[str], max_depth: int = 2) -> Dict

Crawls the given URLs and extracts data.

Parameters:

  • urls: List of URLs to crawl
  • max_depth: Maximum depth for crawling (default: 2)

Returns:

  • A dictionary containing the extracted data for each URL

Example:

data = await forager.sniff_data(["https://example.com"], max_depth=3)
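
Because sniff_data is a coroutine, it must be awaited inside an event loop. From a synchronous entry point you can drive it with asyncio.run; the sketch below iterates the result by URL, which assumes the returned dictionary is keyed by URL:

import asyncio

from models.forager import ForagerModel

async def main():
    forager = ForagerModel()
    data = await forager.sniff_data(["https://example.com"], max_depth=3)
    for url, page_data in data.items():  # assumes the dict is keyed by URL
        print(url, page_data)

if __name__ == "__main__":
    asyncio.run(main())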

Internal Methods

async _crawl(url: str, max_depth: int, current_depth: int = 0) -> Dict

Recursive method to crawl a single URL and its links.
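
The actual implementation is internal to ForagerModel. As an illustration only, the recursion could look roughly like the following; the use of aiohttp and the shape of the returned dictionary are assumptions, and the extraction helpers are the ones documented below:

import aiohttp
from bs4 import BeautifulSoup

# Illustrative sketch only; in the real module this is a method on ForagerModel.
async def _crawl(url, max_depth, current_depth=0):
    # Fetch and parse the page
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    data = {
        "headers": _extract_headers(soup),   # helpers documented below
        "links": _extract_links(soup, url),
    }
    # Follow discovered links until the maximum depth is reached
    if current_depth < max_depth:
        data["children"] = {
            link["url"]: await _crawl(link["url"], max_depth, current_depth + 1)
            for link in data["links"]
        }
    return data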

_extract_headers(soup: BeautifulSoup) -> Dict

Extracts header information from a BeautifulSoup object.
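
A minimal sketch of what this might do, assuming "header information" refers to HTML heading tags; the exact keys of the real return value are not documented here:

from bs4 import BeautifulSoup

# Illustrative sketch only; in the real module this is a method on ForagerModel.
def _extract_headers(soup: BeautifulSoup) -> dict:
    # Collect the text of each heading tag, grouped by level (h1..h6)
    return {
        level: [tag.get_text(strip=True) for tag in soup.find_all(level)]
        for level in ("h1", "h2", "h3", "h4", "h5", "h6")
    }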

_extract_links(soup: BeautifulSoup, base_url: str) -> List[Dict]

Extracts links from a BeautifulSoup object.
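
A sketch of link extraction that resolves relative hrefs against base_url with urllib.parse.urljoin; the "url" and "text" keys are assumptions for illustration:

from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Illustrative sketch only; in the real module this is a method on ForagerModel.
def _extract_links(soup: BeautifulSoup, base_url: str) -> list:
    links = []
    for anchor in soup.find_all("a", href=True):
        links.append({
            "url": urljoin(base_url, anchor["href"]),  # resolve relative hrefs
            "text": anchor.get_text(strip=True),
        })
    return links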

Crawling Process

graph TD
    A[Start] --> B[Fetch URL]
    B --> C{Max Depth Reached?}
    C -->|Yes| D[Extract Data]
    C -->|No| E[Parse HTML]
    E --> F[Extract Links]
    F --> G{More Links?}
    G -->|Yes| H[Select Next Link]
    H --> B
    G -->|No| D
    D --> I[Return Data]
    I --> J[End]

Configuration Options

Option               Description                                       Default
------               -----------                                       -------
user_agent           User agent string for requests                    "MacScrape/1.0"
timeout              Request timeout in seconds                        30
respect_robots_txt   Whether to respect robots.txt                     True
max_retries          Maximum number of retries for failed requests    3
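
How these options are supplied is not shown above. One plausible pattern, assuming they are accepted as keyword arguments by the ForagerModel constructor (an assumption, not a documented signature), would be:

forager = ForagerModel(
    user_agent="MacScrape/1.0",   # identify your crawler
    timeout=30,                   # seconds
    respect_robots_txt=True,
    max_retries=3,
)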

Error Handling

  • Network errors are retried with exponential backoff
  • Malformed HTML is handled gracefully, extracting whatever data is available
  • Timeouts and access denied errors are logged and skipped
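
The retry behaviour described above could be implemented roughly as follows; fetch_with_retries is a hypothetical helper, and the aiohttp usage is an assumption about the underlying HTTP client:

import asyncio
import aiohttp

async def fetch_with_retries(session, url, max_retries=3):
    # Retry network errors with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries + 1):
        try:
            async with session.get(url) as response:
                return await response.text()
        except aiohttp.ClientError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(2 ** attempt)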

Best Practices

  1. Respect website terms of service and robots.txt
  2. Implement rate limiting to avoid overloading servers (see the sketch after this list)
  3. Use caching to reduce unnecessary requests
  4. Regularly update the user agent string
  5. Handle different content types (HTML, JSON, XML) appropriately
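
As an illustration of point 2, a simple way to rate-limit crawling is to bound concurrency with an asyncio.Semaphore and pause briefly after each request; polite_get and the chosen limits are hypothetical:

import asyncio
import aiohttp

async def polite_get(session, url, semaphore, delay=1.0):
    # Bound concurrency via the shared semaphore, then pause briefly
    async with semaphore:
        async with session.get(url) as response:
            text = await response.text()
        await asyncio.sleep(delay)
        return text

async def crawl_politely(urls):
    semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(polite_get(session, url, semaphore) for url in urls)
        )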

Example: Custom Crawling Logic

The original override referenced a soup object that is never defined; the sketch below re-fetches and parses the page itself, assuming the base _crawl does not expose its parsed soup:

import aiohttp
from bs4 import BeautifulSoup

class CustomForager(ForagerModel):
    async def _extract_custom_data(self, soup):
        # Implement custom data extraction logic here
        return {}

    async def _crawl(self, url, max_depth, current_depth=0):
        data = await super()._crawl(url, max_depth, current_depth)
        # Re-fetch and parse the page, since the base _crawl does not expose its soup
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                soup = BeautifulSoup(await response.text(), "html.parser")
        data['custom_data'] = await self._extract_custom_data(soup)
        return data

Next Steps

Learn how to process and analyze the scraped data with the AI Regenerator API.