Forager API Reference
The Forager module is responsible for web scraping and data extraction in MacScrape.
Class: ForagerModel
Initialization
Methods
async sniff_data(urls: List[str], max_depth: int = 2) -> Dict
Crawls the given URLs and extracts data.
Parameters:
- urls: List of URLs to crawl
- max_depth: Maximum depth for crawling (default: 2)
Returns:
- A dictionary containing the extracted data for each URL
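A minimal usage sketch; the import path macscrape.forager is an assumption, and the example assumes the returned dictionary is keyed by URL and that sniff_data is awaited inside a standard asyncio event loop:

import asyncio

from macscrape.forager import ForagerModel  # hypothetical import path

async def main():
    forager = ForagerModel()
    results = await forager.sniff_data(
        ["https://example.com", "https://example.org"],
        max_depth=2,
    )
    for url, page_data in results.items():
        print(url, page_data)

asyncio.run(main())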
Internal Methods
async _crawl(url: str, max_depth: int, current_depth: int = 0) -> Dict
Recursively crawls a single URL and follows its links up to max_depth.
_extract_headers(soup: BeautifulSoup) -> Dict
Extracts header information from a BeautifulSoup object.
_extract_links(soup: BeautifulSoup, base_url: str) -> List[Dict]
Extracts links from a BeautifulSoup object.
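The bodies of these helpers are not reproduced here; an illustrative sketch of what _extract_links could look like, assuming it resolves relative hrefs against base_url and returns one dictionary per anchor:

from typing import Dict, List
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def _extract_links(soup: BeautifulSoup, base_url: str) -> List[Dict]:
    """Collect every anchor's absolute URL and link text."""
    links = []
    for anchor in soup.find_all("a", href=True):
        links.append({
            "url": urljoin(base_url, anchor["href"]),  # resolve relative links
            "text": anchor.get_text(strip=True),
        })
    return links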
Crawling Process
graph TD
A[Start] --> B[Fetch URL]
B --> C{Max Depth Reached?}
C -->|Yes| D[Extract Data]
C -->|No| E[Parse HTML]
E --> F[Extract Links]
F --> G{More Links?}
G -->|Yes| H[Select Next Link]
H --> B
G -->|No| D
D --> I[Return Data]
I --> J[End]
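A simplified sketch of how _crawl could implement the loop in the diagram, assuming aiohttp as the HTTP client; the actual client and the exact shape of the returned dictionary are not documented here, and this method would live inside ForagerModel:

import aiohttp
from bs4 import BeautifulSoup

async def _crawl(self, url, max_depth, current_depth=0):
    # Fetch URL
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()

    # Parse HTML and extract data
    soup = BeautifulSoup(html, "html.parser")
    data = {"headers": self._extract_headers(soup)}

    # Follow links only while the maximum depth has not been reached
    if current_depth < max_depth:
        data["links"] = {}
        for link in self._extract_links(soup, url):
            data["links"][link["url"]] = await self._crawl(
                link["url"], max_depth, current_depth + 1
            )

    # Return data
    return data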
Configuration Options
| Option | Description | Default |
|---|---|---|
| user_agent | User agent string for requests | "MacScrape/1.0" |
| timeout | Request timeout in seconds | 30 |
| respect_robots_txt | Whether to respect robots.txt | True |
| max_retries | Maximum number of retries for failed requests | 3 |
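How these options are supplied is not documented in this section; a small sketch, assuming they are accepted as keyword arguments when the forager is created:

forager = ForagerModel(
    user_agent="MacScrape/1.0",
    timeout=30,
    respect_robots_txt=True,
    max_retries=3,
)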
Error Handling
- Network errors are retried with exponential backoff (see the sketch after this list)
- Malformed HTML is handled gracefully, extracting whatever data is available
- Timeouts and access denied errors are logged and skipped
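A generic sketch of the retry-with-exponential-backoff pattern described above; the delay values and the use of aiohttp are illustrative assumptions, not the module's documented behaviour:

import asyncio

import aiohttp

async def fetch_with_retries(session, url, max_retries=3, base_delay=1.0):
    """Retry transient failures, doubling the delay after each attempt."""
    for attempt in range(max_retries + 1):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries:
                raise  # out of retries; the caller logs and skips the URL
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff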
Best Practices
- Respect website terms of service and robots.txt
- Implement rate limiting to avoid overloading servers (a minimal sketch follows this list)
- Use caching to reduce unnecessary requests
- Regularly update the user agent string
- Handle different content types (HTML, JSON, XML) appropriately
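A minimal asyncio rate limiter that the crawler could await before each request; the one-second interval is an illustrative assumption:

import asyncio
import time

class RateLimiter:
    """Allow at most one request per min_interval seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request = time.monotonic()

Calling await limiter.wait() immediately before each fetch keeps request spacing at or above the configured interval.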
Example: Custom Crawling Logic
class CustomForager(ForagerModel):
    async def _extract_custom_data(self, data):
        # Implement custom data extraction logic on the crawled data
        return {}

    async def _crawl(self, url, max_depth, current_depth=0):
        data = await super()._crawl(url, max_depth, current_depth)
        # Augment the base crawl result with custom fields
        data['custom_data'] = await self._extract_custom_data(data)
        return data
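Using the subclass is then the same as using ForagerModel directly; the call below is assumed to run inside an asyncio event loop:

forager = CustomForager()
results = await forager.sniff_data(["https://example.com"], max_depth=1)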
Next Steps
Learn how to process and analyze the scraped data with the AI Regenerator API.