AnyCrawl is a high‑performance crawling and scraping toolkit:
- SERP crawling: multiple search engines, batch‑friendly
- Web scraping: single‑page content extraction
- Site crawling: full‑site traversal and collection
- High performance: multi‑threading / multi‑process
- Batch tasks: reliable and efficient
- AI extraction: LLM‑powered structured data (JSON) extraction from pages
LLM‑friendly. Easy to integrate and use.
📖 See full docs: Docs
💡 Use the Playground to test APIs and generate code in your preferred language.
If self‑hosting, replace `https://api.anycrawl.dev` with your own server URL.
Scrape a single page:

```bash
curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "cheerio"
  }'
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `url` | string (required) | The URL to be scraped. Must be a valid URL starting with `http://` or `https://` | - |
| `engine` | string | Scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering with a modern engine), `puppeteer` (JavaScript rendering with Chrome) | `cheerio` |
| `proxy` | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (none) |
More parameters: see Request Parameters.
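For instance, to render a JavaScript-heavy page through your own proxy, combine the documented `engine` and `proxy` parameters (a minimal sketch; the proxy URL is a placeholder):

```bash
curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "playwright",
    "proxy": "http://user:pass@proxy.example.com:8080"
  }'
```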
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"schema": {
"type": "object",
"properties": {
"company_mission": { "type": "string" },
"is_open_source": { "type": "boolean" },
"employee_count": { "type": "number" }
},
"required": ["company_mission"]
}
}
}'
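The extracted data conforms to the supplied schema. A result might look like the following (purely illustrative values, not actual output for example.com):

```json
{
  "company_mission": "Illustrative placeholder text",
  "is_open_source": true,
  "employee_count": 42
}
```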
Crawl a site starting from a URL:

```bash
curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "playwright",
    "max_depth": 2,
    "limit": 10,
    "strategy": "same-domain"
  }'
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `url` | string (required) | Starting URL to crawl | - |
| `engine` | string | Crawling engine. Options: `cheerio`, `playwright`, `puppeteer` | `cheerio` |
| `max_depth` | number | Max depth from the start URL | 10 |
| `limit` | number | Max number of pages to crawl | 100 |
| `strategy` | enum | Scope: `all`, `same-domain`, `same-hostname`, `same-origin` | `same-domain` |
| `include_paths` | array | Only crawl paths matching these patterns | (none) |
| `exclude_paths` | array | Skip paths matching these patterns | (none) |
| `scrape_options` | object | Per-page scrape options (formats, timeout, JSON extraction, etc.), same as Scrape options | (none) |
More parameters and endpoints: see Request Parameters.
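For example, to restrict a crawl to a site's blog section and extract structured data from each page, combine the path filters with `scrape_options` (a sketch using only the parameters documented above; the `/blog/*` patterns assume glob-style matching and are placeholders):

```bash
curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "limit": 50,
    "strategy": "same-domain",
    "include_paths": ["/blog/*"],
    "exclude_paths": ["/blog/tag/*"],
    "scrape_options": {
      "json_options": {
        "schema": {
          "type": "object",
          "properties": { "title": { "type": "string" } }
        }
      }
    }
  }'
```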
Run a SERP search:

```bash
curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `query` | string (required) | Search query to be executed | - |
| `engine` | string | Search engine to use. Options: `google` | `google` |
| `pages` | integer | Number of search result pages to retrieve | 1 |
| `lang` | string | Language code for search results (e.g., `en`, `zh`, `all`) | en-US |
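For example, to retrieve several result pages in one request, use the documented `pages` parameter (a minimal sketch):

```bash
curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "engine": "google",
    "pages": 3,
    "lang": "en"
  }'
```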
- Can I use proxies? Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the `proxy` request parameter (per request) or `ANYCRAWL_PROXY_URL` (self‑hosting).
- How do I handle JavaScript‑rendered pages? Use the `playwright` or `puppeteer` engines.
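When self‑hosting, an instance‑wide default proxy can be set via the environment variable mentioned above (a sketch; the proxy URL is a placeholder):

```bash
export ANYCRAWL_PROXY_URL="http://user:pass@proxy.example.com:8080"
```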
We welcome contributions! See the Contributing Guide.
MIT License — see LICENSE.
We build simple, reliable, and scalable tools for the AI ecosystem.