AnyCrawl

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. Native multi-threading for bulk processing.


📖 Overview

AnyCrawl is a high‑performance crawling and scraping toolkit:

  • SERP crawling: multiple search engines, batch‑friendly
  • Web scraping: single‑page content extraction
  • Site crawling: full‑site traversal and collection
  • High performance: multi‑threading / multi‑process
  • Batch tasks: reliable and efficient
  • AI extraction: LLM‑powered structured data (JSON) extraction from pages

LLM‑friendly. Easy to integrate and use.

🚀 Quick Start

📖 See full docs: Docs

📚 Usage Examples

💡 Use the Playground to test APIs and generate code in your preferred language.

If self‑hosting, replace https://api.anycrawl.dev with your own server URL.

Web Scraping (Scrape)

Example

```bash
curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'
```

Parameters

| Parameter | Type | Description | Default |
| --------- | ---- | ----------- | ------- |
| url | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| engine | string | Scraping engine to use. Options: cheerio (static HTML parsing, fastest), playwright (JavaScript rendering with modern engine), puppeteer (JavaScript rendering with Chrome) | cheerio |
| proxy | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: http://[username]:[password]@proxy:port | (none) |

More parameters: see Request Parameters.
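From Node.js, the same request can be made with the built‑in fetch API (Node 18+, run as an ES module). This is a minimal sketch against the documented scrape endpoint; reading the key from an ANYCRAWL_API_KEY environment variable is an assumption for illustration.

```typescript
// Minimal scrape request using Node 18+ global fetch (no dependencies).
// Assumes the API key is exported as ANYCRAWL_API_KEY.
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "cheerio", // or "playwright" / "puppeteer" for JS rendering
    }),
});

if (!response.ok) {
    throw new Error(`Scrape failed: ${response.status} ${await response.text()}`);
}
console.log(await response.json()); // inspect the returned page data
```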

LLM Extraction

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
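The schema in json_options is standard JSON Schema, so the extracted object can be mirrored as a TypeScript type. A sketch only; where the extracted object sits in the response envelope is not documented here, so log the raw body and adjust.

```typescript
// TypeScript shape implied by the JSON Schema in the request above.
interface CompanyInfo {
    company_mission: string; // "required" in the schema
    is_open_source?: boolean;
    employee_count?: number;
}

const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        url: "https://example.com",
        json_options: {
            schema: {
                type: "object",
                properties: {
                    company_mission: { type: "string" },
                    is_open_source: { type: "boolean" },
                    employee_count: { type: "number" },
                },
                required: ["company_mission"],
            },
        },
    }),
});

// Where the CompanyInfo payload lives in the body is an assumption to
// verify against the Request Parameters docs; print it first.
console.log(await response.json());
```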

Site Crawling (Crawl)

Example

```bash
curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "playwright",
  "max_depth": 2,
  "limit": 10,
  "strategy": "same-domain"
}'
```

Parameters

| Parameter | Type | Description | Default |
| --------- | ---- | ----------- | ------- |
| url | string (required) | Starting URL to crawl | - |
| engine | string | Crawling engine. Options: cheerio, playwright, puppeteer | cheerio |
| max_depth | number | Max depth from the start URL | 10 |
| limit | number | Max number of pages to crawl | 100 |
| strategy | enum | Scope: all, same-domain, same-hostname, same-origin | same-domain |
| include_paths | array | Only crawl paths matching these patterns | (none) |
| exclude_paths | array | Skip paths matching these patterns | (none) |
| scrape_options | object | Per-page scrape options (formats, timeout, JSON extraction, etc.), same as Scrape options | (none) |

More parameters and endpoints: see Request Parameters.
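The equivalent crawl request in TypeScript, again a Node 18+ sketch. Note that a site crawl may hand back a job to poll rather than the pages inline, so treat the logged response as a starting point and consult the endpoint docs.

```typescript
// Start a bounded same-domain crawl; mirrors the curl example above.
const response = await fetch("https://api.anycrawl.dev/v1/crawl", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "playwright",    // JavaScript rendering for dynamic pages
        max_depth: 2,            // follow links at most 2 hops from the start URL
        limit: 10,               // stop after 10 pages
        strategy: "same-domain", // stay on example.com
    }),
});
console.log(await response.json());
```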

Search Engine Results (SERP)

Example

```bash
curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'
```

Parameters

| Parameter | Type | Description | Default |
| --------- | ---- | ----------- | ------- |
| query | string (required) | Search query to be executed | - |
| engine | string | Search engine to use. Options: google | google |
| pages | integer | Number of search result pages to retrieve | 1 |
| lang | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |

Supported search engines

  • Google
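A matching search call in TypeScript (Node 18+), using only the parameters documented above:

```typescript
// SERP request; "pages" and "lang" follow the parameter table above.
const response = await fetch("https://api.anycrawl.dev/v1/search", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        query: "AnyCrawl",
        engine: "google", // the only documented option at the moment
        pages: 1,         // number of result pages to retrieve
        lang: "all",
    }),
});
console.log(await response.json()); // structured SERP results
```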

❓ FAQ

  1. Can I use proxies? Yes. AnyCrawl ships with a high‑quality default proxy, and you can also configure your own: set the proxy request parameter (per request) or ANYCRAWL_PROXY_URL (self‑hosting). See the sketch after this list.
  2. How do I handle JavaScript‑rendered pages? Use the Playwright or Puppeteer engines.
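The per‑request proxy from FAQ item 1, sketched in TypeScript. The proxy URL is a placeholder; the format follows the proxy parameter documented for the scrape endpoint.

```typescript
// Scrape through a caller-supplied proxy (placeholder host and credentials).
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        url: "https://example.com",
        // Format: http://[username]:[password]@proxy:port (HTTP and SOCKS supported)
        proxy: "http://user:pass@proxy.example.com:8080",
    }),
});
console.log(await response.json());
```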

🤝 Contributing

We welcome contributions! See the Contributing Guide.

📄 License

MIT License — see LICENSE.

🎯 Mission

We build simple, reliable, and scalable tools for the AI ecosystem.


Built with ❤️ by the Any4AI team
