Crawler

A high-performance web crawler / scraper in Elixir, with worker pooling and rate limiting via OPQ.

Features

Crawl assets (JavaScript, CSS and images).
Save to disk.
Hook for scraping content.
Restrict crawlable domains, paths or content types.
Limit concurrent crawlers.
Limit rate of crawling.
Set the maximum crawl depth.
Set timeouts.
Set the retry strategy.
Set the crawler's user agent.
Manually pause/resume/stop the crawler.

Architecture

A very high-level architecture diagram showing how Crawler works is available in architecture.svg in the repository.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

Use Crawler.Store.
Tap into the Registry Crawler.Store.DB.
Use your own scraper.
If the :save_to option is set, pages are saved to disk in addition to the places mentioned above.
Provide your own custom parser and manage how data is stored and accessed yourself.

Configurations

Option	Type	Default Value	Description
`:assets`	list	`[]`	Which asset files to fetch, available options: `"css"`, `"js"`, `"images"`.
`:save_to`	string	`nil`	When provided, the path for saving crawled pages.
`:workers`	integer	`10`	Maximum number of concurrent workers for crawling.
`:interval`	integer	`0`	Rate limit control - number of milliseconds before crawling more pages, defaults to `0` which is effectively no rate limit.
`:max_depths`	integer	`3`	Maximum nested depth of pages to crawl.
`:max_pages`	integer	`:infinity`	Maximum number of pages to crawl.
`:timeout`	integer	`5000`	Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`.
`:retries`	integer	`2`	Number of times to retry a fetch.
`:store`	module	`nil`	Module for storing the crawled page data and crawling metadata. You can set it to `Crawler.Store` or use your own module, see `Crawler.Store.add_page_data/3` for implementation details.
`:force`	boolean	`false`	Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data.
`:scope`	term	`nil`	Similar to `:force`, but you can pass a custom `:scope` to determine how Crawler should perform on links already seen.
`:user_agent`	string	`Crawler/x.x.x (...)`	User-Agent value sent by the fetch requests.
`:url_filter`	module	`Crawler.Fetcher.UrlFilter`	Custom URL filter, useful for restricting crawlable domains, paths or content types.
`:retrier`	module	`Crawler.Fetcher.Retrier`	Custom fetch retrier, useful for retrying failed crawls, nullifies the `:retries` option.
`:modifier`	module	`Crawler.Fetcher.Modifier`	Custom modifier, useful for adding custom request headers or options.
`:scraper`	module	`Crawler.Scraper`	Custom scraper, useful for scraping content as soon as the parser parses it.
`:parser`	module	`Crawler.Parser`	Custom parser, useful for handling parsing differently or to add extra functionalities.
`:encode_uri`	boolean	`false`	When set to `true`, applies `URI.encode` to the URL to be crawled.
`:queue`	pid	`nil`	You can pass in an `OPQ` pid so that multiple crawlers can share the same queue.
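
As in the Usage section, these options are passed to Crawler.crawl as a keyword list. The sketch below combines several of them; the module name `MyApp.SiteCrawl`, the save path and all values are hypothetical and chosen only for illustration.

```elixir
defmodule MyApp.SiteCrawl do
  # Hypothetical wrapper module; only the option names come from the table above.

  # Returns the crawl options as a keyword list. All values are illustrative.
  def crawl_opts do
    [
      workers: 5,             # at most 5 concurrent workers
      interval: 1_000,        # wait 1000 ms before crawling more pages
      max_depths: 2,          # follow links at most 2 levels deep
      max_pages: 100,         # stop after 100 pages
      timeout: 10_000,        # 10 second fetch timeout
      retries: 3,             # retry a failed fetch up to 3 times
      assets: ["css", "js"],  # also fetch stylesheets and scripts
      save_to: "/tmp/crawled" # hypothetical directory for saved pages
    ]
  end

  # Starts a crawl of the given URL with the options above.
  def run(url) do
    Crawler.crawl(url, crawl_opts())
  end
end
```

Calling `MyApp.SiteCrawl.run("http://elixir-lang.org")` would then start a crawl with these settings.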

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
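
As a rough sketch of what a custom retrier could look like, the module below retries with a fixed delay instead of exponential backoff. It assumes the behaviour's callback is `perform/2`, receiving a zero-arity fetch function and the crawl options, and that a failed fetch returns `{:error, reason}`; check Crawler.Fetcher.Retrier for the actual contract. The module name and the extra `delay_ms` argument are hypothetical.

```elixir
defmodule FixedDelayRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Assumed callback shape: perform(fetch_url, opts), where fetch_url is a
  # zero-arity function. delay_ms is a hypothetical extra argument that
  # defaults to 500 ms between attempts.
  def perform(fetch_url, opts, delay_ms \\ 500) do
    attempt(fetch_url, opts[:retries] || 2, delay_ms)
  end

  # Calls the fetch function and retries on {:error, _} while retries remain.
  defp attempt(fetch_url, retries_left, delay_ms) do
    case fetch_url.() do
      {:error, _reason} when retries_left > 0 ->
        Process.sleep(delay_ms)
        attempt(fetch_url, retries_left - 1, delay_ms)

      result ->
        # Either a success, or the last error once retries are exhausted.
        result
    end
  end
end
```

It would be enabled with `Crawler.crawl(url, retrier: FixedDelayRetrier)`.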

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
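
For example, a filter that restricts crawling to a fixed set of hosts might look like the sketch below. It assumes the callback is `filter/2`, taking the URL and the crawl options and returning `{:ok, boolean}`; see Crawler.Fetcher.UrlFilter to confirm. The module name and host list are hypothetical.

```elixir
defmodule AllowedHostsUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Hypothetical allow-list of hosts that may be crawled.
  @allowed_hosts ["elixir-lang.org", "hexdocs.pm"]

  # Assumed callback shape: returns {:ok, true} to crawl the URL,
  # {:ok, false} to skip it.
  def filter(url, _opts) do
    host = URI.parse(url).host
    {:ok, host in @allowed_hosts}
  end
end
```

It would be enabled with `Crawler.crawl(url, url_filter: AllowedHostsUrlFilter)`.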

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
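
A minimal sketch of a scraper that prints each page's title follows. It assumes the callback is `scrape/1`, receiving a page with `:url` and `:body` fields and returning `{:ok, page}`; see Crawler.Scraper for the actual contract. The module name and the `title/1` helper are hypothetical, and the regular expression is a simplification rather than a real HTML parser.

```elixir
defmodule TitleScraper do
  @behaviour Crawler.Scraper.Spec

  # Assumed callback shape: receives the parsed page and returns {:ok, page}.
  def scrape(%{url: url, body: body} = page) do
    IO.puts("#{url}: #{title(body) || "(no title)"}")
    {:ok, page}
  end

  # Hypothetical helper: extracts the text of the first <title> element,
  # or returns nil when there is none.
  def title(body) do
    case Regex.run(~r/<title[^>]*>(.*?)<\/title>/is, body) do
      [_match, title] -> String.trim(title)
      nil -> nil
    end
  end
end
```

It would be enabled with `Crawler.crawl(url, scraper: TitleScraper)`.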

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end

Modifier

See Crawler.Fetcher.Modifier.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
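
As an illustration, the sketch below adds a custom request header. It assumes the behaviour defines two callbacks, `headers/1` returning a list of `{name, value}` tuples and `opts/1` returning extra request options; see Crawler.Fetcher.Modifier to confirm. The module name and header value are hypothetical.

```elixir
defmodule CustomHeadersModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec

  # Assumed callback: extra request headers as {name, value} tuples.
  def headers(_opts), do: [{"Accept-Language", "en"}]

  # Assumed callback: extra request options; none are added here.
  def opts(_opts), do: []
end
```

It would be enabled with `Crawler.crawl(url, modifier: CustomHeadersModifier)`.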

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, as shown below.

{:ok, opts} = Crawler.crawl("https://elixir-lang.org")

Crawler.running?(opts) # => true

Crawler.pause(opts)

Crawler.running?(opts) # => false

Crawler.resume(opts)

Crawler.running?(opts) # => true

Crawler.stop(opts)

Crawler.running?(opts) # => false

Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity); otherwise the parser will time out due to unprocessed links.
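
A crawl that is meant to be paused could therefore be started as in the sketch below. The module and function names are hypothetical; only `Crawler.crawl`, `Crawler.pause` and the `:timeout` option come from this README.

```elixir
defmodule MyApp.PausableCrawl do
  # Hypothetical wrapper module.

  # Options for a crawl that may stay paused for a long time:
  # an infinite timeout prevents timeouts on unprocessed links.
  def crawl_opts, do: [timeout: :infinity]

  # Starts the crawl, pauses it immediately and returns the opts
  # needed to resume or stop it later.
  def start_paused(url) do
    {:ok, opts} = Crawler.crawl(url, crawl_opts())
    Crawler.pause(opts)
    opts
  end
end
```

The returned opts can later be passed to `Crawler.resume/1` or `Crawler.stop/1`.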

Multiple Crawlers

It is possible to start multiple crawlers sharing the same queue.

{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)

Crawler.crawl("https://elixir-lang.org", queue: queue)
Crawler.crawl("https://github.com", queue: queue)

Find All Scraped URLs

Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]

Examples

Google Search + GitHub

This example performs a Google search, then scrapes the results to find GitHub projects and outputs their names and descriptions.

See the source code.

You can run the example by cloning the repo and running the command:

mix run -e "Crawler.Example.GoogleSearch.run()"

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

Copyright and License

This work is free. You can redistribute it and/or modify it under the terms of the MIT License.
