Skip to content

DataCrawl-AI/datacrawl

Repository files navigation

cover

Tiny Web Crawler

CI Coverage badge Stable Version License: MIT Download Stats Discord

A simple and efficient web crawler for Python.

Features

  • Crawl web pages and extract links starting from a root URL recursively
  • Concurrent workers and custom delay
  • Handle relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider
from tiny_web_crawler import SpiderSettings

settings = SpiderSettings(
    root_url = 'http://github.com',
    max_links = 2
)

spider = Spider(settings)
spider.start()


# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

settings = SpiderSettings(
    root_url = 'https://github.com',
    max_links = 5,
    max_workers = 5,
    delay = 1,
    verbose = False
)

spider = Spider(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "https://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
      }
    }
}

Contributing

Thank you for considering to contribute.

  • If you are a first time contributor you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting to work on an issue. Please get it assigned to you so that we can avoid multiple people from working on the same issue.
  • We are working on doing our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install poetry in your system pipx install poetry
  • Clone the repo you forked
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • pre-commit install (see)
  • pre-commit install --hook-type pre-push

Before raising a PR. Please make sure you have these checks covered

  • An issue exists or is created which address the PR
  • Tests are written for the changes
  • All lint/test passes