Hydra: multithreaded site-crawling link checker in Python

A Python program that ~~crawls~~ slithers 🐍 a website for links and prints a YAML report of broken links.

Requires

Python 3.6 or higher.

There are no external dependencies, Neo.

Usage

$ python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL

Positional arguments:

URL: The URL of the website to crawl. It must be absolute and include the scheme, e.g. https://example.com.

Optional arguments:

-h, --help: Show help message and exit
--config CONFIG, -c CONFIG: Path to a configuration file

The broken links report is written to stdout in YAML format. To save it to a file, redirect the output:

python hydra.py [URL] > [PATH/TO/FILE.yaml]

You can add the current date to the filename using a command substitution, such as:

python hydra.py [URL] > /path/to/$(date '+%Y_%m_%d')_report.yaml

To see how long Hydra takes to check your site, prefix the command with time:

time python hydra.py [URL]
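
If you would rather drive Hydra from Python than from a shell, the redirect and dated filename shown above can be reproduced with the standard library. This is a minimal sketch and not part of Hydra: `report_path` and `run_hydra` are hypothetical helper names, and the sketch assumes hydra.py sits in the current working directory.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def report_path(directory, day=None):
    """Build a dated report path such as <directory>/2024_01_05_report.yaml."""
    day = day or date.today()
    return Path(directory) / f"{day:%Y_%m_%d}_report.yaml"


def run_hydra(url, out_path, script="hydra.py"):
    """Run Hydra on url, write its stdout to out_path, return its exit code.

    The script location is an assumption; adjust it to where hydra.py lives.
    """
    with open(out_path, "w", encoding="utf-8") as fh:
        result = subprocess.run([sys.executable, script, url], stdout=fh, check=False)
    return result.returncode
```

For example, `run_hydra("https://example.com", report_path("reports"))` would write the report to a dated file under reports/ (the directory must already exist) and return Hydra's exit code.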

GitHub Action

You can easily incorporate Hydra as part of an automated process using the link-snitch action.

Configuration

Hydra can accept an optional JSON configuration file for specific parameters, for example:

{
    "OK": [
        200,
        999,
        403
    ],
    "attrs": [
        "href"
    ],
    "exclude_scheme_prefixes": [
        "tel"
    ],
    "tags": [
        "a",
        "img"
    ],
    "threads": 25,
    "timeout": 30,
    "graceful_exit": "True"
}

To use a configuration file, supply the filename:

python hydra.py https://example.com --config ./hydra-config.json
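
To illustrate how a partial configuration file can combine with defaults, here is a minimal sketch that overlays the keys found in a JSON file on the default values listed under "Possible settings". It is not Hydra's actual loading code: `DEFAULTS` and `load_settings` are hypothetical names, and the `False` default for graceful_exit is an assumption, since this README does not document one.

```python
import json

# Defaults as documented in this README. The graceful_exit default is not
# documented, so False is an assumption made for this sketch.
DEFAULTS = {
    "OK": [200, 999],
    "attrs": ["href", "src"],
    "exclude_scheme_prefixes": ["tel:", "javascript:"],
    "tags": ["a", "link", "img", "script"],
    "threads": 50,
    "timeout": 60,
    "graceful_exit": False,
}


def load_settings(path=None):
    """Return a copy of the defaults, overridden by any keys in the JSON file."""
    settings = dict(DEFAULTS)
    if path is not None:
        with open(path, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    return settings
```

With this approach, a file containing only `{"threads": 25}` changes the worker count and leaves every other setting at its default.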

Possible settings:

OK - HTTP response codes to consider as a successful link check. Defaults to [200, 999].
attrs - Attributes of the HTML tags to check for links. Defaults to ["href", "src"].
exclude_scheme_prefixes - URL scheme prefixes to exclude from checking. Defaults to ["tel:", "javascript:"].
tags - HTML tags to check for links. Defaults to ["a", "link", "img", "script"].
threads - Maximum number of worker threads to run. Defaults to 50.
timeout - Maximum seconds to wait for HTTP response. Defaults to 60.
graceful_exit - If set to True, return exit code 0 even when broken links are present. Otherwise, broken links result in exit code 1.
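
The settings above can be pictured with a short standard-library sketch. It is illustrative only and does not reproduce Hydra's implementation: `LinkCollector`, `find_broken`, `exit_code` and the injected `fetch_status` callable are all hypothetical names. It shows tags and attrs selecting which values count as links, exclude_scheme_prefixes filtering them, OK and threads driving the check, and graceful_exit deciding the exit code.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect link values from the configured tags and attributes."""

    def __init__(self, base_url, tags, attrs, exclude_scheme_prefixes):
        super().__init__()
        self.base_url = base_url
        self.tags = set(tags)
        self.attrs = set(attrs)
        self.exclude = tuple(exclude_scheme_prefixes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        for name, value in attrs:
            if name not in self.attrs or not value:
                continue
            if value.startswith(self.exclude):
                continue  # e.g. tel: or javascript: links are skipped
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, value))


def find_broken(urls, fetch_status, ok_codes, threads):
    """Return {url: status} for every URL whose status is not in ok_codes.

    fetch_status is any callable mapping a URL to a status code; a real one
    would issue the HTTP request and honour the timeout setting.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        statuses = list(pool.map(fetch_status, urls))
    return {u: s for u, s in zip(urls, statuses) if s not in ok_codes}


def exit_code(broken_count, graceful_exit):
    """Exit 1 only when links are broken and graceful_exit is not enabled.

    Any truthy value counts as enabled, including the string "True".
    """
    if broken_count and not graceful_exit:
        return 1
    return 0
```

Passing `fetch_status` in as a parameter keeps the sketch free of network calls, so the filtering and exit-code logic can be tried out in isolation.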

Test

Run:

python -m unittest tests/test.py
