GrabLinks

Synopsis

grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.
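
To illustrate the basic idea of pulling links out of an HTML document, here is a minimal sketch. It is not the code of grablinks.py, which relies on requests and beautifulsoup4; to stay dependency-free it swaps in `html.parser` from the standard library. The names `LinkCollector` and `extract_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # anchors without an href are ignored
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The href corresponds to what the `%url%` placeholder refers to and the collected text to `%text%`; how grablinks.py itself gathers the link text may differ from this sketch.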

Requirements

Python 3 is required (any version above 3.5 should do fine), along with the third-party Python modules requests and beautifulsoup4. Both modules can be installed with Python's package manager pip, e.g.:

pip install --user requests
pip install --user beautifulsoup4

Usage

usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [--images] [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>') from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragmental URLs to
                        absolute URLs (after filtering)
  --images              extract `<img src=""/>' instead `<a href=""/>'.

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by space, are evaluated
                        with an logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues
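
The filter options described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's implementation: `filter_links` and its parameters are hypothetical, the regular expression is applied with `re.search` (the help text only says the URL "matches"), and `urllib.parse.urljoin` stands in for whatever `--fix-links` does internally. The sketch follows the documented behavior that classes are combined with a logical OR and that links are fixed after filtering.

```python
import re
from urllib.parse import urljoin


def filter_links(links, base_url, classes=None, search=None, regex=None,
                 fix_links=False):
    """Filter (href, class_attr) pairs; class_attr may be None.

    Hypothetical helper mirroring --class, --search, --regex and --fix-links.
    """
    wanted = set(classes.split()) if classes else None
    pattern = re.compile(regex) if regex else None
    result = []
    for href, class_attr in links:
        # --class: keep the link if it carries at least one wanted class (OR)
        if wanted is not None and not (wanted & set((class_attr or "").split())):
            continue
        # --search: plain substring test against the URL
        if search is not None and search not in href:
            continue
        # --regex: assumption, pattern may match anywhere in the URL
        if pattern is not None and not pattern.search(href):
            continue
        # --fix-links: resolve relative and fragment URLs after filtering
        result.append(urljoin(base_url, href) if fix_links else href)
    return result
```

Because fixing happens after filtering in this sketch, a search string or pattern is tested against the URL as it appears in the document, not against the resolved absolute URL.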

Usage Examples

# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…

# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do this at home. Piping unverified output to sh is dangerous! 😱

# alternatively just pass to wget directly:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' | sort -u | wget -i-

# extract and handle links like
# <a href="https://example.com/a-cryptic-ID">proper-filename.ext</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. It is
#       recommended to review the generated script before executing it
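
Since `%text%` is not sanitized, one way to build such command lines more safely is to quote each value before it reaches the shell. The following is a minimal sketch using `shlex.quote` from the Python standard library; `wget_line` is a hypothetical helper and not part of grablinks.py, and it assumes the URL and link text have already been obtained as separate strings.

```python
import shlex


def wget_line(url, text):
    """Build one shell-safe wget command line (hypothetical helper).

    shlex.quote() wraps a value in single quotes when it contains
    characters the shell would otherwise interpret.
    """
    return "wget {} -O {}".format(shlex.quote(url), shlex.quote(text))


print(wget_line("https://example.com/a-cryptic-ID", "it's; rm -rf ~"))
```

Quoting only prevents the shell from interpreting the text. It does not stop a link text that contains path separators from pointing the output file at an unexpected location, so reviewing the generated script remains advisable.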

History

1.9  28-Dec-2024  Identify with proper user agents for remote requests
                  --fix-links: Update input/response URL in case of redirections
                  --fix-links: Improved handling of some path edge-cases
                  Avoid unnecessary (re-)encoding (assume all loaded data as bytes)
                  Added basic support for 'file://' URIs
1.8  21-Nov-2024  Added support for "<img src="">" via '--images'.
1.7  21-Jan-2024  Disable urllib3 warnings when '--insecure' is used.
1.6   2-Dec-2023  Added '--insecure' argument to disable SSL/TLS certificate verification
                  Added support for '%text%' placeholder in format string (<a>text</a>)
1.5  24-Nov-2022  Added a (fixed) timeout to the remote request.
1.4  30-May-2022  Improved handling of passing multiple classes to '--class'.
1.3   6-Feb-2021  Fix: handling of common edge cases when '--fix-links' is used.
1.2  16-Aug-2020  Fix: in some cases links from "<a>" tags without a 'class' attribute were not part of the result.
1.1   7-Jun-2020  Initial public source code release