Kind of Framework for Web Scraping

Goals: to build a moderately scalable data-collection system that stays flexible (including through extended configuration), with support for parallel scraping and proxies.

Features

  • Modular architecture with separate parsers for different websites
  • Asynchronous HTTP client built on aiohttp, with session management
  • Retry system for failed requests (illustrated in the sketch after this list)
  • Proxy support with automatic rotation (todo: will be moved to a separate service)
  • WIP: enhanced logging
  • Mechanism for saving data in various formats
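
A minimal, illustrative sketch of the async-client-plus-retries idea, not the project's actual infrastructure/http_client.py; the function name fetch_with_retries and the fixed-delay retry policy are assumptions for this example:

import asyncio
import aiohttp

async def fetch_with_retries(session: aiohttp.ClientSession, url: str,
                             retries: int = 3, delay: float = 1.0) -> str:
    # Fetch a URL, retrying on client/network errors with a fixed delay.
    for attempt in range(1, retries + 1):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError:
            if attempt == retries:
                raise
            await asyncio.sleep(delay)

async def main() -> None:
    # One shared session per run, mirroring the framework's session management.
    async with aiohttp.ClientSession() as session:
        html = await fetch_with_retries(session, "https://example.com")
        print(len(html))

asyncio.run(main())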

Todo:

  • Improve the generate-scraper logic
  • Finish integrating the middleware with metrics (they are collected but not displayed anywhere)
  • Implement db_storage
  • Simplify writing parsers by creating more "helpers" and extending core/base_scraper.py with new functionality

Project Architecture

scraper_project/
├── cli.py                 # Command-line interface
├── config/                # Configurations and config manager
│   ├── config_manager.py  # Configuration manager
│   ├── parsers_config.yaml # Main file for parser-specific settings
│   ├── config_models.py   # Pydantic models for configuration
│   └── ...
├── core/                  # System core
│   ├── base_scraper.py    # Base class for all scrapers
│   ├── data_models.py     # Data models
│   ├── exceptions.py      # Exceptions
│   └── ...
├── infrastructure/        # Infrastructure components
│   ├── http_client.py     # HTTP client
│   ├── proxy_manager.py   # Proxy management
│   ├── metrics.py         # Metrics collection (deprecated)
│   ├── storage/           # Storage subsystem
│   │   ├── base_storage.py
│   │   ├── csv_storage.py
│   │   └── ...
│   └── ...
├── parsers/               # Parser implementations and templates
│   ├── parser_factory.py  # Parser factory
│   ├── implementations/   # Specific parser implementations
│   │   ├── knifecenter_parser.py
│   │   └── ...
│   └── templates/         # Templates for auto-generation
│       └── scraper_template.py
├── tests/                 # Tests, grouped by modules
│
├── utils/                 # Utilities
│   ├── logger_factory.py  # Logger factory
│   ├── retry_utils.py     # Decorator for retries (deprecated)
│   ├── html_utils.py      # HTML utilities
│   └── ...
└── README.md

Installation

  1. Clone the repository:
git clone [email protected]:dontlosecontrol/pyscraper.git
cd pyscraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # for Linux/Mac
# or
venv\Scripts\activate  # for Windows
  3. Install dependencies:
pip install -r requirements.txt

Usage

Via Command Line

# List available parsers
python cli.py list-parsers

# Run a scraper for a specific site, saving results to CSV
# The filename will be generated automatically (e.g., knifecenter_YYYY-MM-DD.csv)
python cli.py scrape --parser knifecenter --urls https://www.knifecenter.com/knives.html --output-type csv

# Use a file with URLs and specify the number of concurrent tasks
python cli.py scrape --parser knifecenter --urls-file urls.txt --concurrency 5
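
The --urls-file argument takes a plain-text file; one URL per line is assumed here (check cli.py for the exact format):

# urls.txt (assumed format: one URL per line)
https://www.knifecenter.com/knives.html
https://www.knifecenter.com/fixed-blades.html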

Saving Results in JSON

To save results in JSON format, specify the storage type via the CLI or in the config/parsers_config.yaml configuration file for a specific parser. The output filename is generated automatically.

Example of specifying in config/parsers_config.yaml for the my_shop parser:

parsers:
  my_shop:
    storage:
      type: "json"
    # output_file will be generated automatically, e.g.: my_shop_YYYY-MM-DD.json

Or pass the storage type via CLI:

python cli.py scrape --parser knifecenter --urls https://example.com/category --output-type json

Results will be saved to an automatically generated file (e.g., knifecenter_YYYY-MM-DD.json) in JSON format.
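
If you need a format beyond the built-in ones, the storage subsystem (infrastructure/storage/) is organized around a base class. A hypothetical sketch of a custom backend follows; the BaseStorage class name and the save() signature are assumptions here, so check infrastructure/storage/base_storage.py for the real interface:

import json
from infrastructure.storage.base_storage import BaseStorage  # assumed class name

class JsonLinesStorage(BaseStorage):
    # Hypothetical backend: writes each scraped item as one JSON object per line.
    def __init__(self, output_file: str) -> None:
        self.output_file = output_file

    def save(self, items: list[dict]) -> None:  # assumed signature
        with open(self.output_file, "a", encoding="utf-8") as f:
            for item in items:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")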

Creating Your Own Parser

Quick Way: Auto-generation via CLI

You can automatically generate a template for a new scraper and register it in the system with a single command:

python cli.py generate-scraper --shop-name my_shop --description "Parser for My Shop"
  • A file parsers/implementations/my_shop_parser.py will be created based on the template.
  • The parser is automatically registered in the system via a decorator.
  • You will need to implement the parsing logic in the generated file and, if necessary, add or update the configuration for your parser in the config/parsers_config.yaml file.

Registration Decorator Example:

@register_parser_decorator('my_shop', 'Parser for My Shop')
class MyShopScraper(BaseScraper):
    ...

Next Steps:

  1. Implement the parser methods in the generated file.
  2. (Recommended) Add or update parser-specific settings in the config/parsers_config.yaml file as needed (e.g., base_url, selectors, etc.).

Example: parsers/implementations/knifecenter_parser.py
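
A rough, hypothetical sketch of an implemented parser; the parse_page hook and its signature are assumptions for illustration, since the actual method names are dictated by core/base_scraper.py:

from core.base_scraper import BaseScraper
from parsers.parser_factory import register_parser_decorator

@register_parser_decorator('my_shop', 'Parser for My Shop')
class MyShopScraper(BaseScraper):
    def parse_page(self, html: str) -> list[dict]:
        # Hypothetical hook. A real implementation would apply selectors or
        # XPaths (possibly read from config/parsers_config.yaml) using the
        # helpers in utils/html_utils.py, and return one dict per product.
        return []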

Alternative Way: Manually

  1. Create or update the section for your parser in the config/parsers_config.yaml file. This file overrides default settings and adds parser-specific parameters (e.g., user_agent, delay, or even XPaths).
parsers:
  # ...other parsers...
  my_shop:
    # Specific configuration for My Shop
    user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124 Safari/537.36"
    concurrency: 3
    timeout: 20
    # sessions_count, retries_count, and other parameters from ScraperConfig can be overridden here
    # You can also add custom parameters that your parser will read
    # e.g.: base_url: "https://my.shop.com"
  2. Create a parser class in the parsers/implementations/ directory (e.g., my_shop_parser.py):
from core.base_scraper import BaseScraper
from parsers.parser_factory import register_parser_decorator

@register_parser_decorator('my_shop', 'Parser for My Shop')
class MyShopScraper(BaseScraper):
    ...
  3. Implement the parser methods.

  4. Ensure that all necessary configuration for your parser is present in config/parsers_config.yaml, or that the default values from ScraperConfig are handled correctly.
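
Once registered, the new parser runs through the same CLI as the built-in ones (the URL here is a placeholder):

python cli.py scrape --parser my_shop --urls https://my.shop.com/category --output-type csv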


License

MIT
