GitHub - fu/fass: This simple server enables scraping of websites with dynamic content.

FASS - FastAPI - Selenium - Scraper

This simple server enables scraping of websites with dynamic content. It exposes the parser via a REST API at http://localhost:8000/parse and accepts POST requests with a JSON body, e.g.:

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'

The payload contains a list of websites to scrape, each with a url, a name, a delay in seconds, and patterns. The first two keys are self-explanatory; the delay parameter defines how many seconds the Selenium driver should wait before the page is scraped. The patterns are a list of entities to extract from the page, each defined by a Python regular expression and a name that is used as the key in the returned JSON.

The example above returns:

{
    "name":"Github stars",
    "all_fields_matched":true,
    "Star Counter":["154"]
}

Please note that the matched values are always returned as a list, since all occurrences on the page are matched. If multiple Python regex groups are defined, the returned list will contain tuples.
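The list-versus-tuple behaviour described above is the same as that of Python's `re.findall`. The following is a minimal sketch, assuming the service collects matches in a comparable way (the README does not confirm which function is used); the HTML snippet and the two-group pattern are made up for illustration.

```python
import re

# Hypothetical page fragment, shaped like the markup the example regex targets.
html = '<span class="Counter js-social-count">154</span>'

# One group: findall returns a list of strings.
single = re.findall(r'js-social-count\">(?P<Stars>[0-9]*)</span>', html)
print(single)

# Two groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r'class="(\w+) js-social-count">([0-9]*)</span>', html)
print(pairs)
```

With one group the result is a flat list of strings; with two or more groups each match becomes a tuple holding one entry per group.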

Installation

From source

Clone this repo and build the image:

docker build -t fass_app .

From Docker hub

docker pull zerealfu/fass:latest

Running the service

docker run -d -p 8000:8000 fass_app

Then execute the curl command, for example:

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'
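
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the endpoint and payload shape from the curl example; `build_payload` and `post_payload` are hypothetical helper names, not part of FASS. Writing the regex as a raw Python string and letting `json.dumps` do the escaping avoids the hand-written `\\\"` sequence needed in the shell.

```python
import json
import urllib.request

# Assumed endpoint, taken from the curl example; adjust host/port as needed.
FASS_URL = "http://localhost:8000/parse/"


def build_payload(url, name, delay, patterns):
    """Return the request body: a list with one site entry.

    `patterns` maps a result name to a Python regex string.
    """
    return [
        {
            "url": url,
            "name": name,
            "delay": str(delay),  # the curl example sends the delay as a string
            "patterns": [
                {"name": pattern_name, "regex": regex}
                for pattern_name, regex in patterns.items()
            ],
        }
    ]


def post_payload(payload, endpoint=FASS_URL):
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    url="https://github.com/pymzml/pymzML/",
    name="Github stars",
    delay=1,
    patterns={
        "Star Counter": r'Counter js-social-count\">(?P<Stars>[0-9]*)</span>'
    },
)
body = json.dumps(payload, indent=4)
print(body)
# With the service running: result = post_payload(payload)
```

The printed body contains the same escaped regex as the curl command, so it can be compared against the payload above before sending anything to the server.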

Have fun :)
