Skip to content
/ fass Public

This simple server enables scraping of website with dynamic content.

License

Notifications You must be signed in to change notification settings

fu/fass

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FASS - FastAPI - Selenium - Scraper

Latest Tag Build Status License Docker Pulls Docker Image Size

This simple server enables scraping of website with dynamic content. It exposes the parser via rest API: http://localhost:8000/parse and accepts POST in the form of, e.g.

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'

the payload contains a list of websites to scrape, each containing the url a name, delay in seconds and patterns. The two first kwargs are self explenatory, the delay parameters defines how many seconds the selenium driver should wait until the page is scraped. The pattern represent a list of entities to extract from the page, defined by Python regex expression and a name which will be used in the returned json.

The example above return:

{
    "name":"Github stars",
    "all_fields_matched":true,
    "Star Counter":["154"]
}

Please note that the matched values are always a list since we match all occurences on page. If multiple Python regex groups are defined, the returned list will contain tuples.

Installation

From source

Clone this repo and

docker build -t fass_app .

From Docker hub

docker pull zerealfu/fass:latest

Running the service

docker run -d -p 8000:8000 fass_app

then execute the curl for example:

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'

Have fun :)

About

This simple server enables scraping of website with dynamic content.

Resources

License

Stars

Watchers

Forks

Packages

No packages published