The purpose of this API is to expose all the data that was available to scrape from https://www.urparts.com/index.cfm/page/catalogue/.
We're using `uv` to manage our dependencies.
There is a Makefile with a couple of handy recipes:

- `make format` reformats the whole codebase using `ruff`;
- `make check` runs static code analysis using `ruff` & `mypy`;
- `make test` runs the test suite for the whole application;
- `make build` builds a Docker image for the application;
- `make migrate` applies `alembic` migrations to the database;
- `make scrape` runs the data scraping process;
- `make up` runs the API;
- `make down` stops all Docker containers.
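A typical end-to-end local run might chain these recipes in the following order (a hypothetical sequence, assuming Docker is running and the database starts empty):

```
make build     # build the Docker image
make migrate   # bring the database schema up to date
make scrape    # populate the database from urparts.com
make up        # serve the API on localhost:8080
```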
We're using `aiohttp` to request all the pages in the parts catalogue, and then `BeautifulSoup4` to parse the data of interest.
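To make the fetch-then-parse flow concrete, here is a minimal sketch. The HTML snippet, the `allmakes` CSS selector, and the function names are illustrative assumptions, not the project's actual code or the site's actual markup:

```python
import aiohttp
from bs4 import BeautifulSoup

# Hypothetical sample of a catalogue page; the real markup may differ.
SAMPLE_HTML = """
<div class="c_container allmakes">
  <ul>
    <li><a href="index.cfm/page/catalogue/Ammann">Ammann</a></li>
    <li><a href="index.cfm/page/catalogue/Case">Case</a></li>
  </ul>
</div>
"""


def parse_manufacturers(html: str) -> list[str]:
    """Extract manufacturer names from a catalogue page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("div.allmakes a")]


async def fetch_page(session: aiohttp.ClientSession, url: str) -> str:
    """Download one catalogue page as text."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()


manufacturers = parse_manufacturers(SAMPLE_HTML)
```

In the real scraper the same pattern repeats one level down for categories, models and parts.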
Everything runs in an asynchronous context, although not every possible step is parallelized.
Manufacturers are scraped sequentially (one by one), while the categories, models & parts are fetched
in parallel.
A semaphore is used to limit concurrency and avoid hammering the website (it was raising timeouts otherwise).
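The semaphore pattern described above can be sketched as follows. The limit of 3 and the simulated fetch are illustrative, not the project's actual values:

```python
import asyncio

CONCURRENCY_LIMIT = 3  # illustrative; the real limit is tuned to the site

peak = 0    # highest number of simultaneously running fetches observed
active = 0  # fetches currently inside the semaphore


async def fetch_one(semaphore: asyncio.Semaphore, item: int) -> int:
    global peak, active
    async with semaphore:          # blocks while LIMIT fetches are in flight
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        active -= 1
        return item


async def scrape_all(n: int) -> list[int]:
    # Create the semaphore inside the running event loop and share it.
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    return list(await asyncio.gather(*(fetch_one(semaphore, i) for i in range(n))))


results = asyncio.run(scrape_all(10))
```

`asyncio.gather` preserves input order, so results come back sorted even though the fetches overlap.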
A full scrape takes between 6 and 7 minutes for 4.4M parts (and far fewer manufacturers, categories & models) on a 2021 M1 Pro, i.e. roughly 11,000 parts per second.
The task is now largely CPU-bound, so it will consume a lot of CPU; thanks to the parallelism it is no longer dominated by I/O waits.
Once the application is running, the OpenAPI specification can be found at http://localhost:8080/docs.
The overall structure of the repository & API is based on domains. Each domain has its own category & directory.
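A hypothetical layout for such a domain-based structure might look like this (the actual directory names in the repository may differ):

```
app/
├── manufacturers/   # models, repository & routes for manufacturers
├── categories/
├── models/
├── parts/
└── scraper/         # aiohttp + BeautifulSoup scraping code
```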
To keep the solution simple, a service/controller layer wasn't introduced.
For demonstration purposes, only one test is written.
It will fail if there are already any categories in the database (there's no schema separation).