Scraping Pokémon Database using the Playwright library combined with asynchronous Python.
Think of something elegant for the following:
## First run

```shell
poetry install
poetry run playwright install firefox
```

## Scrape

```shell
poetry run scrape
```

## Query

```shell
poetry run query
```
- Write docstrings for all classes, functions, etc.
- Add logging calls to all operations, writing to both the console and log dumps.
- Type defenses
- Evolution chart
- Bulbasaur changes
- Name origin
- Moves learned by Bulbasaur: requires another loop; a lot of work.
- Find a way to properly tie the data together,
  - such that everything is properly grouped by e.g. Pokémon.
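One way to group everything by Pokémon is a single record object that the independent coroutines all write into. A minimal sketch, assuming a `dataclass`-based model whose field names (`type_defenses`, `evolution_chart`, `name_origin`) are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PokemonRecord:
    """One object per Pokémon, tying the separately scraped pieces together."""
    name: str
    type_defenses: dict[str, float] = field(default_factory=dict)
    evolution_chart: list[str] = field(default_factory=list)
    name_origin: str = ""

records: dict[str, PokemonRecord] = {}

def record_for(name: str) -> PokemonRecord:
    # Each coroutine looks up (or creates) the shared record for its Pokémon,
    # so all results end up grouped under one key.
    return records.setdefault(name, PokemonRecord(name))

record_for("Bulbasaur").type_defenses["fire"] = 2.0
record_for("Bulbasaur").evolution_chart = ["Bulbasaur", "Ivysaur", "Venusaur"]
```

Because `record_for` is keyed on the name, two coroutines scraping different pages for the same Pokémon fill in the same record instead of producing disconnected rows.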
- Create a document table for Pokémon details.
- Insert data into it from the coroutines.
- Create document tables for other document objects.
- Set up initial data models.
- Finish and validate the data models.
- Create CRUD methods in the classes.
- Insert data into the models from the coroutines.
- Optimize the models
  - Convert to proper data types, instead of using `TextField`.
  - Set appropriate constraints (`null`, `unique`, etc.).
  - Create appropriate indices for the tables.
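The target of those three sub-items can be sketched in plain SQLite (the project's ORM models are not shown here, and the column names below are assumptions, not the actual schema): typed columns instead of text everywhere, `NOT NULL`/`UNIQUE` constraints, and an index on the lookup column.

```python
import sqlite3

# Hypothetical target schema: proper types, constraints, and an index,
# rather than storing every value as free text.
DDL = """
CREATE TABLE pokemon (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,  -- unique: no duplicate Pokémon rows
    national_no INTEGER NOT NULL,      -- integer instead of text
    height_m    REAL,                  -- nullable: may be missing for a row
    weight_kg   REAL
);
CREATE INDEX idx_pokemon_national_no ON pokemon (national_no);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute(
    "INSERT INTO pokemon (name, national_no, height_m, weight_kg) VALUES (?, ?, ?, ?)",
    ("Bulbasaur", 1, 0.7, 6.9),
)
row = conn.execute(
    "SELECT national_no, height_m FROM pokemon WHERE name = ?", ("Bulbasaur",)
).fetchone()
```

The same shape carries over to ORM models: each `TextField` becomes the matching typed field, with `null`/`unique` options and indexes declared on the field or table.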
- Data below the Base stats table does not get scraped.
  - This happens when the table does not occur in the expected `nth` position.
  - Suggestion: find tables relative to the position of their header (e.g. Base stats), in order to properly determine their location.
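The header-relative lookup can be prototyped without a browser. The sketch below uses stdlib XML parsing on a made-up stand-in for a detail page; the same idea in Playwright might look like `page.locator('xpath=//h2[text()="Base stats"]/following-sibling::table[1]')`, assuming the headers on the real page are `<h2>` elements.

```python
from xml.etree import ElementTree

# Made-up stand-in for a detail page: two headers, two tables,
# so nth-based selection would be fragile.
HTML = """
<div>
  <h2>Pokedex data</h2><table><tr><td>extra</td></tr></table>
  <h2>Base stats</h2><table><tr><td>HP</td><td>45</td></tr></table>
</div>
"""

def table_after_header(root, header_text):
    """Return the first <table> sibling that follows the matching <h2>."""
    children = list(root)
    for i, el in enumerate(children):
        if el.tag == "h2" and (el.text or "").strip() == header_text:
            for sibling in children[i + 1:]:
                if sibling.tag == "table":
                    return sibling
    return None

root = ElementTree.fromstring(HTML)
table = table_after_header(root, "Base stats")
cells = [td.text for td in table.iter("td")]
# cells == ["HP", "45"]
```

Anchoring on the header text makes the scrape robust to extra tables being inserted above, which is exactly the failure mode described.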
This branch still needs major work.
- Fix timeout issue
  - Currently, the `asyncio.gather()` call for the concurrent batch scraping of Pokémon details causes a timeout.
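One common cause is an unbounded `gather()` opening every page at once. A sketch of a bounded batch, with a semaphore capping concurrency and `wait_for` bounding the whole batch; `scrape_detail` is a placeholder for the real Playwright coroutine, not the project's actual function:

```python
import asyncio

async def scrape_detail(name: str, sem: asyncio.Semaphore) -> str:
    # Placeholder for the real detail scrape (page.goto + parsing).
    async with sem:                 # cap how many pages run at once
        await asyncio.sleep(0.01)   # stand-in for the network work
        return name.upper()

async def scrape_all(names: list[str], limit: int = 5, timeout: float = 30.0):
    sem = asyncio.Semaphore(limit)
    tasks = [scrape_detail(n, sem) for n in names]
    # return_exceptions=True keeps one failed page from sinking the batch;
    # wait_for bounds the whole gather instead of letting it hang.
    return await asyncio.wait_for(
        asyncio.gather(*tasks, return_exceptions=True), timeout
    )

results = asyncio.run(scrape_all(["bulbasaur", "ivysaur", "venusaur"]))
# results == ["BULBASAUR", "IVYSAUR", "VENUSAUR"]
```

`gather()` preserves input order, so results still line up with the Pokémon list even though only `limit` scrapes run concurrently.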