LPvdT/scraping-pokemon

Work In Progress

Scraping Pokémon Database using the Playwright library combined with asynchronous Python.

Todo

Poetry Bash scripts

Find an elegant way to wrap the following commands:

  • First run

    • poetry install
    • poetry run playwright install firefox
  • Scrape

    • poetry run scrape
  • Query

    • poetry run query

Docstrings

  • Write docstrings for all classes, functions, etc.

Logging

  • Add logging calls to all operations, writing both to the console and to log dump files.
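
A minimal sketch of what this could look like with the standard `logging` module, assuming one console handler and one file handler. The logger name `scraper`, the function name `configure_logging`, and the default file name `scraper.log` are hypothetical and not taken from this repository.

```python
import logging
import sys


def configure_logging(log_path: str = "scraper.log") -> logging.Logger:
    """Return a logger that writes to the console and to a log file."""
    logger = logging.getLogger("scraper")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Console: INFO and above, so the terminal stays readable.
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(logging.INFO)
    console.setFormatter(formatter)

    # File dump: everything, including DEBUG details.
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

Coroutines could then call `logging.getLogger("scraper").info(...)` around each scrape step without needing to know where the output ends up.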

Pokédex detail page: feature-scraper

Example Bulbasaur

  • Type defenses
  • Evolution chart
  • Bulbasaur changes
  • Name origin
  • Moves learned by Bulbasaur: requires another loop; a lot of work.

Data dumps: feature-scraper

  • Find a way to tie the scraped data together.
    • Everything should be grouped by a common key, e.g. per Pokémon.
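
One possible way to group the data, sketched under the assumption that every scraped fragment is a dict carrying the Pokémon name, a section label and a payload. The keys `pokemon`, `section` and `data` and the function `group_by_pokemon` are hypothetical; the scraper's real output shape may differ.

```python
from collections import defaultdict
from typing import Any


def group_by_pokemon(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge flat scrape records into one nested document per Pokémon."""
    grouped: defaultdict[str, dict[str, Any]] = defaultdict(dict)
    for record in records:
        name = record["pokemon"]     # hypothetical key: Pokémon name
        section = record["section"]  # hypothetical key: e.g. "name_origin"
        grouped[name][section] = record["data"]
    return dict(grouped)
```

The resulting per-Pokémon dict is also close to the shape a document store would expect, so the same structure could feed the NoSQL branch.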

NOSQL database: feature-db-nosql

  • Create document table for Pokémon details.
    • Insert data into it from the coroutines.
  • Create document tables for other document objects.

Relational database: feature-db-sql

  • Set up initial data models.
  • Finish and validate the data models.
  • Create CRUD methods in the classes.
  • Insert data into the models from the coroutines.
  • Optimize the models.
    • Convert to proper data types, instead of using TextField.
    • Set appropriate constraints (null, unique, etc.).
    • Create appropriate indices for the tables.
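
To illustrate the optimization items above, here is a sketch using the standard-library `sqlite3` module rather than the project's own model classes, which are not shown here. The table and column names (`pokemon`, `national_no`, `height_m`, `weight_kg`) are assumptions made only for the example.

```python
import sqlite3

# Hypothetical schema: typed columns, NOT NULL / UNIQUE constraints, one index.
SCHEMA = """
CREATE TABLE pokemon (
    id          INTEGER PRIMARY KEY,
    national_no INTEGER NOT NULL UNIQUE,
    name        TEXT    NOT NULL,
    height_m    REAL,
    weight_kg   REAL
);
CREATE INDEX idx_pokemon_name ON pokemon (name);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the example table and its index on the given connection."""
    conn.executescript(SCHEMA)
```

With constraints in place, a duplicate `national_no` or a missing `name` is rejected by the database itself instead of surfacing later as bad data.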

Bugs

Pokédex detail page: feature-scraper

Example Pikachu

  • Data below the table Base stats does not get scraped.
    • This happens when the table does not occur in the expected nth position.
    • Suggestion: locate each table relative to its header (e.g. Base stats) instead of by a fixed index, so its position is determined reliably.
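
The suggestion above could be sketched with an XPath that selects the first table following a given heading. This assumes the heading is an `h2` element whose text matches exactly, which has not been verified against the site's markup. The helper names are hypothetical, `page` is expected to be a Playwright `Page`, and the header text must not contain double quotes.

```python
def table_after_header_xpath(header_text: str, header_tag: str = "h2") -> str:
    """Build an XPath for the first <table> that follows a given heading."""
    # following::table[1] picks the nearest table after the heading in
    # document order, regardless of how many tables precede it on the page.
    return f'//{header_tag}[normalize-space()="{header_text}"]/following::table[1]'


def locate_table(page, header_text: str):
    """Return a locator for the table that follows `header_text`.

    `page` is assumed to expose Playwright's `locator()` method.
    """
    return page.locator("xpath=" + table_after_header_xpath(header_text))
```

For example, `locate_table(page, "Base stats")` would target the stats table on both the Bulbasaur and the Pikachu page, even if extra tables appear earlier on one of them.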

Concurrent Pokémon details: feature-concurrent-details

This branch still needs major work done.

  • Fix timeout issue
    • Currently, the asyncio.gather() call that scrapes the Pokémon detail pages in concurrent batches times out.
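
The cause of the timeout is not established here. One common mitigation, shown as a sketch and not as a confirmed fix, is to cap the number of pages scraped at once with an `asyncio.Semaphore` and give each task its own timeout. The names `gather_limited` and `worker` and the default limits are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def gather_limited(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 5,
    timeout: float = 30.0,
) -> list[Any]:
    """Run `worker` on every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: Any) -> Any:
        async with semaphore:
            # Per-task timeout, so one slow page cannot stall the whole batch.
            return await asyncio.wait_for(worker(item), timeout)

    # return_exceptions=True keeps failures in the result list instead of
    # raising on the first error.
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
```

Results come back in input order, and any entry that is an exception instance marks a page that failed or timed out and could be retried.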
