@honzajavorek honzajavorek commented Jan 29, 2025

WIP

  • Scrapy scraper doesn't scrape the correct number of items (why?). It seems to skip some, always different ones (Scrapy gives 50, Actor gives 52, website says 53)
  • fix pagination so that it only enqueues the next page
  • Actor scraper doesn't end - it's because of pagination! Scrapy can correctly dedupe, Actor cannot. Task: get a minimal reproducible example and file a bug on the SDK. In the end, @vdusek managed to reproduce this with minimal code and told me I don't have to debug this further
  • inspect what upgrades I did recently, scrapy version, etc.
  • fix parsing error
  • cache optimization: gzip cache
  • cache optimization: do lookups only for safe/idempotent requests; leaving this up to the fingerprinter
  • cache optimization: save cache under single key
  • cache optimization: remove 10 expired cache keys on each close (garbage collector)
  • run_scraper
  • double logging, remove my custom logging mechanics
  • remove Actor.log, see "feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio" apify/apify-sdk-python#390 (comment)
  • merge "Bump the python group with 3 updates" #111
  • after merging, put broken scraper back to schedule
  • create pull request to apify sdk with cache
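
A minimal sketch of the three cache items above (gzip, one key per cached response, removing at most 10 expired keys on close). It makes no claim about how this PR implements them. A plain dict stands in for the key-value store. `pack_entry`, `unpack_entry`, and `purge_expired` are hypothetical names, and pickle is only an illustrative serialization choice.

```python
from __future__ import annotations

import gzip
import pickle
import time


def pack_entry(metadata: dict, body: bytes, now: float | None = None) -> bytes:
    """Pack metadata, body, and a timestamp into one gzipped blob (one key per response)."""
    record = {
        "time": time.time() if now is None else now,
        "metadata": metadata,
        "body": body,
    }
    return gzip.compress(pickle.dumps(record, protocol=4))


def unpack_entry(blob: bytes) -> dict:
    """Reverse of pack_entry: decompress and deserialize one cache record."""
    return pickle.loads(gzip.decompress(blob))


def purge_expired(
    store: dict[str, bytes],
    max_age_secs: float,
    limit: int = 10,
    now: float | None = None,
) -> list[str]:
    """Delete at most `limit` expired entries from `store`; return the removed keys."""
    now = time.time() if now is None else now
    removed: list[str] = []
    # Iterate over a snapshot of the keys, because entries are deleted in the loop.
    for key in list(store):
        if len(removed) >= limit:
            break
        if now - unpack_entry(store[key])["time"] > max_age_secs:
            del store[key]
            removed.append(key)
    return removed
```

Capping the purge at a fixed number of keys keeps the cost of each close bounded. Stale entries then drain over several runs instead of in one long sweep.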

@honzajavorek honzajavorek marked this pull request as ready for review February 13, 2025 16:52
@honzajavorek honzajavorek merged commit 0275b10 into main Feb 13, 2025
1 check passed
@honzajavorek honzajavorek deleted the honzajavorek/threads branch February 13, 2025 16:54