
# webcrawler

## MVP todo

- simple program to GET an HTML resource and find links, both relative and absolute (see the fetch-and-extract sketch below)
- add the 'simple text analysis' feature: tally the words used on each page (see the word-tally sketch below)
- spin up a DB/Elasticsearch instance, connect to it, and insert links plus searchable page data (see the indexing sketch below)
- make the webcrawler 'oneshot' per page (it more or less already is; keep it simple by keeping it that way)
- perhaps a separate 'scheduler' service that fetches the latest 'unprocessed' URL from the DB and fires the 'oneshot' crawler on it (see the scheduler sketch below)
  - exponential backoff driven by content diffs: if a page has not changed since the last crawl, wait longer before recrawling it (be the good guy!)
  - also track a failure count per URL and stop after multiple failures (consider the URL broken, regardless of the actual error)
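
The repo doesn't show the implementation language, so the sketches below use Python purely for illustration. First, the 'GET an HTML resource & find links' step, standard library only; the URL and function names are placeholders, not the repo's actual code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_once(url):
    """GET one HTML page and return its text plus all links as absolute URLs."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    # urljoin resolves relative hrefs against the page URL and
    # leaves absolute hrefs untouched.
    links = [urljoin(url, href) for href in parser.hrefs]
    return html, links


if __name__ == "__main__":
    page, links = crawl_once("https://example.com/")  # placeholder URL
    print(f"found {len(links)} links")
```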
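
For the 'simple text analysis' item, tallying words can stay very small: strip the markup, lowercase, and count tokens with `collections.Counter`. Again a sketch under the same assumptions:

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def tally_words(html):
    """Return a Counter of lowercased words appearing in the page text."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks).lower()
    words = re.findall(r"[a-z0-9']+", text)
    return Counter(words)
```

Usage: `tally_words(page).most_common(10)` gives the ten most frequent words on a crawled page.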
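
Inserting links and searchable data into Elasticsearch might look like this, assuming the official `elasticsearch` Python client and a local dev instance; the `pages` index name and the document shape are made up for the example:

```python
# Requires the official Python client: pip install elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local dev instance


def index_page(url, word_counts, links):
    """Store one crawled page as a searchable document."""
    doc = {
        "url": url,
        "links": links,
        "word_counts": dict(word_counts),
    }
    # The 8.x client takes the body as `document=`; older 7.x clients
    # use `body=` instead.
    es.index(index="pages", id=url, document=doc)
```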
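
Finally, a rough shape for the 'scheduler' service with exponential backoff and a failure cutoff. The `queue` object, its methods, and the job fields are hypothetical DB glue, not an existing API; `crawl_once`, `tally_words`, and `index_page` come from the other sketches in this section:

```python
import time

MAX_FAILURES = 5          # give up on a URL after this many failed crawls
BASE_INTERVAL = 60 * 60   # seconds; starting recrawl interval


def next_interval(previous_interval, content_changed):
    """Exponential backoff: double the recrawl interval while the content
    stays the same, reset it once the page actually changes."""
    if content_changed:
        return BASE_INTERVAL
    return previous_interval * 2


def scheduler_loop(queue):
    """`queue` is a hypothetical DB-backed object with pop_unprocessed(),
    reschedule(url, interval), and record_failure(url); each job carries
    url, interval, failure_count, and last_content_hash fields."""
    while True:
        job = queue.pop_unprocessed()        # oldest URL that is due for a crawl
        if job is None:
            time.sleep(5)                    # nothing due yet
            continue
        if job.failure_count >= MAX_FAILURES:
            continue                         # considered broken, skip for good
        try:
            html, links = crawl_once(job.url)              # the 'oneshot' crawler
            changed = hash(html) != job.last_content_hash  # crude content diff
            index_page(job.url, tally_words(html), links)
            # Wait longer between crawls while the page stays unchanged,
            # reset the interval as soon as the content differs.
            queue.reschedule(job.url, next_interval(job.interval, changed))
        except Exception:
            queue.record_failure(job.url)    # bump the failure count
```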

## About

A mini webcrawler.
