Consider a less aggressive scraping strategy on Fridays #419

Closed
hancush opened this issue Mar 16, 2019 · 5 comments

Comments

@hancush
Collaborator

hancush commented Mar 16, 2019

Recently, our aggressive event scrapes failed with 104 status codes for a period of several hours. Metro uploaded agendas during this time, but they did not show up on the site because of the scraper failures (indeed, the very problem the aggressive scrapes were meant to prevent!). Consider a less aggressive schedule. @reginafcompton suggests two scrapes per hour: one complete and one with a small window.

@reginafcompton
Contributor

For reference: the crons themselves.

@shrayshray, to decode the crons: we scrape all events at 0, 15, 30, and 45 minutes after the hour. We implemented this solution to handle issues arising from Legistar timestamps that did not change as expected, e.g., #310 and #267.

We also aggressively scrape bills on Friday (again, to address timestamp issues, e.g., #328)

I think we should consider a multi-faceted revision of our crons:

  • Narrow the window for aggressive scraping. Currently, we scrape all bills and events, from 4:00 pm CST until 11:50 pm CST.
  • Minimize the number of times we hit Legistar by scraping all bills and events once an hour and adding small windowed scrapes once or twice per hour.

@shrayshray - can you let us know how this sounds, and how we should prioritize making changes?

@shrayshray
Collaborator

@reginafcompton this sounds good. This should be high priority. Do you have any concerns about implementing it right away, as there will also be an agenda posted this Friday?

@reginafcompton added this to the March 2019 Issues milestone Mar 18, 2019
@reginafcompton
Contributor

@shrayshray - I can implement this solution tomorrow. I'd rather test this with our upcoming agenda on Friday than wait until we have multiple agendas posted in April (which seems higher risk to me).

@shrayshray
Collaborator

@reginafcompton sounds like a plan, thank you!

@reginafcompton
Contributor

@shrayshray - we can close this issue via datamade/scrapers-us-municipal#32!

Summary of changes

On Fridays, from 2:00 to 10:00 pm CT:

  • we scrape all events, on the hour,
  • we scrape recently updated events, at 30 and 45 after the hour
  • we scrape all bills, at 5 after the hour
  • we scrape recently updated bills, at 35 and 50 after the hour

This minimizes the load we place on Legistar, which should prevent an abundance of 104s.
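For anyone who wants to sanity-check the schedule, here is a minimal Python sketch of the Friday plan summarized above. It is an illustration only, not the actual cron configuration from the linked pull request: the job labels, the `scrape_due` function, and the choice to treat the window as ending just before 10:00 pm are all assumptions.

```python
from datetime import datetime

# Minute-of-hour -> scrape job, transcribed from the summary above.
# The labels are hypothetical names, not the real scraper commands.
FRIDAY_SCHEDULE = {
    0: "all events",
    5: "all bills",
    30: "recent events",
    35: "recent bills",
    45: "recent events",
    50: "recent bills",
}

FRIDAY = 4              # datetime.weekday(): Monday is 0, so Friday is 4
WINDOW_START_HOUR = 14  # 2:00 pm CT
WINDOW_END_HOUR = 22    # 10:00 pm CT


def scrape_due(now):
    """Return the label of the scrape due at `now` (Central time), or None."""
    if now.weekday() != FRIDAY:
        return None
    # Assumption: the window runs from 14:00 through 21:59. The issue does
    # not say whether a scrape also fires at 10:00 pm itself.
    if not (WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR):
        return None
    return FRIDAY_SCHEDULE.get(now.minute)
```

Under this reading, Legistar is hit six times per hour during the window, and only two of those hits are full scrapes.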
