Incremental Updates -- Scraper Piece #269

Closed
showerst opened this issue Mar 16, 2017 · 7 comments
Comments

@showerst

This is related to #169 but is a simplification; I just want to test the waters here.

Many jurisdictions allow searching by start/end dates, which would let us scrape only the items that changed in that range. This differs from the idea in #169 of scraping only the actual diffs from the database. The data ingest would proceed normally; I just want to scrape/save only the objects with changes, before the ingest step even happens.

I propose:

  1. We add two new optional flags, start_date and end_date
  2. We parse them into Python datetimes
  3. Scrapers can optionally implement a scrape_range function that takes start_date and end_date arguments and otherwise follows the signature of the scrape function.
  4. If there's no scrape_range, we'd either error out, or show a message and then scrape all.
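
A minimal sketch of how the four steps above might fit together, assuming an argparse-style command line. Only `start_date`, `end_date`, `scrape`, and `scrape_range` come from the proposal; `parse_date`, `build_parser`, `run_scraper`, and the `strict` switch are hypothetical names used for illustration, not part of any existing codebase.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Parse a flag value into a naive datetime (ISO-style input assumed)."""
    return datetime.fromisoformat(value)


def build_parser():
    """Hypothetical CLI parser exposing the two optional flags."""
    parser = argparse.ArgumentParser(prog="scrape")
    parser.add_argument("--start_date", type=parse_date, default=None)
    parser.add_argument("--end_date", type=parse_date, default=None)
    return parser


def run_scraper(scraper, start_date=None, end_date=None, strict=False):
    """Use scrape_range when dates are given and the scraper implements it."""
    if start_date is None and end_date is None:
        # No date flags: behave exactly like a normal full scrape.
        return list(scraper.scrape())

    scrape_range = getattr(scraper, "scrape_range", None)
    if scrape_range is None:
        if strict:
            # Option A from step 4: error out.
            raise NotImplementedError(
                f"{type(scraper).__name__} does not implement scrape_range"
            )
        # Option B from step 4: show a message, then scrape everything.
        print("scrape_range not implemented; scraping everything")
        return list(scraper.scrape())

    return list(scrape_range(start_date=start_date, end_date=end_date))
```

Whether the missing-`scrape_range` case should raise or fall back is the open question in step 4; the `strict` flag here only shows that both behaviours are cheap to support.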

Thoughts?

@fgregg
Contributor

fgregg commented Mar 16, 2017 via email

@showerst
Author

showerst commented Mar 16, 2017

That was my thought; we could either set an in-stone format (have to think through optional times and timezones!), or just use a datetime parser.

scrape nc --bills --start_date "2017-01-01" 
scrape nc --bills --start_date="2017-01-01 09:22" --end_date="2017-01-01 13:43" 

For context, I'm really thinking about this to make scrapes run from a cron job fast, and to reduce load on the jurisdiction sites/APIs.

Obviously it should be something a human can type, but I'm less concerned about the exact format. We could always go with ISO 8601 ("2008-09-15T15:53:00") if ambiguity is a concern.
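
As a rough sketch of the "in-stone format" option, the parser could accept a short, fixed list of formats covering the examples in this comment. `ACCEPTED_FORMATS` and `parse_cli_datetime` are hypothetical names, and the result is a naive datetime, so the timezone question raised above is left open.

```python
from datetime import datetime

# Formats taken from the examples in this thread, most specific first.
ACCEPTED_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",  # ISO 8601, e.g. 2008-09-15T15:53:00
    "%Y-%m-%d %H:%M",     # e.g. 2017-01-01 09:22
    "%Y-%m-%d",           # e.g. 2017-01-01
)


def parse_cli_datetime(text):
    """Return a naive datetime for the first format that matches."""
    cleaned = text.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(
        f"unrecognized date {text!r}; expected one of: "
        + ", ".join(ACCEPTED_FORMATS)
    )
```

A fixed list keeps the behaviour predictable for cron jobs; a general-purpose datetime parser would be more forgiving to type but can guess wrong on ambiguous input.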

@jamesturk
Member

jamesturk commented Mar 16, 2017 via email

@showerst
Author

@jamesturk oh sweet. I'll just table this until the weekend, sounds like it might turn out to be easy-ish. If it's something that we roll with I'd want to standardize it though so it's not just me tossing arguments everywhere.

One additional thought: Some jurisdictions do allow date filters, but limit the resolution to days. IMHO it should be up to the individual scraper writers to handle that case, but we should standardize whether that's a warning and it just scrapes with times truncated, or an error.
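
One way the warning-versus-error choice could be standardized is a shared helper that individual scrapers call when their source only filters by day. This is only a sketch: `to_day_resolution` and its `strict` parameter are hypothetical names, not an existing API.

```python
import warnings
from datetime import datetime, time


def to_day_resolution(value, strict=False):
    """Reduce a datetime to a date for sources that only filter by day.

    If the time-of-day part would be lost, warn (default) or raise (strict).
    """
    if value.time() != time(0, 0):
        message = f"source only supports day resolution; dropping time from {value}"
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return value.date()
```

Note that truncating an end date this way narrows the window, so a scraper may prefer to round the end date up to the next day instead.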

@showerst
Author

@jamesturk could you describe how to pass an arbitrary variable into a scraper?

The use case is as above: I want to pass a date and then use it in the bill scraper.

@jamesturk
Member

jamesturk commented Sep 25, 2017 via email

@showerst
Author

showerst commented Mar 1, 2018

Just closing this old bug. The local scrapers take a 'window' argument, so I'll adopt that if we ever do this.

Thanks!

showerst closed this as completed Mar 1, 2018