Incremental Updates -- Scraper Piece #269
Comments
How would start_date and end_date be set? Command line arguments?
On Thu, Mar 16, 2017 at 2:35 PM, showerst wrote:
This is related to #169 but a simplification, just want to test the waters here.
Many jurisdictions allow searching by start/end dates, allowing for us to only scrape items that were changed. This differs from the idea of only scraping the actual diffs from the database in #169. The data ingest would proceed normally, I just only want to scrape/save objects with changes before the ingest step even happens.
I propose:
1. We add two new optional flags, start_date and end_date
2. We parse them into python datetimes
3. Scrapers can optionally implement a scrape_range function, that adds start_date and end_date function arguments, and otherwise follows the function signature of the scrape function.
4. If there's no scrape_range, we'd either error out, or show a message and then scrape all.
Thoughts?
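The proposal above could be sketched roughly as follows. This is only an illustration, not pupa code: `run_scrape`, the `strict` switch, and both example scraper classes are hypothetical names made up here. It assumes the two flags have already been parsed into datetimes.

```python
from datetime import datetime


def run_scrape(scraper, start_date=None, end_date=None, strict=False):
    """Use scrape_range() when a date range is given and the scraper has one."""
    if start_date is None and end_date is None:
        return list(scraper.scrape())

    scrape_range = getattr(scraper, "scrape_range", None)
    if callable(scrape_range):
        return list(scrape_range(start_date=start_date, end_date=end_date))

    # Step 4 of the proposal: either error out, or show a message and scrape all.
    name = type(scraper).__name__
    if strict:
        raise NotImplementedError(f"{name} does not implement scrape_range()")
    print(f"{name} has no scrape_range(); scraping everything")
    return list(scraper.scrape())


class FullOnlyScraper:
    """Hypothetical scraper without date-range support."""

    def scrape(self):
        yield "bill-1"
        yield "bill-2"


class RangeScraper:
    """Hypothetical scraper whose source can filter by last-changed date."""

    # Made-up data: (identifier, last-changed timestamp)
    ITEMS = [("bill-1", datetime(2017, 1, 1)), ("bill-2", datetime(2017, 3, 1))]

    def scrape(self):
        for name, _changed in self.ITEMS:
            yield name

    def scrape_range(self, start_date=None, end_date=None):
        for name, changed in self.ITEMS:
            if start_date is not None and changed < start_date:
                continue
            if end_date is not None and changed > end_date:
                continue
            yield name
```

With this shape, `run_scrape(RangeScraper(), start_date=datetime(2017, 2, 1))` only yields items changed on or after that date, while a scraper without `scrape_range` falls back to a full scrape unless `strict=True` is passed.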
|
That was my thought; we could either set an in-stone format (have to think through optional times and timezones!), or just use a datetime parser.
For context, I'm really thinking about this to enable scrapes run on a cron job to be fast, and to reduce load on the jurisdiction sites/APIs. Obviously it should be something a human can type, but I'm less concerned about the exact format. We could always go with ISO 8601 ("2008-09-15T15:53:00") if ambiguity was a concern. |
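As a rough sketch of the "in-stone format" option, the parser below accepts a small fixed set of formats with an optional time. The function name, the list of accepted formats, and the choice to treat every value as UTC are all assumptions for illustration; the thread does not settle how timezones should be handled.

```python
from datetime import datetime, timezone

# Accepted formats, most specific first. The space-separated form is an
# assumption about what a human would type; the first is plain ISO 8601.
FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def parse_cli_datetime(value, default_tz=timezone.utc):
    """Parse a date or datetime string; a missing time means midnight."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Assumption: values carry no timezone, so attach a default one.
        return parsed.replace(tzinfo=default_tz)
    raise ValueError(f"unrecognized date format: {value!r}")
```

A fixed list like this rejects ambiguous input outright, whereas a general-purpose datetime parser would accept more but might guess differently than the user intended.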
So this is actually already supported by virtue of pupa scrapers taking any arguments we want (as long as they're all optional).
I'm not opposed to having some standard arguments available to scrapers going forward. I had some thoughts as I was porting OS scrapers about how that might be done, which I'll write up, but we could at least start to experiment w/ a state or two without any pupa changes.
On Thu, Mar 16, 2017 at 4:31 PM, showerst wrote:
That was my thought; we could either set an in-stone format (have to think through optional times and timezones!), or just use a datetime parser.
scrape nc --bills --start_date "2017-01-01"
scrape nc --bills --start_date="2017-01-01 09:22" --end_date "2017-01-01 13:43"
For context, I'm really thinking about this to enable scrapes run on a cron job to be incremental and fast, and to reduce load on the jurisdiction sites/APIs.
Obviously it should be something a human can type, but I'm less concerned about the exact format. We could always go with ISO 8601 ("2008-09-15T15:53:00") if ambiguity was a concern.
|
@jamesturk oh sweet. I'll just table this until the weekend, sounds like it might turn out to be easy-ish. If it's something that we roll with I'd want to standardize it though so it's not just me tossing arguments everywhere. One additional thought: Some jurisdictions do allow date filters, but limit the resolution to days. IMHO it should be up to the individual scraper writers to handle that case, but we should standardize whether that's a warning and it just scrapes with times truncated, or an error. |
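The warning-versus-error choice for day-resolution sources might look something like this. It is a sketch only: the helper name and the `strict` flag are hypothetical, and nothing here reflects an agreed standard.

```python
import warnings
from datetime import datetime


def to_day_resolution(value, strict=False):
    """Drop the time part for sources that can only filter by whole days."""
    truncated = value.replace(hour=0, minute=0, second=0, microsecond=0)
    if truncated != value:
        message = (
            "source only supports day resolution; "
            f"dropping time from {value.isoformat()}"
        )
        if strict:
            # "Error" behaviour: refuse to silently widen the range.
            raise ValueError(message)
        # "Warning" behaviour: carry on with the truncated value.
        warnings.warn(message)
    return truncated
```

Truncating toward midnight suits a start date, since it only widens the range. An end date would probably need to round up to the end of the day instead, so that changes made later that day are not missed.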
@jamesturk could you describe how to pass an arbitrary variable into a scraper? Use case is as above: I want to pass a date and then use it in the bill scraper. |
If the scrape() method takes a parameter 'foo', it can be passed in on the command line as foo=bar.
Let me know if that works.
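To illustrate the idea, here is a minimal sketch of how key=value arguments could be mapped onto an optional `scrape()` parameter. This is not pupa's actual argument handling: `parse_scrape_args` and `BillScraper` are hypothetical stand-ins, and values arrive as plain strings, so the scraper would still need to parse dates itself.

```python
import inspect


def parse_scrape_args(scrape_func, raw_args):
    """Turn ['foo=bar', ...] into keyword arguments that scrape_func accepts."""
    accepted = inspect.signature(scrape_func).parameters
    kwargs = {}
    for raw in raw_args:
        key, sep, value = raw.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {raw!r}")
        if key not in accepted:
            raise ValueError(f"scrape() has no parameter {key!r}")
        kwargs[key] = value
    return kwargs


class BillScraper:
    """Hypothetical scraper; start_date is optional so a bare scrape still works."""

    def scrape(self, start_date=None):
        yield (f"bills changed since {start_date}" if start_date else "all bills")
```

For example, `parse_scrape_args(BillScraper().scrape, ["start_date=2017-01-01"])` produces `{"start_date": "2017-01-01"}`, which can then be passed as `scrape(**kwargs)`.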
|
Just closing this old bug. The local scrapers take a 'window' argument, so I'll adopt that if we ever do this. Thanks! |