Incremental updates #169

Open
fgregg opened this issue Apr 21, 2015 · 17 comments

fgregg commented Apr 21, 2015

Chicago has a lot of legislation and long legislative sessions (4 years). This means that it currently takes about 48 hours to scrape the site every week.

There are some strategies for smarter scraping, but they all require that the scraper know, to some degree, what's already in the database.

Right now, the scraper doesn't know anything about the database, and that seems like it has been a good and sane thing. How should we proceed?

One idea is for the scraper to maybe hit the OCD API? Is that a looser coupling? Thoughts please!


paultag commented Apr 21, 2015

We've been calling this 'incremental' updates internally.

CC'ing @boblannon


fgregg commented Apr 21, 2015

Some context from opencivicdata/scrapers-us-municipal#25 (comment) about how we could scrape smarter:

Two ideas:

1. We could leverage the Legistar API, which has a last-updated search parameter. But we would still need to know when we last scraped (which pupa does not currently seem to have a way to know).
2. On the website, the search results seem to be returned in last-updated order. We could stop scraping once the order of a window of 100 bills on the website matches a window of 100 bills in the DB (sketched below). This seems like the most generic solution, since other cities do not have a Legistar API.
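
A rough sketch of what that second idea could look like, assuming last-updated ordering on both sides. `site_bill_ids` and `db_bill_ids` are hypothetical iterables of identifiers here, not a real pupa or Legistar API:

```python
from collections import deque

WINDOW = 100

def bills_to_scrape(site_bill_ids, db_bill_ids):
    """Yield bill ids from the site, in last-updated order, stopping once a
    full window of them lines up with the same window already in the DB."""
    db_ids = list(db_bill_ids)
    window = deque(maxlen=WINDOW)

    for position, bill_id in enumerate(site_bill_ids):
        window.append(bill_id)
        if len(window) == WINDOW:
            start = position - WINDOW + 1
            if list(window) == db_ids[start:start + WINDOW]:
                # Both sides agree on this whole window, so everything
                # further down the last-updated ordering is unchanged.
                return
        yield bill_id
```

The scraper would then only fetch detail pages for the ids this generator yields; at worst it re-visits one extra window of already-known bills.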

paultag added this to the singularity milestone Apr 21, 2015
fgregg changed the title from Smart updates to Incremental updates Apr 21, 2015
boblannon commented

The way I've been doing this on my fork was to do my best to make scrapes idempotent. The goal was for a scrape to be all no-ops if an identical scrape had already been done.

I did it mostly by narrowing the get_object calls using an object's source. This happens if the object's (new) source_identified property is True. I've done this for...

It works pretty well, even when I re-run a scrape after having merged people or organizations.
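
A loose sketch of that narrowing (the field names and the `source_identified` flag paraphrase the comment above; this is not the fork's actual code):

```python
def get_object(bill_data, queryset):
    """Look up the existing DB row for a scraped bill.

    `bill_data` is the scraped dict and `queryset` a Django queryset for
    the OCD Bill model; the field names are illustrative only.
    """
    if bill_data.get('source_identified'):
        # The scraped object's sources uniquely identify it, so match on
        # source URL alone; re-importing an identical scrape then resolves
        # to the same row and becomes a no-op.
        urls = [source['url'] for source in bill_data['sources']]
        return queryset.get(sources__url__in=urls)

    # Otherwise fall back to the broader natural-key lookup.
    return queryset.get(
        legislative_session__identifier=bill_data['legislative_session'],
        identifier=bill_data['identifier'],
    )
```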


paultag commented Apr 21, 2015

FWIW, stock Pupa scrapes are more idempotent than incremental scrapes, since we don't have to rely on carrying state between runs, rather than full rescrapes, to ensure the end state is correct.


fgregg commented Apr 21, 2015

I don't see how that importer code will reduce the number of pages the scraper will visit. Sorry if I'm being dense.


paultag commented Apr 21, 2015

@fgregg because the importer needs to be able to take negative actions too, so we do need to know what the database should look like.

This is a simple one that's easily solved, but take for example documents attached to a bill -- if we only scrape some of the related documents, we can't tell whether to remove a document, since it might be missing because we didn't scrape it, or because it's actually gone.

We can scale that back to full collections, too. That's just an example, not the actual issue.
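
To make the document example concrete, here is a hypothetical importer-side helper (not pupa code) showing why the decision is only safe against a full scrape:

```python
def documents_to_delete(stored_urls, scraped_urls, full_scrape):
    """Decide which stored document URLs can be removed after an import.

    With a full scrape, anything stored but not scraped really is gone
    upstream. With a partial scrape, absence is ambiguous: the document may
    simply not have been visited, so nothing can safely be deleted.
    """
    if full_scrape:
        return set(stored_urls) - set(scraped_urls)
    return set()
```

For example, `documents_to_delete({'a.pdf', 'b.pdf'}, {'a.pdf'}, full_scrape=False)` returns nothing, even though `b.pdf` may well have been removed upstream.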


paultag commented Apr 21, 2015

I've wanted to do this for years, and I have states I could do this on, so it hasn't gone unimplemented for lack of wanting the feature, is all I'm saying.


fgregg commented Apr 21, 2015

I think there are two issues here:

  1. How to have the scraper only scrape new or updated pages.
  2. Only update the parts of the DB that need updating.

This code seems to be about 2, but I'm talking about 1. These issues are obviously connected, but not identical. Having the importer know about the DB makes a lot of sense; it already has to.

But for 1, it seems like the scraper also has to know some facts from the DB, and this is what I haven't seen before.


paultag commented Apr 21, 2015

It's the same issue internally, 1 and 2. Passing something like the last scraped time is trivial; that's not the technical issue behind the end behavior you're after. The real issue is 2, not 1.


fgregg commented Apr 21, 2015

Oh, okay, so there's no problem with having the scraper access the DB?


paultag commented Apr 21, 2015

Also, FWIW, the scraper does not currently know about the DB in any way; it just writes JSON to disk. Decoupling like that has been really great for us in the past.


paultag commented Apr 21, 2015

If pupa knows it's doing an import, it can talk to the DB. The scraper must never talk to the DB under any conditions.


fgregg commented Apr 21, 2015

so how can I pass the last-scraped-time to the scraper?


paultag commented Apr 21, 2015

Pupa, if it sees it's doing an import, brings up the Django connection before scrape, and can handle that. The scraper shouldn't.
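
A hypothetical sketch of that flow; none of this is current pupa behavior, `latest_import_time` is an invented placeholder for whatever query the import side would run, and `do_scrape` stands in for however pupa actually invokes the scraper:

```python
from datetime import datetime, timedelta

def latest_import_time(jurisdiction_id):
    """Placeholder: ask the Django models the import step already uses for
    the end time of the last successful run. Hard-coded for this sketch."""
    return datetime.utcnow() - timedelta(days=7)

def run_scrape(scraper, jurisdiction_id, **kwargs):
    # The command layer, which may talk to the DB when it is also importing,
    # resolves the timestamp and hands it to the scraper as plain data; the
    # scraper itself never opens a database connection.
    kwargs['last_scraped'] = latest_import_time(jurisdiction_id)
    return scraper.do_scrape(**kwargs)
```

On the scraper side, `scrape()` would then accept `last_scraped` as an ordinary keyword argument and skip anything older, much like other scrape arguments are passed in today.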


fgregg commented Apr 21, 2015

okay, so where do I write the query that I want pupa to pass to the scraper?


paultag commented Apr 21, 2015


fgregg commented Apr 23, 2015

K. Should this be split into an incremental DB update issue and an issue for having pupa give info like last_scraped to the scraper?
