Incremental updates #169

Open
fgregg opened this issue Apr 21, 2015 · 17 comments

fgregg commented Apr 21, 2015

Chicago has a lot of legislation and long legislative sessions (4 years). This means that it currently takes about 48 hours to scrape the site every week.

There are some strategies for smarter scraping, but they all require that the scraper know, to some degree, what's already in the database.

Right now, the scraper doesn't know anything about the database, and that seems like it has been a good and sane thing. How should we proceed?

One idea is for the scraper to maybe hit the OCD API? Is that a looser coupling? Thoughts please!


paultag commented Apr 21, 2015

We've been calling this 'incremental' updates internally.

CC'ing @boblannon


fgregg commented Apr 21, 2015

Some context from opencivicdata/scrapers-us-municipal#25 (comment) about how we could scrape smarter:

Two ideas:

1. We could leverage the Legistar API, which has a last-updated search parameter. But we would still need to know when we last scraped (which pupa does not currently seem to have a way to know).
2. On the website, the search results seem to be returned in last-updated order. We could stop scraping once the order of a window of 100 bills on the website matches a window of 100 bills in the DB (sketched below). This seems like the most generic solution, since other cities do not have a Legistar API.
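
A rough sketch of what that second idea could look like, assuming last-updated ordering on both sides. `site_bill_ids` and `db_bill_ids` are hypothetical iterables of identifiers here, not a real pupa or Legistar API:

```python
from collections import deque

WINDOW = 100

def bills_to_scrape(site_bill_ids, db_bill_ids):
    """Yield bill ids from the site, in last-updated order, stopping once a
    full window of them lines up with the same window already in the DB."""
    db_ids = list(db_bill_ids)
    window = deque(maxlen=WINDOW)

    for position, bill_id in enumerate(site_bill_ids):
        window.append(bill_id)
        if len(window) == WINDOW:
            start = position - WINDOW + 1
            if list(window) == db_ids[start:start + WINDOW]:
                # Both sides agree on this whole window, so everything
                # further down the last-updated ordering is unchanged.
                return
        yield bill_id
```

The scraper would then only fetch detail pages for the ids this generator yields; at worst it re-visits one extra window of already-known bills.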

paultag added this to the singularity milestone Apr 21, 2015
fgregg changed the title from Smart updates to Incremental updates Apr 21, 2015
boblannon commented

The way I've been doing this on my fork was to do my best to make scrapes idempotent. The goal was for a scrape to be all no-ops if an identical scrape had already been done.

I did it mostly by narrowing the get_object calls using an object's source. This happens if the object's (new) source_identified property is True. I've done this for...

It works pretty well, even when I re-run a scrape after having merged people or organizations.
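
A loose sketch of that narrowing (the field names and the `source_identified` flag paraphrase the comment above; this is not the fork's actual code):

```python
def get_object(bill_data, queryset):
    """Look up the existing DB row for a scraped bill.

    `bill_data` is the scraped dict and `queryset` a Django queryset for
    the OCD Bill model; the field names are illustrative only.
    """
    if bill_data.get('source_identified'):
        # The scraped object's sources uniquely identify it, so match on
        # source URL alone; re-importing an identical scrape then resolves
        # to the same row and becomes a no-op.
        urls = [source['url'] for source in bill_data['sources']]
        return queryset.get(sources__url__in=urls)

    # Otherwise fall back to the broader natural-key lookup.
    return queryset.get(
        legislative_session__identifier=bill_data['legislative_session'],
        identifier=bill_data['identifier'],
    )
```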


paultag commented Apr 21, 2015

FWIW, stock Pupa scrapes are more idempotent than incremental scrapes, since we don't have to rely on carrying state between runs, rather than full rescrapes, to ensure the end state is correct.


fgregg commented Apr 21, 2015

I don't see how that importer code will reduce the number of pages the scraper will visit. Sorry if I'm being dense.


paultag commented Apr 21, 2015

@fgregg because the importer needs to be able to take negative actions too, so we do need to know what the database should look like.

This is a simple one that's easily solved, but take for example documents attached to a bill -- if we only scrape some of the related documents, we can't tell whether to remove a document, since it might be missing because we didn't scrape it, or because it's actually gone.

We can scale that back to full collections, too. That's just an example, not the actual issue.
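
To make the document example concrete, here is a hypothetical importer-side helper (not pupa code) showing why the decision is only safe against a full scrape:

```python
def documents_to_delete(stored_urls, scraped_urls, full_scrape):
    """Decide which stored document URLs can be removed after an import.

    With a full scrape, anything stored but not scraped really is gone
    upstream. With a partial scrape, absence is ambiguous: the document may
    simply not have been visited, so nothing can safely be deleted.
    """
    if full_scrape:
        return set(stored_urls) - set(scraped_urls)
    return set()
```

For example, `documents_to_delete({'a.pdf', 'b.pdf'}, {'a.pdf'}, full_scrape=False)` returns nothing, even though `b.pdf` may well have been removed upstream.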


paultag commented Apr 21, 2015

I've wanted to do this for years, and I have states I could do this on, so it hasn't gone unimplemented for lack of wanting the feature, is all I'm saying.


fgregg commented Apr 21, 2015

I think there are two issues here:

  1. How to have the scraper only scrape new or updated pages.
  2. Only update the parts of the DB that need updating.

This code seems to be about 2, but I'm talking about 1. These issues are obviously connected, but not identical. Having the importer know about the DB makes a lot of sense; it already has to.

But for 1, it seems like the scraper also has to know some facts from the DB, and this is what I haven't seen before.


paultag commented Apr 21, 2015

It's the same issue internally, 1 and 2. Passing something like the last scraped time is trivial; that's not the technical issue behind the end behavior you're after. The real issue is 2, not 1.


fgregg commented Apr 21, 2015

Oh, okay, so there's no problem with having the scraper access the DB?


paultag commented Apr 21, 2015

Also, FWIW, the scraper does not currently know about the DB in any way; it just writes JSON to disk. Decoupling like that has been really great for us in the past.


paultag commented Apr 21, 2015

If pupa knows it's doing an import, it can talk to the DB. The scraper must never talk to the DB under any conditions.


fgregg commented Apr 21, 2015

so how can I pass the last-scraped-time to the scraper?


paultag commented Apr 21, 2015

Pupa, if it sees it's doing an import, brings up the Django connection before scrape, and can handle that. The scraper shouldn't.
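
A hypothetical sketch of that flow; none of this is current pupa behavior, `latest_import_time` is an invented placeholder for whatever query the import side would run, and `do_scrape` stands in for however pupa actually invokes the scraper:

```python
from datetime import datetime, timedelta

def latest_import_time(jurisdiction_id):
    """Placeholder: ask the Django models the import step already uses for
    the end time of the last successful run. Hard-coded for this sketch."""
    return datetime.utcnow() - timedelta(days=7)

def run_scrape(scraper, jurisdiction_id, **kwargs):
    # The command layer, which may talk to the DB when it is also importing,
    # resolves the timestamp and hands it to the scraper as plain data; the
    # scraper itself never opens a database connection.
    kwargs['last_scraped'] = latest_import_time(jurisdiction_id)
    return scraper.do_scrape(**kwargs)
```

On the scraper side, `scrape()` would then accept `last_scraped` as an ordinary keyword argument and skip anything older, much like other scrape arguments are passed in today.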


fgregg commented Apr 21, 2015

okay, so where do I write the query that I want pupa to pass to the scraper?


paultag commented Apr 21, 2015


fgregg commented Apr 23, 2015

K. Should this be split into an incremental DB update issue and an issue for having pupa give info like last_scraped to the scraper?
