Agenda items not preserved when two scrapers collect same event #239

Closed
patcon opened this issue Jun 5, 2016 · 5 comments

Comments

@patcon
Contributor

patcon commented Jun 5, 2016

Reticketed from opencivicdata/scrapers-us-municipal#111


I have one scraper that scrapes all events scheduled in the legislative session, and another that runs to collect details about recently past and upcoming events (for which agendas are just being published).

So essentially:

  1. events-incremental. quick nightly scrape that builds events with agendas around the current date, and
  2. events-full. another scraper for the full schedule, that doesn't touch agendas.

I am seeing that if events-full runs after events-incremental it blows away all the agendas.

A Slack conversation with @paultag led me to believe that this shouldn't be happening if no agenda items are added or touched.

However, I've just built a simple scraper and reproduced the wiping behaviour:
https://github.com/patcon/scrapers-us-municipal/tree/agenda-wipe-bug-demo

If I run the below code, the agenda items appear, and then disappear after running the second scrape:

pupa update test events_has_agenda
pupa update test events_no_agenda

Can someone confirm that this is unexpected behaviour, and perhaps suggest any ideas on where the regression may have happened? Thanks! :)
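
For readers without the demo branch checked out, the symptom can be modelled in a few lines. This is only a toy sketch: `db` and `import_event` are hypothetical stand-ins, not pupa's importer or the Open Civic Data schema, and it assumes the importer rebuilds an event's related rows from whatever the latest scrape supplied.

```python
# Toy model only: `db` and `import_event` are hypothetical stand-ins,
# not pupa's importer or the Open Civic Data schema.

db = {}  # event name -> stored event dict


def import_event(scraped):
    """Store a scraped event, replacing its related agenda rows wholesale."""
    stored = db.setdefault(scraped["name"], {"name": scraped["name"], "agenda": []})
    # Assumption: related rows are rebuilt from whatever the scrape supplied,
    # so a scrape that supplies nothing leaves nothing behind.
    stored["agenda"] = list(scraped.get("agenda", []))
    return stored


# First run: the incremental scraper supplies agenda items.
import_event({"name": "Council Meeting", "agenda": ["Call to order", "Budget vote"]})
after_incremental = list(db["Council Meeting"]["agenda"])

# Second run: the full-schedule scraper emits the same event with no agenda.
import_event({"name": "Council Meeting"})
after_full = list(db["Council Meeting"]["agenda"])

print(after_incremental)
print(after_full)
```

Under that assumption, the second import leaves the event with an empty agenda, which matches what the two `pupa update` runs show.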

@jpmckinney
Member

I think this is a dup of #169

@patcon
Contributor Author

patcon commented Feb 5, 2017

I don't think they're dups, James. This one is about unexpected behavior where rows from opencivicdata_eventagendaitems are removed/detached from events when a second scrape (with no agenda item data) is run. (It's been a while, so I'm a little rusty on details, but that's the observation.)

The linked issue seems to be about a larger problem: not doing more scraping work than required, by reading from the database during the scrape process. I don't think its solution would resolve this issue's concern.

@jpmckinney
Member

Hmm, perhaps import_item is what needs to change to resolve this issue.

@jpmckinney jpmckinney reopened this Feb 5, 2017
@jpmckinney
Member

jpmckinney commented Feb 5, 2017

Actually, re-read the other issue, and it seems relevant, like this comment.

The importer can't determine whether the scrape with no agenda items lacks them because it's an intentionally incomplete object, or because those agenda items were removed in the real world.

paultag's comments in that Slack conversation were too brief to read much into. It seems to me that this is behaving as designed. I think the resolution would be for you to adjust Pupa's logic for your special case. Basically, you'd have to skip this logic if the data for the related model (in this case, the agenda) is empty.

I suppose a generic solution would be to have a flag on the data object to indicate which related_models should have the default 'clear everything' behavior if empty, and which should have the 'do nothing' behavior that you desire. I'm not sure there's appetite to maintain that additional complexity.
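
To make the flag idea concrete, here is a minimal sketch of what such a switch might look like. Nothing in it is pupa's API: `import_related`, the `preserve_if_empty` parameter, and the dict layout are all invented for illustration, and the real importer works against Django models rather than plain dicts.

```python
# Hypothetical sketch: `import_related`, `preserve_if_empty`, and the dict
# layout are invented for illustration and are not part of pupa.

def import_related(stored, scraped, related_models, preserve_if_empty=frozenset()):
    """Update each related collection on `stored` from `scraped`.

    Default: replace the stored rows with the scraped rows, even if empty.
    Flagged: leave the stored rows alone when the scrape supplies none.
    """
    for field in related_models:
        incoming = scraped.get(field) or []
        if not incoming and field in preserve_if_empty:
            continue  # 'do nothing' behaviour for flagged related models
        stored[field] = list(incoming)  # default 'clear everything' behaviour
    return stored


stored_event = {
    "name": "Council Meeting",
    "agenda": ["Call to order", "Budget vote"],
    "participants": ["Clerk"],
}
scraped_event = {"name": "Council Meeting"}  # full-schedule scrape: no related data

result = import_related(
    stored_event,
    scraped_event,
    related_models=("agenda", "participants"),
    preserve_if_empty={"agenda"},
)
print(result["agenda"])
print(result["participants"])
```

In this sketch the flagged `agenda` list survives the empty scrape while the unflagged `participants` list is cleared. The cost is the ambiguity described above: with the flag set, an agenda that really was emptied at the source would never be cleared.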

@patcon
Contributor Author

patcon commented Feb 5, 2017

Ah ok, I'll admit that after the paultag convo I was optimistic that this was a "simple" regression and the feature already existed, but I realize from the comment you linked that you're totally correct. Thanks for humouring me, James!

@patcon patcon closed this as completed Feb 5, 2017