
Allow people to resume scrapers that have experienced intermittent errors #79

Closed
ghost opened this issue Jun 20, 2018 · 12 comments
Labels: discussion, framework (Relating to other common functionality)

ghost commented Jun 20, 2018

Old title: Armenia download died after downloading 2316 files with a generic ["Connection error"] error.

Found during #100 work

The URL - https://armeps.am/ocds/release?limit=100&offset=1519400579553 - opens fine in a browser - maybe just a temp glitch?

ghost commented Jun 21, 2018

I manually edited the SQLite database and cleared the error and fetch_finished_datetime. I then ran it again; it picked up fine and is now on file 2329. So it was just an intermittent error.

So this issue could become: how can we stop intermittent errors from killing a download entirely?

We already retry several times, but maybe we need more retries, or more sleep between them? Or maybe, once an operator has examined the problem, a tool that lets them say "I think that was intermittent; just clear all fetch errors and try again"?
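The manual fix described above can be sketched as a small script. This is a hypothetical sketch: the table and column names (`file_status`, `fetch_errors`, `fetch_finished_datetime`) are assumptions based on this comment, not the actual Kingfisher schema.

```python
import sqlite3


def clear_fetch_errors(db_path):
    """Clear recorded fetch errors so a re-run will retry those files.

    Assumes a hypothetical schema with a file_status table; adjust the
    table and column names to the real metadata database.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE file_status "
            "SET fetch_errors = NULL, fetch_finished_datetime = NULL "
            "WHERE fetch_errors IS NOT NULL"
        )
    conn.close()
```

Re-running the normal download afterwards would then retry only the files whose state was cleared.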

robredpath (Contributor) commented

For relatively short-running scrapers, it's probably fine to just start over. For longer-running scrapers, I can see this being a really useful thing to have!

robredpath changed the title from 'Armenia download died after downloading 2316 files with a generic ["Connection error"] error.' to 'Allow people to resume scrapers that have experienced intermittent errors' Jun 22, 2018

tian2992 commented Jun 22, 2018

@yolile also commented about this:

I think a different approach would require the digestor to extract items one at a time, with each item then being processed (inserted, validated, etc.) individually. That seems like a significant refactor: it would enable even more parallelism, but also introduce more complexity. If it were to be considered, it would merit its own issue, of course.
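A rough sketch of that per-item approach, assuming hypothetical `store` and `validate` callables (all names here are illustrative, not Kingfisher's actual API): the digestor yields items one at a time, and each is processed independently, so one failure does not kill the whole run.

```python
def digest(fetch_pages):
    """Yield items one at a time from fetched pages (illustrative names)."""
    for page in fetch_pages:
        for item in page["releases"]:
            yield item


def process_all(fetch_pages, store, validate):
    """Validate and store each item, recording failures instead of aborting."""
    errors = []
    for item in digest(fetch_pages):
        try:
            validate(item)
            store(item)
        except Exception as exc:  # record the failure and continue
            errors.append((item.get("id"), exc))
    return errors
```

Failed items end up in the returned error list, which could later feed a retry pass instead of forcing a full re-run.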

robredpath (Contributor) commented

I think the priority here is making sure that we're able to re-run scrapers and have them just 'fill in the gaps' if there was some kind of issue with the system that they're talking to. From the analyst's perspective, getting the data loaded and being confident that it's a true reflection of what the API/other system is serving are the important things. Not having to babysit too much is also important.

If we can make them faster (eg through more parallelism) for a relatively low cost at the same time, I'm happy for that to be in scope, but not if it's a large job.

ghost commented Sep 6, 2018

For this, I'm envisioning a new subcommand.

It would take a run that is currently in a fetch stage with some errors, and clear those errors from the metadata database.

The operator would run the new command to clear the errors, and then run the normal run command again to retry.

P.S. Parallelism is a different issue.
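A minimal sketch of what such a subcommand could look like. The command name (`clear-fetch-errors`), the table, and the column names are all assumptions for illustration, not the real Kingfisher CLI or schema.

```python
import argparse
import sqlite3


def main(argv=None):
    """Hypothetical CLI: clear fetch errors so a run can be retried."""
    parser = argparse.ArgumentParser(prog="kingfisher")
    subparsers = parser.add_subparsers(dest="command", required=True)

    clear = subparsers.add_parser(
        "clear-fetch-errors",
        help="Clear fetch errors from the metadata database",
    )
    clear.add_argument("database", help="Path to the metadata SQLite database")

    args = parser.parse_args(argv)

    if args.command == "clear-fetch-errors":
        conn = sqlite3.connect(args.database)
        with conn:
            cursor = conn.execute(
                "UPDATE file_status "
                "SET fetch_errors = NULL, fetch_finished_datetime = NULL "
                "WHERE fetch_errors IS NOT NULL"
            )
        print(f"Cleared errors on {cursor.rowcount} files")
        conn.close()
```

The operator would then re-run the normal run command, which would retry only the cleared files.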

jpmckinney commented Sep 6, 2018

Why a new subcommand and not an option on the run subcommand?

For the use case of, "the run is broken – please just let me reset and start over", I can see the sense of a new subcommand (though, it can also just be an option on the run subcommand). This is separate from other use cases like, "I limited the initial run to 1,000 pages of API results, and I now want to resume from page 1001" or "I had to stop the run for some reason (closed my laptop, lost internet connection, etc.), and I now want to resume where it left off." In those cases, it seems better to add options to the run subcommand, since resuming a download is not fundamentally different from running a download (from scratch) in a user's conceptual framework.

The title of this issue is about 'resuming', but I think it's more accurately about 'starting over', in which case a separate 'reset' (or similar) subcommand might be a solution. We should probably rename this issue and split out 'resuming' into a separate issue (I haven't checked if one already exists).

ghost unassigned tian2992 Nov 9, 2018
jpmckinney (Member) commented

Is this possible in Scrapy?

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant Jan 31, 2019
jpmckinney added the old-kingfisher and framework (Relating to other common functionality) labels and removed the old-kingfisher label Jan 31, 2019
jpmckinney (Member) commented

@odscjames @yolile With Scrapy, is it possible to resume scrapers that have experienced intermittent errors, or is this still an issue that needs to be resolved?

ghost commented Jun 12, 2019

The use case here is runs that have errors, where we try again a day or so later to see whether those errors have cleared.

We would need to look into how that would work with Scrapy; it isn't something that works currently.

Some of our spiders already do things like this manually - https://github.com/open-contracting/kingfisher-scrape/blob/master/kingfisher_scrapy/spiders/colombia.py#L32 - but we should look into it.
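For what it's worth, Scrapy does ship two related mechanisms: the built-in RetryMiddleware retries requests that fail with transient errors within a run, and setting `JOBDIR` persists scheduler state so an interrupted crawl can be resumed. The setting names below are real Scrapy settings; the values are illustrative.

```python
# settings.py - retry transient failures automatically within a run.
RETRY_ENABLED = True
RETRY_TIMES = 5  # Scrapy's default is 2; raise it for flaky sources
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Separately, setting JOBDIR persists the scheduler queue and the
# duplicate-request filter, so an interrupted crawl can be resumed:
#
#     scrapy crawl some_spider -s JOBDIR=crawls/some_spider-1
```

Note that `JOBDIR` resumes an interrupted crawl; it does not by itself re-request URLs that already failed, which is the "try again in a day" use case above.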

jpmckinney commented Dec 2, 2019

Assuming the source has a publication pattern that allows resuming:

If the source takes a long time to collect or contains a lot of data, another use case for resuming is to get any new releases since the last collection. This is relevant to a (non-helpdesk) data analyst who is working with the same source(s) over long periods of time and/or at frequent intervals (e.g. daily), and who doesn't want or can't store multiple copies of the same data.

This makes a critical assumption: old releases aren't changed or deleted (this is required by the standard, but a source can be nonconformant).

If a second assumption holds – that only new releases (i.e. those with a more recent date) are added over time – then instead of re-compiling all releases, the previously compiled releases can simply be updated with the new releases (this can be done by putting the compiled release as the first release in the list before merging).
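The update strategy in the last paragraph can be sketched as follows. This is a deliberately simplified top-level field merge, not the full OCDS merge routine (which handles nested objects and identifier-based array merging); the point is only the ordering trick of putting the previously compiled release first.

```python
def update_compiled_release(compiled, new_releases):
    """Update a previously compiled release with newer releases.

    Simplified illustration: the compiled release goes first, then the new
    releases in date order, and later values overwrite earlier ones.
    """
    ordered = [compiled] + sorted(new_releases, key=lambda r: r["date"])
    result = {}
    for release in ordered:
        result.update(release)
    return result
```

This avoids re-compiling every release from scratch, at the cost of relying on both assumptions above holding for the source.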

jpmckinney (Member) commented

Split into two issues above.
