
Allow people to resume scrapers that have experienced intermittent errors #79

Closed
ghost opened this issue Jun 20, 2018 · 12 comments
Labels: discussion, framework (Relating to other common functionality)

ghost commented Jun 20, 2018

Old title: Armenia download died after downloading 2316 files with a generic ["Connection error"] error.

Found during #100 work

The URL - https://armeps.am/ocds/release?limit=100&offset=1519400579553 - opens fine in a browser - maybe just a temp glitch?

ghost commented Jun 21, 2018

I manually edited the SQLite database and cleared the error and fetch_finished_datetime. I then ran it again; it picked up fine and is now on file 2329. So it was just an intermittent error.

So this issue could become: how can we stop intermittent errors from killing a download entirely?

We already retry several times, but maybe we need more retries, or more sleep between them? Or maybe, once an operator has examined the problem, a tool that lets them say "I think that was intermittent; just clear all fetch errors and try again"?
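The manual fix described above can be sketched as a small script. This is a hypothetical sketch: the table and column names (`file_status`, `fetch_errors`, `fetch_finished_datetime`) are assumptions based on this comment, not the actual Kingfisher schema.

```python
import sqlite3


def clear_fetch_errors(db_path):
    """Clear recorded fetch errors so a re-run will retry those files.

    Assumes a hypothetical schema with a file_status table; adjust the
    table and column names to the real metadata database.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE file_status "
            "SET fetch_errors = NULL, fetch_finished_datetime = NULL "
            "WHERE fetch_errors IS NOT NULL"
        )
    conn.close()
```

Re-running the normal download afterwards would then retry only the files whose state was cleared.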

robredpath (Contributor) commented

For relatively short-running scrapers, it's probably fine to just start over. For longer-running scrapers, I can see this being a really useful thing to have!

robredpath changed the title from 'Armenia download died after downloading 2316 files with a generic ["Connection error"] error.' to 'Allow people to resume scrapers that have experienced intermittent errors' Jun 22, 2018

tian2992 commented Jun 22, 2018

@yolile also commented about this:

I think a different approach would require the digestor to extract items one at a time, with each item then being processed (inserted, validated, etc.) individually. That seems like a significant refactor: it would enable even more parallelism, but also introduce more complexity. If it were to be considered, it would merit its own issue, of course.
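A rough sketch of that per-item approach, assuming hypothetical `store` and `validate` callables (all names here are illustrative, not Kingfisher's actual API): the digestor yields items one at a time, and each is processed independently, so one failure does not kill the whole run.

```python
def digest(fetch_pages):
    """Yield items one at a time from fetched pages (illustrative names)."""
    for page in fetch_pages:
        for item in page["releases"]:
            yield item


def process_all(fetch_pages, store, validate):
    """Validate and store each item, recording failures instead of aborting."""
    errors = []
    for item in digest(fetch_pages):
        try:
            validate(item)
            store(item)
        except Exception as exc:  # record the failure and continue
            errors.append((item.get("id"), exc))
    return errors
```

Failed items end up in the returned error list, which could later feed a retry pass instead of forcing a full re-run.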

robredpath (Contributor) commented

I think the priority here is making sure that we're able to re-run scrapers and have them just 'fill in the gaps' if there was some kind of issue with the system that they're talking to. From the analyst's perspective, getting the data loaded and being confident that it's a true reflection of what the API/other system is serving are the important things. Not having to babysit too much is also important.

If we can make them faster (eg through more parallelism) for a relatively low cost at the same time, I'm happy for that to be in scope, but not if it's a large job.

ghost commented Sep 6, 2018

For this, I'm envisioning a new subcommand.

It would take a run that is currently in a fetch stage with some errors, and clear those errors from the metadata database.

The operator would run the new command to clear the errors, and then run the normal run command again to retry.

P.S. Parallelism is a different issue.
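A minimal sketch of what such a subcommand could look like. The command name (`clear-fetch-errors`), the table, and the column names are all assumptions for illustration, not the real Kingfisher CLI or schema.

```python
import argparse
import sqlite3


def main(argv=None):
    """Hypothetical CLI: clear fetch errors so a run can be retried."""
    parser = argparse.ArgumentParser(prog="kingfisher")
    subparsers = parser.add_subparsers(dest="command", required=True)

    clear = subparsers.add_parser(
        "clear-fetch-errors",
        help="Clear fetch errors from the metadata database",
    )
    clear.add_argument("database", help="Path to the metadata SQLite database")

    args = parser.parse_args(argv)

    if args.command == "clear-fetch-errors":
        conn = sqlite3.connect(args.database)
        with conn:
            cursor = conn.execute(
                "UPDATE file_status "
                "SET fetch_errors = NULL, fetch_finished_datetime = NULL "
                "WHERE fetch_errors IS NOT NULL"
            )
        print(f"Cleared errors on {cursor.rowcount} files")
        conn.close()
```

The operator would then re-run the normal run command, which would retry only the cleared files.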

jpmckinney commented Sep 6, 2018

Why a new subcommand and not an option on the run subcommand?

For the use case of, "the run is broken – please just let me reset and start over", I can see the sense of a new subcommand (though, it can also just be an option on the run subcommand). This is separate from other use cases like, "I limited the initial run to 1,000 pages of API results, and I now want to resume from page 1001" or "I had to stop the run for some reason (closed my laptop, lost internet connection, etc.), and I now want to resume where it left off." In those cases, it seems better to add options to the run subcommand, since resuming a download is not fundamentally different from running a download (from scratch) in a user's conceptual framework.

The title of this issue is about 'resuming', but I think it's more accurately about 'starting over', in which case a separate 'reset' (or similar) subcommand might be a solution. We should probably rename this issue and split out 'resuming' into a separate issue (I haven't checked if one already exists).

ghost unassigned tian2992 Nov 9, 2018
jpmckinney (Member) commented

Is this possible in Scrapy?

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant Jan 31, 2019
jpmckinney added the old-kingfisher and framework (Relating to other common functionality) labels and removed the old-kingfisher label Jan 31, 2019
jpmckinney (Member) commented

@odscjames @yolile With Scrapy, is it possible to resume scrapers that have experienced intermittent errors, or is this still an issue that needs to be resolved?

ghost commented Jun 12, 2019

The use case here is runs that have errors, where we try again a day or so later to see whether those errors have cleared.

We would need to look into how that would work with Scrapy; it isn't something that works currently.

Some of our spiders already do things like this manually - https://github.com/open-contracting/kingfisher-scrape/blob/master/kingfisher_scrapy/spiders/colombia.py#L32 - but we should look into it.
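For what it's worth, Scrapy does ship two related mechanisms: the built-in RetryMiddleware retries requests that fail with transient errors within a run, and setting `JOBDIR` persists scheduler state so an interrupted crawl can be resumed. The setting names below are real Scrapy settings; the values are illustrative.

```python
# settings.py - retry transient failures automatically within a run.
RETRY_ENABLED = True
RETRY_TIMES = 5  # Scrapy's default is 2; raise it for flaky sources
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Separately, setting JOBDIR persists the scheduler queue and the
# duplicate-request filter, so an interrupted crawl can be resumed:
#
#     scrapy crawl some_spider -s JOBDIR=crawls/some_spider-1
```

Note that `JOBDIR` resumes an interrupted crawl; it does not by itself re-request URLs that already failed, which is the "try again in a day" use case above.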

jpmckinney commented Dec 2, 2019

Assuming the source has a publication pattern that allows resuming:

If the source takes a long time to collect or contains a lot of data, another use case for resuming is to get any new releases since the last collection. This is relevant to a (non-helpdesk) data analyst who is working with the same source(s) over long periods of time and/or at frequent intervals (e.g. daily), and who doesn't want or can't store multiple copies of the same data.

This makes a critical assumption: old releases aren't changed or deleted (this is required by the standard, but a source can be nonconformant).

If a second assumption holds – that only new releases (i.e. those with a more recent date) are added over time – then instead of re-compiling all releases, the previously compiled releases can simply be updated with the new releases (this can be done by putting the compiled release as the first release in the list before merging).
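The update strategy in the last paragraph can be sketched as follows. This is a deliberately simplified top-level field merge, not the full OCDS merge routine (which handles nested objects and identifier-based array merging); the point is only the ordering trick of putting the previously compiled release first.

```python
def update_compiled_release(compiled, new_releases):
    """Update a previously compiled release with newer releases.

    Simplified illustration: the compiled release goes first, then the new
    releases in date order, and later values overwrite earlier ones.
    """
    ordered = [compiled] + sorted(new_releases, key=lambda r: r["date"])
    result = {}
    for release in ordered:
        result.update(release)
    return result
```

This avoids re-compiling every release from scratch, at the cost of relying on both assumptions above holding for the source.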

jpmckinney (Member) commented

Split into two issues above.
