
Data retention policy for crawled data #153

Closed
jpmckinney opened this issue May 6, 2020 · 14 comments
Labels
S: kingfisher-collect (Relating to the Kingfisher Collect service)

Comments

jpmckinney (Member) commented May 6, 2020

Follow-up to #155

This data is stored in /home/ocdskfs/scrapyd/data, which is the second largest use of disk space. (The largest is the database; related: open-contracting/kingfisher-process#269)

#154 proposes criteria for archival. Repeating here:

  • clean (relatively few request errors or other spider issues in the log files)
  • complete (i.e. not a test, sample, filtered, empty or unclosed collection)
  • 30 days after the last archived data for the same data source (e.g., if we have multiple collections in the same month, we'd archive only the first)

CRM-5533 has been used to delete collections from the database itself. In that issue, we mainly deleted:

  • Unclosed collections (can take up a lot of disk space)
  • Empty collections (take up very little disk space)

For the remaining collections, we can write code to apply the criteria:

  • Group collections by source (already done on disk).
  • If there is only one collection within a 30-day period, archive it, unless it is empty, unclosed, incomplete, or otherwise has a lot of errors (i.e. too unclean).
  • If there are multiple collections 30 days after the last archived data, take the earliest one, unless it fails the clean or complete criteria.
    • We might want the script to be interactive, giving the user an opportunity to decide which to archive. In that case, the script can report the sizes of collections, to help assess completeness. We'd want to do this at least once a month.
  • Data that is older than 90 days can be deleted. This can be a last step of the script that performs the archival.
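
A minimal sketch of the grouping and 30-day selection, assuming data directories are laid out as <source_id>/<data_version> with data versions like 20200506_123456 (the layout and the clean/complete checks are placeholders):

```python
from datetime import datetime, timedelta
from pathlib import Path

DATA_DIR = Path("/home/ocdskfs/scrapyd/data")


def collections_by_source(data_dir=DATA_DIR):
    """Group collection directories by source (the on-disk layout already does this)."""
    for source_dir in sorted(path for path in data_dir.iterdir() if path.is_dir()):
        yield source_dir.name, sorted(path for path in source_dir.iterdir() if path.is_dir())


def candidates_to_archive(versions, last_archived=None, window=timedelta(days=30)):
    """Yield the earliest collection in each 30-day window since the last archived one.

    Each candidate is still subject to the empty/unclosed/incomplete/unclean checks.
    """
    for version in versions:
        crawled_at = datetime.strptime(version.name, "%Y%m%d_%H%M%S")  # assumed name format
        if last_archived is None or crawled_at - last_archived >= window:
            yield version
            last_archived = crawled_at
```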

Update: This archiving should include the relevant Scrapyd log file. #135

@jpmckinney (Member Author)

yolile (Member) commented May 12, 2020

Some collections, like the Paraguay and Colombia ones, are really big (around 30GB per version), so maybe in these cases we should remove the stored files as soon as they are loaded into the database?

In general, I was wondering if any analyst uses the downloaded files...

@jpmckinney (Member Author)

I'm thinking the process would be:

  1. Download data with Kingfisher Collect
  2. Data is loaded into Kingfisher Process
  3. If the downloaded data meets the archival criteria, it is archived, in case we want to restore it at a future date for analysis. Once it is archived, it can be automatically deleted.
  4. If the downloaded data doesn't meet the archival criteria, it is retained for 90 days and then automatically deleted.
  5. For data that is loaded in Kingfisher Process, we'll have a separate policy, which would involve more analyst participation, since analysts might still need old collections.

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant May 20, 2020
jpmckinney added the S: kingfisher label May 20, 2020
jpmckinney added the S: kingfisher-collect and for: ODSC discussion labels and removed the S: kingfisher and for: ODSC labels May 20, 2020
@jpmckinney (Member Author)

@duncandewhurst @mrshll1001 @romifz Any opinions on the data retention policy for crawled data (data collected by Scrapy – not the data in the database)?

romifz commented Jul 22, 2020

@jpmckinney your proposal sounds good to me. I agree with Yohanna: I don't think we have a use for the scraped data once it is successfully loaded into the database.

jpmckinney (Member Author) commented Jul 22, 2020

@romifz Great. I am thinking we will eventually need to start deleting some collections from the database, in which case we will want backups of the scraped data – just in case we want to load it back in.

@duncandewhurst (Contributor)

Sounds good to me. I've not had to go back to data on disk for anything, but archival sounds sensible if we are going to be deleting data from the database.

jpmckinney (Member Author) commented Jul 22, 2020

Thanks, @romifz @duncandewhurst!

I spent a little time looking into how to implement this, so updating here:

For the automated version of this script, it would list files in S3 to decide whether there is already a monthly/yearly backup, and it would also rotate out monthly backups from the previous year.
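
A minimal sketch of that S3 side, assuming backups are stored under <source_id>/<year>/<month>/ key prefixes in a bucket named kingfisher-archive (the bucket name and key layout are assumptions):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "kingfisher-archive"  # assumed bucket name


def monthly_backup_exists(source_id, year, month):
    """Check whether a backup already exists for this source and month."""
    prefix = f"{source_id}/{year}/{month:02d}/"
    return s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1).get("KeyCount", 0) > 0


def rotate_monthly_backups(source_id, year, keep_month):
    """Delete a previous year's monthly backups, keeping only the chosen month as the yearly backup."""
    for month in range(1, 13):
        if month == keep_month:
            continue
        prefix = f"{source_id}/{year}/{month:02d}/"
        # Note: list_objects_v2 returns at most 1,000 keys per call; enough for a sketch.
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```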

@robredpath (Contributor)

jpmckinney (Member Author) commented Aug 6, 2020

Having applied the criteria to the data from the old framework (ocdsdata) (see activity log in #150), I can now more specifically describe how to apply the criteria.

There are two use cases:

  1. Back up recently completed collections, in case a server goes down and the collection was not fully loaded into Kingfisher Process. That way, the collection can be restored for analysts to continue their work.
  2. Back up and rotate collections, so that we have a historical archive of OCDS publications, which may be of future interest and use.

To satisfy the first use case, the script should be run once a day. The process is:

  1. Read the crawl log files in /home/ocdskfs/scrapyd/logs (see Improve documentation on reading log files kingfisher-collect#198 for a library to reuse)
  2. List the data directories in /home/ocdskfs/scrapyd/data
  3. Match each log to a directory
    1. If no directory matches, and the modification time of the log is at least 90 days ago, assume the crawl failed. Delete it. (DON'T DO THIS UNTIL WE'VE MATCHED LOGS TO ARCHIVED DATA)
    2. If any directory is unmatched, log a warning, so that an administrator can decide what to do, and skip it.
  4. If, according to the log, a crawl has not finished:
    1. If the modification time of the log is at least 90 days ago, assume the crawl failed. Delete it and its directory.
    2. Otherwise, skip it. Either the crawl is running, or we are keeping a failed crawl for debugging or salvaging.
  5. If, according to the log, a crawl is a subset of the dataset (e.g. from_date, until_date, sample, etc. were set), or if the directory is empty, skip it.
  6. If a crawl has a data directory and has finished according to its log, we're ready to apply the criteria (below).
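
A sketch of steps 1–4, assuming Scrapyd's log files live under per-source subdirectories and that matching a log to its data directory (and reading whether the crawl finished) comes from the log-parsing library mentioned in (1); both are placeholders here:

```python
import logging
import shutil
import time
from pathlib import Path

LOGS = Path("/home/ocdskfs/scrapyd/logs")
DATA = Path("/home/ocdskfs/scrapyd/data")
NINETY_DAYS = 90 * 86400


def crawl_finished(log_path):
    """Placeholder: the log-parsing library should report whether the crawl closed cleanly."""
    return "Dumping Scrapy stats" in log_path.read_text(errors="replace")


def process_once(match):
    """match: a callable mapping a log file to its data directory, or None if there is none."""
    now = time.time()
    matched = set()
    for log_path in LOGS.glob("**/*.log"):
        data_dir = match(log_path)
        if data_dir is None:
            # Step 3.1: old log with no data directory, so assume the crawl failed.
            # (Don't delete until logs have been matched to archived data.)
            continue
        matched.add(data_dir)
        if not crawl_finished(log_path):
            if now - log_path.stat().st_mtime >= NINETY_DAYS:
                log_path.unlink()       # step 4.1: assume the crawl failed
                shutil.rmtree(data_dir)
            continue                    # step 4.2: running, or kept for debugging/salvaging
        # Steps 5 and 6: skip subsets and empty directories, then apply the criteria (below).
    for data_dir in (path for path in DATA.glob("*/*") if path.is_dir()):
        if data_dir not in matched:
            logging.warning("No log found for %s; skipping", data_dir)  # step 3.2
```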

Using the same library as for (1), as part of the research for this issue, we should catalog the errors that occur. Some are expected (FileError items); others are unexpected (exceptions, etc.). Once we have a catalog, we can determine whether they are all collection errors, or whether some are unimportant (in terms of data collection), like an error response from Kingfisher Process. We can perhaps start by analyzing the stats (open-contracting/kingfisher-collect#198 (comment)) about errors, instead of parsing the errors themselves from the log. Ideally, the catalog of errors should be such that there is at most one error per file (so, for example, we wouldn't count retries).
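
If we do start from the stats rather than the errors themselves, the tail of a Scrapy log contains a "Dumping Scrapy stats:" line followed by a pretty-printed dict; a rough sketch of pulling the integer counters out of it (which keys to count as errors is exactly the open question above):

```python
import re

STATS_RE = re.compile(r"'([^']+)': (\d+)[,}]")


def scrapy_numeric_stats(log_text):
    """Extract the integer-valued entries from the stats dict that Scrapy dumps on close.

    Non-integer entries (timestamps, etc.) are ignored.
    """
    head, sep, tail = log_text.rpartition("Dumping Scrapy stats:")
    if not sep:
        return {}
    return {key: int(value) for key, value in STATS_RE.findall(tail)}


def error_count(stats):
    """Count stats that look like collection errors (the choice of keys is an assumption)."""
    prefixes = ("downloader/exception_count", "log_count/ERROR", "spider_exceptions/")
    return sum(value for key, value in stats.items() if key.startswith(prefixes))
```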

Once that research is done, we can get the:

  1. Source ID (from directory name)
  2. Data version (from directory name)
  3. Year (from data version)
  4. Year-month (from data version)
  5. Checksum of all files (find . -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum) (source)
  6. Number of errors in log
  7. Number of files
  8. Total size in bytes of all files (du -sb)

And store this in a small metadata file alongside the data directory (can be a single row in a comma- or space-delimited file).
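
A sketch of collecting that metadata in Python (the checksum mirrors the md5sum pipeline above: hash each file, sort the digests, hash the sorted list; the error count would come from the log research above, and the <source_id>/<data_version> directory layout is assumed):

```python
import csv
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """md5 of one file, read in chunks to avoid loading large files into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def directory_checksum(directory):
    """md5 of the sorted per-file md5 digests, like the md5sum | sort | md5sum pipeline."""
    digests = sorted(file_md5(path) for path in directory.rglob("*") if path.is_file())
    return hashlib.md5("".join(f"{digest}\n" for digest in digests).encode()).hexdigest()


def collect_metadata(directory, errors):
    """Build the metadata row for one data directory like paraguay_dncp/20200506_123456."""
    directory = Path(directory)
    files = [path for path in directory.rglob("*") if path.is_file()]
    data_version = directory.name
    return {
        "source_id": directory.parent.name,
        "data_version": data_version,
        "year": data_version[:4],
        "year_month": data_version[:6],
        "checksum": directory_checksum(directory),
        "errors": errors,
        "files": len(files),
        "bytes": sum(path.stat().st_size for path in files),
    }


def write_metadata(directory, row):
    """Store the metadata alongside the data directory, as a single-row CSV."""
    with open(f"{directory}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
```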

For the following process, we need to get a list of directories in the archive, along with their metadata. We can do this by storing an "index" file in the archive that has all archived directories' metadata.

To apply the criteria, for each local metadata file:

  1. If it is marked "do not backup" (this comes up later), skip it.
  2. Clean: If the number of errors is excessive (>50% of the number of files), update the metadata file as "do not backup".
    • Rationale: A dataset that is more than half errors is not valuable to backup. For example, see paraguay_dncp 2019 in the spreadsheet.
  3. Find any directory in the archive with the same source ID and year-month.
  4. If there is a directory in the archive with the same source ID and year-month:
    1. Distinct: Compare the checksums. If identical, update the metadata file as "do not backup".
    2. Compare the number of errors, number of files, and total size in bytes.
    3. Complete: If the local directory has 50% more bytes OR 50% more files and greater or equal bytes, replace the remote directory.
      • Rationale: For there to be so many more bytes or files, the previous collection must have been incomplete.
      • See chile_compra_releases 2018 in the spreadsheet to see why the condition considers number of files, not only number of bytes.
      • This disregards differences in number of errors, because "complete" takes priority over "clean". For example, see australia 2019, indonesia_bandung 2018 in the spreadsheet.
    4. Clean: If the local directory has fewer errors, and greater or equal bytes, replace the remote directory.
      • This disregards the number of files, since the number of bytes is a better measure of completeness. For example, see australia_nsw 2019, honduras_sefin 2018 in the spreadsheet.
    5. Periodic: Otherwise, update the metadata file as "do not backup".
  5. If none found, get the metadata for the most recently archived directory with the same source ID.
    1. Distinct: Compare the checksums. If identical, update the metadata file as "do not backup".
    2. Clean: If the local directory has more errors, equal or fewer files, and equal or fewer bytes, update the metadata file as "do not backup".
      • Rationale: If not for the errors, the new collection might have been identical. For example, see moldova_old 2019, ukraine 2018, scotland_public_contracts 2019 in the spreadsheet (Update: the less clean scotland_public_contracts 2019 has more data, so it is not a good example).
    3. Periodic: Otherwise, backup.
      This will yield at most one backed up directory per source ID, per year-month.
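
A sketch of that decision logic, using the metadata fields above ("local" is the newly crawled directory's metadata, "remote" the archived one's; the return values just name the action):

```python
def too_unclean(local):
    """Clean: a collection that is more than half errors is not worth backing up."""
    return local["errors"] > local["files"] * 0.5


def compare_same_period(local, remote):
    """Apply the criteria when the archive has a directory with the same source ID and year-month."""
    if local["checksum"] == remote["checksum"]:
        return "skip"     # distinct: identical data is already archived
    if (local["bytes"] >= remote["bytes"] * 1.5
            or (local["files"] >= remote["files"] * 1.5 and local["bytes"] >= remote["bytes"])):
        return "replace"  # complete: takes priority over clean
    if local["errors"] < remote["errors"] and local["bytes"] >= remote["bytes"]:
        return "replace"  # clean: fewer errors and at least as much data
    return "skip"         # periodic: the archived collection for this period is good enough


def should_backup_new_period(local, last_archived):
    """Apply the criteria against the most recently archived directory with the same source ID."""
    if local["checksum"] == last_archived["checksum"]:
        return False      # distinct
    if (local["errors"] > last_archived["errors"]
            and local["files"] <= last_archived["files"]
            and local["bytes"] <= last_archived["bytes"]):
        return False      # clean: if not for the errors, it might have been identical
    return True           # periodic
```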

There remains the policy of rotating backups annually. Once there is a new collection in a new year, we can:

  1. Collect the metadata for the same source ID in its previous year of publication. (This isn't necessarily the preceding year. For example, a publisher might publish in 2018, not publish in 2019, then resume in 2020.)
  2. Then, apply the same logic as for "If there is a directory in the archive with the same source ID and year-month" to each collection in that year, in chronological order.
    • For example, if the first collection is clean, and no further collection is more than 50% more complete, then the first collection is kept. Or, if the first collection is unclean and another is cleaner (and at least as complete), then the latter is kept. Or, if a further collection is more than 50% more complete, then it is kept.
    • An exception is Digiwhist. Since Digiwhist only publishes once or twice a year, we should start with the last collection and go in reverse chronological order. Otherwise, the first collection risks being the same as the prior year.
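
Reusing compare_same_period() from the sketch above, the annual rotation could look like this (the digiwhist prefix check is an assumption about source naming):

```python
def yearly_keeper(collections, source_id):
    """Reduce one source's archived collections for a year to the single one to keep.

    collections: metadata dicts for that source ID and year, in chronological order.
    """
    if source_id.startswith("digiwhist"):
        collections = list(reversed(collections))  # start from the last collection
    keeper = collections[0]
    for candidate in collections[1:]:
        if compare_same_period(candidate, keeper) == "replace":
            keeper = candidate
    return keeper
```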

@jpmckinney (Member Author)

Having written this all out, and performed similar tasks for #150, I think a technical approach might involve a Makefile to create the metadata files (since we can use find, du and md5sum very simply), and then a Python script to do the comparisons, backups and replacements.

For Makefile style, see script patterns.

jpmckinney (Member Author) commented Aug 6, 2020

Noting here that if we rename a source in Kingfisher Collect, we should rename the backups in the archive accordingly. (This came up a few times in #150, e.g. scotland became scotland_public_contracts, and chile_compra became chile_compra_releases.) We can add this to eventual documentation (open-contracting-archive/kingfisher-archive#15).

jpmckinney (Member Author) commented Nov 13, 2020

Regarding:

Rationale: If not for the errors, the new collection might have been identical.

It's possible that a publisher changes their publication pattern and decreases the size of their publication (in number of bytes or files), while having a non-zero error rate, in which case the data might have changed, but our policy would not back it up.

We could avoid this by storing all file names, sizes and checksums; if an earlier backup contains the newer backup, then we don't back it up.
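
A sketch of that containment check, assuming the metadata were extended with a per-file listing that maps each relative file name to its size and checksum:

```python
def contains(earlier_files, newer_files):
    """True if every file in the newer collection also appears, unchanged, in the earlier backup.

    Each argument maps a relative file name to a (size, checksum) tuple.
    """
    return all(
        earlier_files.get(name) == (size, checksum)
        for name, (size, checksum) in newer_files.items()
    )
```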

On the other hand, some publishers change dates according to the current timestamp, which causes checksums but not sizes to change. In the updated policy, these would always be backed up.

I think we can keep the current policy for now, as this is only an issue if a source continuously has more errors in new collections than in the last archived collection.

@jpmckinney (Member Author)

To date, we have never restored any data from the archive server. As such, to simplify our data processes and reduce the amount of software we maintain, Kingfisher Archive will be abandoned. With the data registry, we will have regular collections of all published data, which will be stored in their final compiled form. If necessary, it would be easy enough to retain a history (the current need is just to have the current/previous collection). With Dogsbody, we are about to deploy database backups for the Kingfisher Process server (#254), with daily incremental backups and weekly full backups for 5 weeks. These two cover our established needs.
