
Data retention policy for crawled data #153

Closed
jpmckinney opened this issue May 6, 2020 · 14 comments
Labels
S: kingfisher-collect (Relating to the Kingfisher Collect service)

Comments

jpmckinney (Member) commented May 6, 2020

Follow-up to #155

This data is stored in /home/ocdskfs/scrapyd/data, which is the second largest use of disk space. (The largest is the database; related: open-contracting/kingfisher-process#269)

#154 proposes criteria for archival. Repeating here:

  • clean (relatively few request errors or other spider issues in the log files)
  • complete (i.e. not a test, sample, filtered, empty or unclosed collection)
  • 30 days after the last archived data for the same data source (e.g., if we have multiple collections in the same month, we'd archive only the first)

CRM-5533 has been used to delete collections from the database itself. In that issue, we mainly deleted:

  • Unclosed collections (can take up a lot of disk space)
  • Empty collections (take up very little disk space)

For the remaining collections, we can write code to apply the criteria:

  • Group collections by source (already done on disk).
  • If there is only one collection within a 30-day period, archive it, unless it is empty, unclosed, incomplete, or otherwise has a lot of errors (i.e. too unclean).
  • If there are multiple collections 30 days after the last archived data, take the earliest one, unless it fails the clean or complete criteria.
    • We might want the script to be interactive, giving the user an opportunity to decide which to archive. In that case, the script can report the sizes of collections, to help assess completeness. We'd want to do this at least once a month.
  • Data that is older than 90 days can be deleted. This can be a last step of the script that performs the archival.
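
A minimal sketch of the grouping and 30-day selection, assuming data directories are laid out as <source_id>/<data_version> with data versions like 20200506_123456 (the layout and the clean/complete checks are placeholders):

```python
from datetime import datetime, timedelta
from pathlib import Path

DATA_DIR = Path("/home/ocdskfs/scrapyd/data")


def collections_by_source(data_dir=DATA_DIR):
    """Group collection directories by source (the on-disk layout already does this)."""
    for source_dir in sorted(path for path in data_dir.iterdir() if path.is_dir()):
        yield source_dir.name, sorted(path for path in source_dir.iterdir() if path.is_dir())


def candidates_to_archive(versions, last_archived=None, window=timedelta(days=30)):
    """Yield the earliest collection in each 30-day window since the last archived one.

    Each candidate is still subject to the empty/unclosed/incomplete/unclean checks.
    """
    for version in versions:
        crawled_at = datetime.strptime(version.name, "%Y%m%d_%H%M%S")  # assumed name format
        if last_archived is None or crawled_at - last_archived >= window:
            yield version
            last_archived = crawled_at
```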

Update: This archiving should include the relevant Scrapyd log file. #135

@jpmckinney (Member Author)

yolile (Member) commented May 12, 2020

Some collections, like the Paraguay and Colombia ones, are really big (around 30GB per version), so maybe in these cases we should remove the stored files as soon as they are loaded into the database?

In general, I was wondering if any analyst uses the downloaded files...

@jpmckinney (Member Author)

I'm thinking the process would be:

  1. Download data with Kingfisher Collect
  2. Data is loaded into Kingfisher Process
  3. If the downloaded data meets the archival criteria, it is archived, in case we want to restore it at a future date for analysis. Once it is archived, it can be automatically deleted.
  4. If the downloaded data doesn't meet the archival criteria, it is retained for 90 days and then automatically deleted.
  5. For data that is loaded in Kingfisher Process, we'll have a separate policy, which would involve more analyst participation, since analysts might still need old collections.

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant May 20, 2020
jpmckinney added the S: kingfisher label May 20, 2020
jpmckinney added the S: kingfisher-collect and for: ODSC discussion labels and removed the S: kingfisher and for: ODSC labels May 20, 2020
@jpmckinney (Member Author)

@duncandewhurst @mrshll1001 @romifz Any opinions on the data retention policy for crawled data (data collected by Scrapy – not the data in the database)?

romifz commented Jul 22, 2020

@jpmckinney your proposal sounds good to me. I agree with Yohanna: I don't think we have a use for the scraped data once it is successfully loaded into the database.

jpmckinney (Member Author) commented Jul 22, 2020

@romifz Great. I am thinking we will eventually need to start deleting some collections from the database, in which case we will want backups of the scraped data – just in case we want to load it back in.

@duncandewhurst (Contributor)

Sounds good to me. I've not had to go back to data on disk for anything, but archival sounds sensible if we are going to be deleting data from the database.

jpmckinney (Member Author) commented Jul 22, 2020

Thanks, @romifz @duncandewhurst!

I spent a little time looking into how to implement this, so updating here:

For the automated version of this script, it would list files in S3 to decide whether there is already a monthly/yearly backup, and it would also rotate out monthly backups from the previous year.
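
A minimal sketch of that S3 side, assuming backups are stored under <source_id>/<year>/<month>/ key prefixes in a bucket named kingfisher-archive (the bucket name and key layout are assumptions):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "kingfisher-archive"  # assumed bucket name


def monthly_backup_exists(source_id, year, month):
    """Check whether a backup already exists for this source and month."""
    prefix = f"{source_id}/{year}/{month:02d}/"
    return s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1).get("KeyCount", 0) > 0


def rotate_monthly_backups(source_id, year, keep_month):
    """Delete a previous year's monthly backups, keeping only the chosen month as the yearly backup."""
    for month in range(1, 13):
        if month == keep_month:
            continue
        prefix = f"{source_id}/{year}/{month:02d}/"
        # Note: list_objects_v2 returns at most 1,000 keys per call; enough for a sketch.
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```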

@robredpath (Contributor)

jpmckinney (Member Author) commented Aug 6, 2020

Having applied the criteria to the data from the old framework (ocdsdata) (see activity log in #150), I can now more specifically describe how to apply the criteria.

There are two use cases:

  1. Back up recently completed collections, in case a server goes down and the collection was not fully loaded into Kingfisher Process. That way, the collection can be restored for analysts to continue their work.
  2. Back up and rotate collections, so that we have a historical archive of OCDS publications, which may be of future interest and use.

To satisfy the first use case, the script should be run once a day. The process is:

  1. Read the crawl log files in /home/ocdskfs/scrapyd/logs (see Improve documentation on reading log files kingfisher-collect#198 for a library to reuse)
  2. List the data directories in /home/ocdskfs/scrapyd/data
  3. Match each log to a directory
    1. If no directory matches, and the modification time of the log is at least 90 days ago, assume the crawl failed. Delete it. (DON'T DO THIS UNTIL WE'VE MATCHED LOGS TO ARCHIVED DATA)
    2. If any directory is unmatched, log a warning, so that an administrator can decide what to do, and skip it.
  4. If, according to the log, a crawl has not finished:
    1. If the modification time of the log is at least 90 days ago, assume the crawl failed. Delete it and its directory.
    2. Otherwise, skip it. Either the crawl is running, or we are keeping a failed crawl for debugging or salvaging.
  5. If, according to the log, a crawl is a subset of the dataset (e.g. from_date, until_date, sample, etc. were set), or if the directory is empty, skip it.
  6. If a crawl has a data directory and has finished according to its log, we're ready to apply the criteria (below).
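
A sketch of steps 1–4, assuming Scrapyd's log files live under per-source subdirectories and that matching a log to its data directory (and reading whether the crawl finished) comes from the log-parsing library mentioned in (1); both are placeholders here:

```python
import logging
import shutil
import time
from pathlib import Path

LOGS = Path("/home/ocdskfs/scrapyd/logs")
DATA = Path("/home/ocdskfs/scrapyd/data")
NINETY_DAYS = 90 * 86400


def crawl_finished(log_path):
    """Placeholder: the log-parsing library should report whether the crawl closed cleanly."""
    return "Dumping Scrapy stats" in log_path.read_text(errors="replace")


def process_once(match):
    """match: a callable mapping a log file to its data directory, or None if there is none."""
    now = time.time()
    matched = set()
    for log_path in LOGS.glob("**/*.log"):
        data_dir = match(log_path)
        if data_dir is None:
            # Step 3.1: old log with no data directory, so assume the crawl failed.
            # (Don't delete until logs have been matched to archived data.)
            continue
        matched.add(data_dir)
        if not crawl_finished(log_path):
            if now - log_path.stat().st_mtime >= NINETY_DAYS:
                log_path.unlink()       # step 4.1: assume the crawl failed
                shutil.rmtree(data_dir)
            continue                    # step 4.2: running, or kept for debugging/salvaging
        # Steps 5 and 6: skip subsets and empty directories, then apply the criteria (below).
    for data_dir in (path for path in DATA.glob("*/*") if path.is_dir()):
        if data_dir not in matched:
            logging.warning("No log found for %s; skipping", data_dir)  # step 3.2
```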

Using the same library as for (1), as part of the research for this issue, we should catalog the errors that occur. Some are expected (FileError items); others are unexpected (exceptions, etc.). Once we have a catalog, we can determine whether they are all collection errors, or whether some are unimportant (in terms of data collection), like an error response from Kingfisher Process. We can perhaps start by analyzing the stats (open-contracting/kingfisher-collect#198 (comment)) about errors, instead of parsing the errors themselves from the log. Ideally, the catalog of errors should be such that there is at most one error per file (so, for example, we wouldn't count retries).
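
If we do start from the stats rather than the errors themselves, the tail of a Scrapy log contains a "Dumping Scrapy stats:" line followed by a pretty-printed dict; a rough sketch of pulling the integer counters out of it (which keys to count as errors is exactly the open question above):

```python
import re

STATS_RE = re.compile(r"'([^']+)': (\d+)[,}]")


def scrapy_numeric_stats(log_text):
    """Extract the integer-valued entries from the stats dict that Scrapy dumps on close.

    Non-integer entries (timestamps, etc.) are ignored.
    """
    head, sep, tail = log_text.rpartition("Dumping Scrapy stats:")
    if not sep:
        return {}
    return {key: int(value) for key, value in STATS_RE.findall(tail)}


def error_count(stats):
    """Count stats that look like collection errors (the choice of keys is an assumption)."""
    prefixes = ("downloader/exception_count", "log_count/ERROR", "spider_exceptions/")
    return sum(value for key, value in stats.items() if key.startswith(prefixes))
```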

Once that research is done, we can get the:

  1. Source ID (from directory name)
  2. Data version (from directory name)
  3. Year (from data version)
  4. Year-month (from data version)
  5. Checksum of all files (find . -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum) (source)
  6. Number of errors in log
  7. Number of files
  8. Total size in bytes of all files (du -sb)

And store this in a small metadata file alongside the data directory (can be a single row in a comma- or space-delimited file).
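
A sketch of collecting that metadata in Python (the checksum mirrors the md5sum pipeline above: hash each file, sort the digests, hash the sorted list; the error count would come from the log research above, and the <source_id>/<data_version> directory layout is assumed):

```python
import csv
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """md5 of one file, read in chunks to avoid loading large files into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def directory_checksum(directory):
    """md5 of the sorted per-file md5 digests, like the md5sum | sort | md5sum pipeline."""
    digests = sorted(file_md5(path) for path in directory.rglob("*") if path.is_file())
    return hashlib.md5("".join(f"{digest}\n" for digest in digests).encode()).hexdigest()


def collect_metadata(directory, errors):
    """Build the metadata row for one data directory like paraguay_dncp/20200506_123456."""
    directory = Path(directory)
    files = [path for path in directory.rglob("*") if path.is_file()]
    data_version = directory.name
    return {
        "source_id": directory.parent.name,
        "data_version": data_version,
        "year": data_version[:4],
        "year_month": data_version[:6],
        "checksum": directory_checksum(directory),
        "errors": errors,
        "files": len(files),
        "bytes": sum(path.stat().st_size for path in files),
    }


def write_metadata(directory, row):
    """Store the metadata alongside the data directory, as a single-row CSV."""
    with open(f"{directory}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
```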

For the following process, we need to get a list of directories in the archive, along with their metadata. We can do this by storing an "index" file in the archive that has all archived directories' metadata.

To apply the criteria, for each local metadata file:

  1. If it is marked "do not backup" (this comes up later), skip it.
  2. Clean: If the number of errors is excessive (>50% of the number of files), update the metadata file as "do not backup".
    • Rationale: A dataset that is more than half errors is not valuable to backup. For example, see paraguay_dncp 2019 in the spreadsheet.
  3. Find any directory in the archive with the same source ID and year-month.
  4. If there is a directory in the archive with the same source ID and year-month:
    1. Distinct: Compare the checksums. If identical, update the metadata file as "do not backup".
    2. Compare the number of errors, number of files, and total size in bytes.
    3. Complete: If the local directory has 50% more bytes OR 50% more files and greater or equal bytes, replace the remote directory.
      • Rationale: For there to be so many more bytes or files, the previous collection must have been incomplete.
      • See chile_compra_releases 2018 in the spreadsheet to see why the condition considers number of files, not only number of bytes.
      • This disregards differences in number of errors, because "complete" takes priority over "clean". For example, see australia 2019, indonesia_bandung 2018 in the spreadsheet.
    4. Clean: If the local directory has fewer errors, and greater or equal bytes, replace the remote directory.
      • This disregards the number of files, since the number of bytes is a better measure of completeness. For example, see australia_nsw 2019, honduras_sefin 2018 in the spreadsheet.
    5. Periodic: Otherwise, update the metadata file as "do not backup".
  5. If none found, get the metadata for the most recently archived directory with the same source ID.
    1. Distinct: Compare the checksums. If identical, update the metadata file as "do not backup".
    2. Clean: If the local directory has more errors, equal or fewer files, and equal or fewer bytes, update the metadata file as "do not backup".
      • Rationale: If not for the errors, the new collection might have been identical. For example, see moldova_old 2019, ukraine 2018, scotland_public_contracts 2019 in the spreadsheet (Update: the less clean scotland_public_contracts 2019 has more data, so it is not a good example).
    3. Periodic: Otherwise, backup.
      This will yield at most one backed up directory per source ID, per year-month.
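
A sketch of that decision logic, using the metadata fields above ("local" is the newly crawled directory's metadata, "remote" the archived one's; the return values just name the action):

```python
def too_unclean(local):
    """Clean: a collection that is more than half errors is not worth backing up."""
    return local["errors"] > local["files"] * 0.5


def compare_same_period(local, remote):
    """Apply the criteria when the archive has a directory with the same source ID and year-month."""
    if local["checksum"] == remote["checksum"]:
        return "skip"     # distinct: identical data is already archived
    if (local["bytes"] >= remote["bytes"] * 1.5
            or (local["files"] >= remote["files"] * 1.5 and local["bytes"] >= remote["bytes"])):
        return "replace"  # complete: takes priority over clean
    if local["errors"] < remote["errors"] and local["bytes"] >= remote["bytes"]:
        return "replace"  # clean: fewer errors and at least as much data
    return "skip"         # periodic: the archived collection for this period is good enough


def should_backup_new_period(local, last_archived):
    """Apply the criteria against the most recently archived directory with the same source ID."""
    if local["checksum"] == last_archived["checksum"]:
        return False      # distinct
    if (local["errors"] > last_archived["errors"]
            and local["files"] <= last_archived["files"]
            and local["bytes"] <= last_archived["bytes"]):
        return False      # clean: if not for the errors, it might have been identical
    return True           # periodic
```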

There remains the policy of rotating backups annually. Once there is a new collection in a new year, we can:

  1. Collect the metadata for the same source ID in its previous year of publication. (This isn't necessarily the preceding year. For example, a publisher might publish in 2018, not publish in 2019, then resume in 2020.)
  2. Then, apply the same logic as for "If there is a directory in the archive with the same source ID and year-month" to each collection in that year, in chronological order.
    • For example, if the first collection is clean, and no further collection is more than 50% more complete, then the first collection is kept. Or, if the first collection is unclean and another is cleaner (and at least as complete), then the latter is kept. Or, if a further collection is more than 50% more complete, then it is kept.
    • An exception is Digiwhist. Since Digiwhist only publishes once or twice a year, we should start with the last collection and go in reverse chronological order. Otherwise, the first collection risks being the same as the prior year.
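
Reusing compare_same_period() from the sketch above, the annual rotation could look like this (the digiwhist prefix check is an assumption about source naming):

```python
def yearly_keeper(collections, source_id):
    """Reduce one source's archived collections for a year to the single one to keep.

    collections: metadata dicts for that source ID and year, in chronological order.
    """
    if source_id.startswith("digiwhist"):
        collections = list(reversed(collections))  # start from the last collection
    keeper = collections[0]
    for candidate in collections[1:]:
        if compare_same_period(candidate, keeper) == "replace":
            keeper = candidate
    return keeper
```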

@jpmckinney (Member Author)

Having written this all out, and performed similar tasks for #150, I think a technical approach might involve a Makefile to create the metadata files (since we can use find, du and md5sum very simply), and then a Python script to do the comparisons, backups and replacements.

For Makefile style, see script patterns.

jpmckinney (Member Author) commented Aug 6, 2020

Noting here that if we rename a source in Kingfisher Collect, we should rename the backups in the archive accordingly. (This came up a few times in #150, e.g. scotland became scotland_public_contracts, and chile_compra became chile_compra_releases.) We can add this to eventual documentation (open-contracting-archive/kingfisher-archive#15).

jpmckinney (Member Author) commented Nov 13, 2020

Regarding:

Rationale: If not for the errors, the new collection might have been identical.

It's possible that a publisher changes their publication pattern and decreases the size of their publication (in number of bytes or files), while having a non-zero error rate, in which case the data might have changed, but our policy would not back it up.

We could avoid this by storing all file names, sizes and checksums; if an earlier backup contains the newer backup, then we don't back it up.
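
A sketch of that containment check, assuming the metadata were extended with a per-file listing that maps each relative file name to its size and checksum:

```python
def contains(earlier_files, newer_files):
    """True if every file in the newer collection also appears, unchanged, in the earlier backup.

    Each argument maps a relative file name to a (size, checksum) tuple.
    """
    return all(
        earlier_files.get(name) == (size, checksum)
        for name, (size, checksum) in newer_files.items()
    )
```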

On the other hand, some publishers change dates according to the current timestamp, which causes checksums but not sizes to change. In the updated policy, these would always be backed up.

I think we can keep the current policy for now, as this is only an issue if a source continuously has more errors in new collections than in the last archived collection.

@jpmckinney (Member Author)

To date, we have never restored any data from the archive server. As such, to simplify our data processes and reduce the amount of software we maintain, Kingfisher Archive will be abandoned. With the data registry, we will have regular collections of all published data, which will be stored in their final compiled form. If necessary, it would be easy enough to retain a history (the current need is just to have the current/previous collection). With Dogsbody, we are about to deploy database backups for the Kingfisher Process server (#254), with daily incremental backups and weekly full backups for 5 weeks. These two cover our established needs.
