Data retention policy for crawled data #153
There are some collections, such as the Paraguay and Colombia ones, that are really big, around 30 GB each (for each version), so maybe in these cases we should remove the stored files just after they are loaded into the database? In general, I was wondering if any analyst uses the downloaded files...
I'm thinking the process would be:
@duncandewhurst @mrshll1001 @romifz Any opinions on the data retention policy for crawled data (data collected by Scrapy – not the data in the database)?
@jpmckinney your proposal sounds good to me. I agree with Yohanna; I don't think we have a use for the scraped data once it is successfully loaded into the database.
@romifz Great. I am thinking we will eventually need to start deleting some collections from the database, in which case we will want backups of the scraped data – just in case we want to load it back in.
Sounds good to me. I've not had to go back to data on disk for anything, but archival sounds sensible if we are going to be deleting data from the database. |
Thanks, @romifz @duncandewhurst! I spent a little time looking into how to implement this, so updating here:
The automated version of this script would list files in S3 to decide whether there is already a monthly/yearly backup, and would also rotate out monthly backups from the previous year.
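For illustration, a minimal sketch of that S3 check using boto3 (the bucket name and the source/year/month key layout are assumptions, not the archive's actual structure):

```python
# Sketch only: assumes backups are stored under keys like "<source>/<year>/"
# (yearly) or "<source>/<year>/<month>/" (monthly). The key layout is hypothetical.
import boto3


def backup_exists(bucket, source, year, month=None):
    """Return True if a yearly (month=None) or monthly backup already exists."""
    prefix = f"{source}/{year}/" if month is None else f"{source}/{year}/{month:02d}/"
    response = boto3.client("s3").list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response["KeyCount"] > 0


def previous_year_keys(bucket, source, current_year):
    """List the previous year's keys, as candidates for rotating out monthly backups."""
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{source}/{current_year - 1}/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```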
Having applied the criteria to the data from the old framework (ocdsdata) (see activity log in #150), I can now more specifically describe how to apply the criteria. There are two use cases:
Using the same library as for (1), as part of the research for this issue, we should catalog the errors that occur. Some are expected (FileError items), others are unexpected (exceptions, etc.). Once we have a catalog, we can determine whether they are all collection errors, or if some are unimportant errors (in terms of data collection) like an error response from Kingfisher Process. We can perhaps start by analyzing the stats (open-contracting/kingfisher-collect#198 (comment)) about errors, instead of parsing the errors themselves from the log. Ideally, the catalog of errors should be such that there is at most one error per file (so, for example, we wouldn't count retries). Once that research is done, we can get the:
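As a rough starting point for that research, a sketch of pulling error-related counters out of the stats that Scrapy dumps at the end of a crawl log; which keys count as collection errors is exactly what the catalog should decide, so the prefixes below are assumptions:

```python
# Sketch only: scans a Scrapyd crawl log for the integer counters that Scrapy
# dumps at the end of a crawl ("Dumping Scrapy stats:"). Treating these prefixes
# as "errors" is an assumption to be refined by the catalog; retry counters
# ('retry/...') are deliberately not matched, per the note about not counting retries.
import re

ERROR_PREFIXES = ("downloader/exception_count", "spider_exceptions/", "log_count/ERROR")


def error_counters(log_path):
    """Return {stat key: count} for error-related counters found in the log."""
    with open(log_path, encoding="utf-8") as f:
        text = f.read()
    counters = {}
    for match in re.finditer(r"'([^']+)': (\d+)", text):
        key, value = match.group(1), int(match.group(2))
        if key.startswith(ERROR_PREFIXES):
            counters[key] = value
    return counters
```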
And store this in a small metadata file alongside the data directory (can be a single row in a comma- or space-delimited file). For the following process, we need to get a list of directories in the archive, along with their metadata. We can do this by storing an "index" file in the archive that has all archived directories' metadata. To apply the criteria, for each local metadata file:
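For illustration, a sketch of the metadata file and the archive index (the field names here are assumptions; the actual fields depend on the research above):

```python
# Sketch only: the exact fields are whatever the research above settles on;
# source, data version, byte count, file count and error count are assumptions
# based on the criteria discussed in this thread.
import csv

FIELDS = ["source", "data_version", "bytes", "files", "errors"]


def write_metadata(path, row):
    """Write the single-row, comma-delimited metadata file alongside a data directory."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow(row)


def read_index(path):
    """Read the archive's index file: one metadata row per archived directory, grouped by source."""
    index = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            index.setdefault(row["source"], []).append(row)
    return index
```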
There remains the policy of rotating backups annually. Once there is a new collection in a new year, we can:
Having written this all out, and performed similar tasks for #150, I think a technical approach might involve a Makefile to create the metadata files (since we can use …). For Makefile style, see script patterns.
Noting here that if we rename a source in Kingfisher Collect, we should rename the backups in the archive accordingly. (This came up a few times in #150, e.g. scotland became scotland_public_contracts, and chile_compra became chile_compra_releases.) We can add this to eventual documentation (open-contracting-archive/kingfisher-archive#15).
Regarding:
It's possible that a publisher changes their publication pattern and decreases the size of their publication (in number of bytes or files), while having a non-zero error rate, in which case the data might have changed, but our policy would not back it up. We could avoid this by storing all file names, sizes and checksums, and if an earlier backup contains the newer backup, then we don't back up. On the other hand, some publishers change dates according to the current timestamp, which causes checksums but not sizes to change. In the updated policy, these would always be backed up. I think we can keep the current policy for now, as this is only an issue if a source continuously has more errors in new collections than in the last archived collection.
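A sketch of that comparison, assuming a per-directory manifest of names, sizes and checksums is practical to compute for collections of this size (MD5 is an arbitrary choice here):

```python
# Sketch only: builds {relative path: (size, checksum)} for a data directory, so
# a newer collection can be checked against an earlier backup's manifest.
import hashlib
import os


def manifest(directory):
    """Map each file's relative path to its (size in bytes, MD5 hex digest)."""
    entries = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            entries[os.path.relpath(path, directory)] = (os.path.getsize(path), digest.hexdigest())
    return entries


def contains(earlier, newer):
    """True if every file in the newer manifest appears, unchanged, in the earlier one."""
    return all(earlier.get(path) == entry for path, entry in newer.items())
```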
To date, we have never restored any data from the archive server. As such, to simplify our data processes and limit our total software, Kingfisher Archive will be abandoned. With the data registry, we will have regular collections of all published data, which will be stored in their final compiled form. If necessary, it would be easy enough to retain a history (the current needs are just to have the current/previous collection). With Dogsbody, we are about to deploy database backups for the Kingfisher Process server (#254). This will have daily incremental backups and weekly full backups for 5 weeks. These two cover our established needs.
Follow-up to #155
This data is stored in /home/ocdskfs/scrapyd/data, which is the second largest use of disk space. (The largest is the database. Related: open-contracting/kingfisher-process#269)
#154 proposes criteria for archival. Repeating here:
CRM-5533 has been used to delete collections from the database itself. In that issue, we mainly deleted:
For the remaining collections, we can write code to apply the criteria:
Update: This archiving should include the relevant Scrapyd log file. #135