Kingfisher Archive: Disk clean up #150
I've reviewed the contents of /home/archive, as I believe I created those directories. It's hard to say to what degree we 'need' the data in those directories. I don't see any sign that it's in regular use as part of anyone's job right now; however, the purpose of the archive was to provide some degree of an historical record of data, in case we ever wanted to do change-over-time analysis, or suspected that records had been altered.

The reason that /home/archive/ocdsdata_archive is so much smaller is that I compressed the directories massively, so that the data wasn't lost but took up a trivially small amount of space. The archive server isn't just a backup server, after all.

@jpmckinney I'm happy to remove any of the data in /home/archive if you're happy that we will never want to do that kind of analysis. Otherwise, I'd be happy to compress all the data there in the same way, and leave an appropriate README to explain its purpose. It's likely to be small enough to handle on a personal machine once compressed - in the order of 10GB or so - so it could be handled manually when we migrate away from the archive server in time.

I don't think we need any of the other files, either, but I'll check this with others this week.
I think that's true as well. But, again, I'll check with others this week.
@robredpath Can you explain the differences (e.g. in terms of provenance or otherwise) between ocdsdata_archive, ocdskfs-old-data and ocdskingfisher-user-data? I also see some overlap between ocdskingfisher-user-data and the actively-updated
Late Nov 2019 was the server crash when all our hard disks went. So I think these are backups related to that and can now be just removed.
I'm not sure what these are. There was an attempt to set up replication at one point that was dropped; these could be related to that. But in any case, I think we can say remove them.
These are all from the first version of Kingfisher, before the rewrite that separated scrape and process. (The metadb.sqlite files on disk are the giveaway.) The first one contains compressed files; the others do not.
Just because both have afghanistan_records doesn't mean there is overlap. The same source may have been run multiple times. Looking at the dates for each afghanistan_records, one runs 2018-10-03 to 2019-02-07 and the other runs 2019-02-07 to 2020-01-23. The ones on 2019-02-07 have different times, oddly.
Let me comment on that shortly
To build on Rob's comment,
Because of this I would suggest this isn't the place to make any decisions - I'd suggest
Removing specific publishers entirely, removing old data sets, or removing some data sets when we have many for the same publisher are all issues that will apply to our new system too, and they're also issues where the analysts can help us clarify user needs - so I'd move any such considerations to a future conversation around our new system.
We've had those discussions: #153, #154. The data retention policy is here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#data-retention-policy

Now, to apply the policy, I need to know more about the provenance of:
Okay, but how are they different? Are they different in terms of provenance (one old directory versus another)? In terms of time (one copied over at one time versus another)? Can nothing be inferred from the directory names?
So, I have:
It looks broadly like the directories follow on from each other - I think, therefore, that we've got enough to meaningfully archive this data. We know the name of the source and the date of creation for each directory of data. Some of it will certainly require different handling because it came from a different system, but that's something we can document.

@jpmckinney @odscjames shall we, then, run through each of these directories, and:
I wouldn't remember enough to be able to answer with certainty, I'm sorry.
Thanks, that's good to read.

In terms of the 3 clauses (Clean, Complete, Periodic): the first 2 are hard to evaluate just from the data. In terms of Clean, some errors may be things the analysts note as "bad data" in their report while still continuing to review the rest of the data. Other errors may mean that analysts write a whole collection off. We can't know just from what we have on disk. This is true of current runs we save, too. Complete is also hard to tell, though we can at least take out the sample ones easily.

I'd maybe suggest putting this on hold until we do open-contracting-archive/kingfisher-archive#10 - which I think we want to talk about next anyway. That seems better anyway (if we deactivate the archive server THEN switch to S3, there is a period in the middle with no active archive), and in working out and agreeing on a plan for S3 I suspect we would get into some issues (like the above) that might help resolve this one.
Yes, these would have to be supported by the Scrapyd logs (see #153 (comment) and #135), and in any case rely on either human input or on "good-enough" automatic criteria. I'll add clarification to the "Clean" point, which is about errors in data collection, not in the data itself - so it's not an issue of "bad data".

I don't understand why this would be put on hold in favor of open-contracting-archive/kingfisher-archive#10. It's better to take this opportunity to figure out a good process for identifying the data to archive than to start sending data to S3 without having first figured that out...

Anyway, I think I have all the information I'll get about the existing data, so I can make progress on this issue and #153 (comment)
Cleanup log (will be continuously updated).

Preparation
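(The preparation steps themselves aren't listed here. Purely as a sketch - not the commands actually run - a first pass to see how much data each directory holds could be:)

```bash
# Sketch only: measure the size of each top-level directory under /home/archive.
du -sh /home/archive/*/ | sort -h
```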
Decompress archives
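(The decompression commands aren't shown here. A minimal sketch, assuming the archives are .tar.gz files - the format is an assumption, and this is not the exact command used:)

```bash
# Sketch only: extract each .tar.gz next to itself, then remove the archive file.
find /home/archive -name '*.tar.gz' -print0 |
while IFS= read -r -d '' archive; do
  tar -xzf "$archive" -C "$(dirname "$archive")" && rm "$archive"
done
```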
Look for unusual files
In
Similarly,

Remove duplicate directories

I stored the output of
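(The comparison itself isn't reproduced here. As a hedged sketch of how a suspected duplicate between two trees can be confirmed - the source name below is hypothetical:)

```bash
# Sketch only, hypothetical source name: report whether two directory trees differ anywhere.
diff -rq ocdskfs-old-data/some_source ocdskingfisher-user-data/some_source \
  && echo "identical" || echo "different"
```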
Merge the directories

Compare the time spans of each directory:
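(The listing isn't reproduced here. One way to get the span of each directory from file modification times, as a sketch only:)

```bash
# Sketch only: print the oldest and newest file modification date in each directory.
for d in ocdsdata_archive ocdskingfisher-user-data ocdskfs-old-data; do
  printf '%s: ' "$d"
  find "$d" -type f -printf '%TY-%Tm-%Td\n' | sort | sed -n '1p;$p' | paste -sd ' ' -
done
```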
The time spans are non-overlapping. We can merge the directories into one hierarchy:

mkdir merge
cp -al ocdsdata_archive/* ocdskingfisher-user-data/* ocdskfs-old-data/* merge
rm -rf ocdsdata_archive ocdskingfisher-user-data ocdskfs-old-data
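(As a note on that command: cp -al copies by creating hard links rather than duplicating file contents, so the merge takes almost no extra disk space, and the rm -rf of the three source directories afterwards leaves the linked data intact under merge/.)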
Apply criteria

Delete samples
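(The exact command isn't shown here. A sketch, assuming sample collections are directories whose names end in _sample - the naming convention is an assumption:)

```bash
# Sketch only: remove directories whose names end in _sample.
find merge -maxdepth 2 -type d -name '*_sample' -prune -exec rm -rf {} +
```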
Delete empty directories
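(As a sketch, the usual GNU find idiom for this step:)

```bash
# Sketch only: delete empty directories (GNU find's -delete implies depth-first traversal,
# so nested empty directories are removed bottom-up in one pass).
find merge -type d -empty -delete
```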
Find unclean collections
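(How these were identified isn't recorded here; the discussion earlier points at the Scrapyd logs. As a sketch only - the log location and format are assumptions, and logs may not exist for the oldest data:)

```bash
# Sketch only: flag crawls whose Scrapy logs contain ERROR lines, for manual review.
# The log path below is an assumption.
for log in /home/archive/scrapyd-logs/*/*.log; do
  errors=$(grep -c ' ERROR' "$log")
  [ "$errors" -gt 0 ] && echo "$log: $errors error lines"
done
```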
Apply criteria
Before starting the sheet, I had already started deleting some based on a less robust process:

mexico_administracion_publica_federal
uk_contracts_finder
Done! See spreadsheet for determinations of what was kept/deleted. Renamed the directories to match the new style (this makes a lot of warnings, but it succeeds):
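(The rename command itself isn't shown here. Purely as a hypothetical sketch - the old and new naming patterns below are both assumptions, and this is not the command that was run:)

```bash
# Hypothetical sketch only: move old-style directories like some_source-2019-02-07
# into a per-source layout like some_source/20190207_000000. Both patterns are assumed.
for d in merge/*-????-??-??; do
  [ -d "$d" ] || continue
  base=${d##*/}                 # some_source-2019-02-07
  source=${base%-????-??-??}    # some_source
  date=${base#"$source"-}       # 2019-02-07
  mkdir -p "merge/$source"
  mv "$d" "merge/$source/${date//-/}_000000"
done
```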
Do we still need any of these?
I assume the analysis and ocdskfp home directories only exist from when Kingfisher was deployed to this server, and can be deleted.