Data organization and retention policy for local data #154
Comments
Mine are.
Not really, at least from my part. It would be better to include the day.
So far I've never had the need to use files downloaded by someone else. Both proposals on data organization and retention sound good to me. In general, I've never needed to reuse locally downloaded or scraped data after loading it into kingfisher-process, except perhaps once or twice when local files were not correctly loaded due to some mistake of mine.
Mine are.
I did this to avoid cluttering up /home/ocdskfp, because sometimes publishers share several iterations of data in a short period. I'm happy with consistently including the day though.
Being able to interact with each other's files is useful for troubleshooting, so I'd favour having a local_load directory and appending the analyst's name to the sub-directories. The archival and retention policies sound good to me. For local loads, I think we should create some supporting process documentation and perhaps a checklist template in the CRM so the steps aren't forgotten.
Mine are.
That was just to avoid having too many directories, since several data updates can occur in a short period. We can include the day (DD).
I think that being able to access other people's files can be helpful in some cases.
Documented here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#load-local-data
CRM-5533 to apply the policy and process.
cc @romifz @aguilerapy @duncandewhurst since you seem to do most of the local loads.
Follow-up to #155
Data organization
There are quite a few directories in /home/ocdskfp.
Are these directories all local loads? How should we organize these directories?
Proposal:
Either each analyst has their own directory (like the andres and andromina directories) so that we know whom to ask about a directory. Within that directory, there would be directories following the pattern of YYYY-MM-DD-<source-id>_local, e.g. 2020-05-05-portugal_local.
Or directories are kept together, with the analyst's name appended, e.g. 2020-05-05-portugal_local-duncan.
I see some directories omit the day (DD). Any reason for that?
Data retention
What should be our deletion schedule and archival process? ("Archiving" means holding onto forever.)
My proposed criteria for archival (of either crawled or local data) are:
We can adjust the time window and/or have a more complicated schedule, e.g. keep monthly collections within the last year, then keep only annual collections for older years.
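To make the example schedule concrete, here is a rough sketch in Python. It assumes directory names follow the YYYY-MM-DD-<source-id>_local pattern from the data organization proposal, optionally with an analyst suffix, and it assumes that "keep monthly" and "keep annual" mean keeping the latest load per source in each month or year. The function names, the regular expression and the 365-day window are illustrative only and not part of any existing tooling. It only selects names and deletes nothing.

```python
import re
from datetime import date, timedelta

# Assumed naming pattern: YYYY-MM-DD-<source-id>_local, optionally followed by -<analyst>.
NAME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<source>.+?)_local(?:-(?P<analyst>[a-z]+))?$"
)


def parse_name(name):
    """Return (date, source_id, analyst or None), or None if the name does not match."""
    match = NAME_RE.match(name)
    if not match:
        return None
    try:
        day = date.fromisoformat(match["date"])
    except ValueError:
        return None  # e.g. month 13 or day 32
    return day, match["source"], match["analyst"]


def select_to_keep(names, today):
    """Keep the latest load per source per month in the last year, and per year before that."""
    cutoff = today - timedelta(days=365)  # the adjustable time window
    latest = {}
    for name in names:
        parsed = parse_name(name)
        if parsed is None:
            continue  # unrecognized directory: leave it for manual review
        day, source, _analyst = parsed
        # Recent loads are grouped by month, older loads by year.
        period = (day.year, day.month) if day >= cutoff else (day.year,)
        key = (source, period)
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, name)
    return sorted(name for _day, name in latest.values())
```

Directories that don't match the pattern, such as those omitting the day, are skipped rather than guessed at, so they would still need a manual decision.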
Proposal for data retention of local data:
archive directory, to be picked up by an automated process for backup (to be developed).