Data organization and retention policy for local data #154

Closed
jpmckinney opened this issue May 6, 2020 · 5 comments
Labels: discussion, S: kingfisher (Relating to the Kingfisher servers)

Comments


jpmckinney commented May 6, 2020

cc @romifz @aguilerapy @duncandewhurst since you seem to do most of the local loads.

Follow-up to #155

Data organization

There are quite a few directories in /home/ocdskfp.

Are these directories all local loads? How should we organize these directories?

Proposal:

  • Create a directory for each analyst (like the existing andres and romina directories) so that we know whom to ask about a directory. Within that directory, there would be directories following the pattern YYYY-MM-DD-<source-id>_local, e.g. 2020-05-05-portugal_local (see the sketch after this list).
    • In future, we can just have user accounts for each analyst, to avoid the overhead of organizing files diligently: Create individual user accounts instead of sharing ocdskfp #142
    • This assumes that analysts generally do not need to interact with the same files. If that assumption is incorrect, then we should instead have a 'local-load' directory in which we put all data to be loaded locally, following the pattern above but appending the analyst's name to the directory, e.g. 2020-05-05-portugal_local-duncan.
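
A minimal sketch of both layouts (the analyst name duncan and the portugal source are only examples):

# Per-analyst layout:
mkdir -p /home/ocdskfp/duncan/2020-05-05-portugal_local
# Shared layout, if analysts need to work with each other's files:
mkdir -p /home/ocdskfp/local-load/2020-05-05-portugal_local-duncan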

I see some directories omit the day (DD). Any reason for that?

Data retention

What should be our deletion schedule and archival process? ("Archiving" means holding onto the data indefinitely.)

My proposed criteria for archival (of either crawled or local data) are:

  • clean (relatively few request errors or other spider issues in the log files)
  • complete (i.e. not a test, sample or filtered collection)
  • 30 days since the last archived data for the same data source (e.g., if we have multiple collections in the same month, we'd archive only the first)

We can adjust the time window and/or have a more complicated schedule, e.g. keep monthly collections within the last year, then keep only annual collections for older years.
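
As a rough sketch, the 30-day criterion could be checked before archiving a new collection for a source. This assumes GNU date and a hypothetical /home/ocdskfp/archive/<analyst>/ layout that follows the naming pattern above; neither is decided yet:

source_id=portugal  # example source
# Date of the most recently archived collection for this source, if any.
last=$(ls -d /home/ocdskfp/archive/*/????-??-??-"$source_id"_local 2>/dev/null |
  sed 's|.*/\([0-9-]\{10\}\)-.*|\1|' | sort | tail -n 1)
# Archive only if there is no prior archive, or the last one is 30+ days old.
if [ -z "$last" ] || [ $(( ($(date +%s) - $(date -d "$last" +%s)) / 86400 )) -ge 30 ]; then
  echo "OK to archive a new $source_id collection"
fi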

Proposal for data retention of local data:

  • If local data doesn't meet the criteria, the analyst should delete their own local data within 90 days. (In practice, they might delete it soon after loading it into the database.)
  • If local data meets the criteria, the analyst should move it to an archive directory, to be picked up by an automated backup process (to be developed); a sketch of this step follows below.
  • Each quarter, we can review any remaining old local data. Directories created more than 90 days ago can be found using this command:
find . -type d -ctime +90 -not -path '*/ocdskingfisher*' -not -path '*/.*'
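
As mentioned in the second bullet, the archival step itself might be as simple as the following; the archive path is hypothetical, and the real location would be wherever the backup process looks:

# Move a qualifying local load into the (hypothetical) archive directory,
# keeping a per-analyst subdirectory so ownership stays clear.
mkdir -p /home/ocdskfp/archive/duncan
mv /home/ocdskfp/duncan/2020-05-05-portugal_local /home/ocdskfp/archive/duncan/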
@jpmckinney
Member Author


romifz commented May 6, 2020

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

Not really, at least on my part. It would be better to include the day.

This assumes that analysts are generally not needing to interact with the same files.

So far I've never had the need to use files downloaded by someone else.

Both proposals on data organization and retention sound good to me. In general, I've never needed to use locally downloaded or scraped data again after loading it into kingfisher-process, except for one or two times when local files were not loaded correctly due to a mistake of mine.

@duncandewhurst
Contributor

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

I did this to avoid cluttering up /home/ocdskfp, because publishers sometimes share several iterations of data in a short period.

I'm happy with consistently including the day though.

This assumes that analysts are generally not needing to interact with the same files.

Being able to interact with each other's files is useful for troubleshooting, so I'd favour having a local_load directory and appending the analyst's name to the sub-directories.

The archival and retention policies sound good to me. For local loads, I think we should create some supporting process documentation and perhaps a checklist template in the CRM so the steps aren't forgotten.

@aguilerapy
Contributor

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

Just to avoid having too many directories, since several data updates can occur in a short period. We can include the day (DD).

This assumes that analysts are generally not needing to interact with the same files.

I think that being able to access other people's files can be helpful in some cases.

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant May 20, 2020
jpmckinney added the S: kingfisher (Relating to the Kingfisher servers) label May 20, 2020
@jpmckinney
Member Author

Documented here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#load-local-data

CRM-5533 to apply the policy and process.
