Data organization and retention policy for local data #154

Closed
jpmckinney opened this issue May 6, 2020 · 5 comments
Labels: discussion, S: kingfisher (Relating to the Kingfisher servers)

Comments


jpmckinney commented May 6, 2020

cc @romifz @aguilerapy @duncandewhurst since you seem to do most of the local loads.

Follow-up to #155

Data organization

There are quite a few directories in /home/ocdskfp.

Are these directories all local loads? How should we organize these directories?

Proposal:

  • Create a directory for each analyst (like the existing andres and romina directories) so that we know whom to ask about a directory. Within that directory, there would be directories following the pattern YYYY-MM-DD-<source-id>_local, e.g. 2020-05-05-portugal_local (see the sketch after this list).
    • In future, we can just have user accounts for each analyst, to avoid the overhead of organizing files diligently: Create individual user accounts instead of sharing ocdskfp #142
    • This assumes that analysts generally do not need to interact with the same files. If that assumption is incorrect, then we should instead have a 'local-load' directory in which we put all data to be loaded locally, following the pattern above but appending the analyst's name to the directory, e.g. 2020-05-05-portugal_local-duncan.
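
A minimal sketch of both layouts (the analyst name duncan and the portugal source are only examples):

# Per-analyst layout:
mkdir -p /home/ocdskfp/duncan/2020-05-05-portugal_local
# Shared layout, if analysts need to work with each other's files:
mkdir -p /home/ocdskfp/local-load/2020-05-05-portugal_local-duncan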

I see some directories omit the day (DD). Any reason for that?

Data retention

What should be our deletion schedule and archival process? ("Archiving" means holding onto the data indefinitely.)

My proposed criteria for archival (of either crawled or local data) are:

  • clean (relatively few request errors or other spider issues in the log files)
  • complete (i.e. not a test, sample or filtered collection)
  • 30 days since the last archived data for the same data source (e.g., if we have multiple collections in the same month, we'd archive only the first)

We can adjust the time window and/or have a more complicated schedule, e.g. keep monthly collections within the last year, then keep only annual collections for older years.
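
As a rough sketch, the 30-day criterion could be checked before archiving a new collection for a source. This assumes GNU date and a hypothetical /home/ocdskfp/archive/<analyst>/ layout that follows the naming pattern above; neither is decided yet:

source_id=portugal  # example source
# Date of the most recently archived collection for this source, if any.
last=$(ls -d /home/ocdskfp/archive/*/????-??-??-"$source_id"_local 2>/dev/null |
  sed 's|.*/\([0-9-]\{10\}\)-.*|\1|' | sort | tail -n 1)
# Archive only if there is no prior archive, or the last one is 30+ days old.
if [ -z "$last" ] || [ $(( ($(date +%s) - $(date -d "$last" +%s)) / 86400 )) -ge 30 ]; then
  echo "OK to archive a new $source_id collection"
fi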

Proposal for data retention of local data:

  • If local data doesn't meet the criteria, the analyst should delete their own local data within 90 days. (In practice, they might delete it soon after loading it into the database.)
  • If local data meets the criteria, the analyst should move it to an archive directory, to be picked up by an automated backup process (to be developed); a sketch of this step follows below.
  • Each quarter, we can review any remaining old local data. Directories created more than 90 days ago can be found using this command:
find . -type d -ctime +90 -not -path '*/ocdskingfisher*' -not -path '*/.*'
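
As mentioned in the second bullet, the archival step itself might be as simple as the following; the archive path is hypothetical, and the real location would be wherever the backup process looks:

# Move a qualifying local load into the (hypothetical) archive directory,
# keeping a per-analyst subdirectory so ownership stays clear.
mkdir -p /home/ocdskfp/archive/duncan
mv /home/ocdskfp/duncan/2020-05-05-portugal_local /home/ocdskfp/archive/duncan/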
@jpmckinney
Member Author


romifz commented May 6, 2020

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

Not really, at least on my part. It would be better to include the day.

This assumes that analysts are generally not needing to interact with the same files.

So far I've never had the need to use files downloaded by someone else.

Both proposals on data organization and retention sound good to me. In general, I've never needed to use locally downloaded or scraped data again after loading it into kingfisher-process, except for one or two times when local files were not loaded correctly due to a mistake of mine.

@duncandewhurst
Contributor

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

I did this to avoid cluttering up /home/ocdskfp, because publishers sometimes share several iterations of data in a short period.

I'm happy with consistently including the day though.

This assumes that analysts are generally not needing to interact with the same files.

Being able to interact with each other's files is useful for troubleshooting, so I'd favour having a local_load directory and appending the analyst's name to the sub-directories.

The archival and retention policies sound good to me. For local loads, I think we should create some supporting process documentation and perhaps a checklist template in the CRM so the steps aren't forgotten.

@aguilerapy
Contributor

Are these directories all local loads?

Mine are.

I see some directories omit the day (DD). Any reason for that?

Just to avoid having too many directories, since several data updates can occur in a short period. We can include the day (DD).

This assumes that analysts are generally not needing to interact with the same files.

I think that being able to access other people's files can be helpful in some cases.

jpmckinney transferred this issue from open-contracting-archive/kingfisher-vagrant May 20, 2020
jpmckinney added the S: kingfisher (Relating to the Kingfisher servers) label May 20, 2020
@jpmckinney
Member Author

Documented here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#load-local-data

CRM-5533 to apply the policy and process.
