
Kingfisher Archive: Disk clean up #150

Closed
jpmckinney opened this issue May 12, 2020 · 13 comments

jpmckinney commented May 12, 2020

Do we still need any of these?

  • /root
    • 2019-nov-main-server-backup (961G)
    • 2019-dec-main-server-backup (70GB)
  • /home/ocdskingfisher
    • ocdskingfisher.dump (March 2019)
    • ocdsdata.dump (March 2019)
  • /home/archive
    • ocdsdata_archive (March 2019, 8GB)
    • ocdskfs-old-data/old-data (August 2019, 69GB)
  • ocdskingfisher-user-data/data (August 2019, 72GB)

I assume the analysis and ocdskfp home directories only exist from when Kingfisher was deployed to this server, and can be deleted.
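
For reference, the sizes above can be re-checked with du (a sketch; -s summarizes each directory, -h prints human-readable sizes):

    du -sh /root/2019-nov-main-server-backup /root/2019-dec-main-server-backup
    du -sh /home/archive/ocdsdata_archive /home/archive/ocdskfs-old-data/old-data /home/archive/ocdskingfisher-user-data/data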

@robredpath commented:

I've reviewed the contents of /home/archive, as I believe I created those directories.

It's hard to say to what degree we 'need' the data in those directories. I don't see any sign that it's in regular use as part of anyone's job right now. However, the purpose of the archive was to provide some degree of historical record of the data, in case we ever wanted to do change-over-time analysis or suspected that records had been altered. The reason that /home/archive/ocdsdata_archive is so much smaller is that I compressed the directories heavily, so that the data wasn't lost but took up only a trivial amount of space. The archive server isn't just a backup server, after all.

@jpmckinney I'm happy to remove any of the data in /home/archive if you're happy that we will never want to do that kind of analysis. Otherwise, I'd be happy to compress all the data there in the same way, and leave an appropriate README to explain its purpose. It's likely to be small enough to handle on a personal machine once compressed - on the order of 10GB or so - so it could be handled manually when we migrate away from the archive server in time.
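
For context, per-crawl compression along those lines could look roughly like this (a sketch only, assuming one directory per source and crawl date; not necessarily the exact commands used at the time):

    cd /home/archive/ocdsdata_archive
    # compress each source/date directory with xz, then remove the original
    for dir in */*/; do
        tar -cJf "${dir%/}.tar.xz" "$dir" && rm -r "$dir"
    done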

I don't think we need any of the other files, either, but I'll check this with others this week.

I assume the analysis and ocdskfp home directories only exist from when Kingfisher was deployed to this server, and can be deleted.

I think that's true as well. But, again, I'll check with others this week.

jpmckinney commented Jul 27, 2020

@robredpath Can you explain the differences (e.g. in terms of provenance or otherwise) between ocdsdata_archive, ocdskfs-old-data and ocdskingfisher-user-data?

I also see some overlap between ocdskingfisher-user-data and the actively-updated data directory (e.g. some afghanistan_records folders), and between ocdskfs-old-data and ocdskingfisher-user-data (e.g. some chile_compra folders). Archival is only useful if properly labelled and categorized, without duplication, so we need to sort this out.

@odscjames commented:

/root/2019-nov-main-server-backup (961G)
/root/2019-dec-main-server-backup (70GB)

Late Nov 2019 was the server crash when all our hard disks failed. So I think these are backups related to that and can now simply be removed.

/home/ocdskingfisher/*.dump

I'm not sure what these are. There was an attempt to set up replication at one point that was later dropped; these could be related to that. But in any case, I think we can remove them.

/home/archive/ocdsdata_archive
/home/archive/ocdskfs-old-data/old-data
/home/archive/ocdskingfisher-user-data/data

These are all from the first version of Kingfisher, before the rewrite that separated scrape and process (metadb.sqlite3 files on disk are the giveaway). The first one contains compressed files; the others do not.

I also see some overlap between ocdskingfisher-user-data and the actively-updated data directory (e.g. some afghanistan_records folders)

Just because both have afghanistan_records doesn't mean there is overlap. The same source may have been run multiple times. Looking at the dates for each afghanistan_records, one runs 2018-10-03 to 2019-02-07 and the other runs 2019-02-07 to 2020-01-23. The ones on 2019-02-07 have different times, oddly.
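
For anyone reproducing this, the date range of each afghanistan_records directory can be read off the crawl-timestamp directory names (a sketch; the exact paths here are assumed):

    for d in /home/archive/ocdskingfisher-user-data/data/afghanistan_records /home/archive/data/afghanistan_records; do
        echo "$d:"
        ls "$d" | sort | sed -n '1p;$p'   # earliest and latest crawl directory
    done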

Do we still need any of these?

Let me comment on that shortly

odscjames commented Jul 30, 2020

Do we still need any of these?

To build on Rob's comment,

however the purpose of the archive was to provide some degree of an historical record of data, in case we ever wanted to do change-over-time analysis, or suspected that records had been altered.

Because of this, I would suggest this isn't the place to make any decisions. Instead, I'd suggest:

  • We remove the one-off data (the backups and dumps).
  • For all data that is organised by source and date, we make sure it is migrated to the new system that we will be setting up as we work on the task "Move to S3 or equiv and shutdown archive server".

Questions such as removing specific publishers entirely, removing old data sets, or removing some data sets when we have many for the same publisher will apply to our new system too, and they are also questions the analysts can help us clarify user needs on - so I'd move any such considerations to a future conversation around our new system.

@jpmckinney commented:

We've had those discussions: #153, #154. The data retention policy is here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#data-retention-policy

Now, to apply the policy, I need to know more about the provenance of:

/home/archive/ocdsdata_archive
/home/archive/ocdskfs-old-data/old-data
/home/archive/ocdskingfisher-user-data/data

These are all from the first version of Kingfisher, before the rewrite that separated scrape and process (metadb.sqlite3 files on disk are the giveaway). The first one contains compressed files; the others do not.

Okay, but how are they different? Are they different in terms of provenance (one old directory versus another)? In terms of time (one copied over at one time versus another)? Can nothing be inferred from the directory names?

@robredpath commented:

So, having looked into this:

It looks broadly like the directories follow on from each other - ocdsdata_archive, then ocdskingfisher-user-data, then ocdskfs-old-data. The directories are consistently named.

I think, therefore, that we've got enough to meaningfully archive this data. We know the name of the source and the date of creation for each directory of data. Some of it will certainly require different handling because it came from a different system, but that's something we can document.

@jpmckinney @odscjames shall we, then, run through each of these directories, and:

  • Delete any that have a corresponding entry already in /home/archive/data
  • Create a tarball for each one that doesn't have a corresponding entry already, in the manner of the output of the current version of the kf-archive script?

@odscjames commented:

Okay, but how are they different? Are they different in terms of provenance (one old directory versus another)? In terms of time (one copied over at one time versus another)? Can nothing be inferred from the directory names?

I wouldn't remember enough to be able to answer with certainty, I'm sorry.

The data retention policy is here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#data-retention-policy

Thanks, that's good to read.

In terms of the 3 clauses (Clean, Complete, Periodic): the first 2 are hard to evaluate just from the data. In terms of Clean, some errors may be things the analysts note as "bad data" in their report while still continuing to review the rest of the data; other errors may mean that analysts write a whole collection off. We can't know just from what we have on disk. This is true of the current runs we save, too. Complete is also hard to tell, though we can at least take out the sample ones easily.

I'd maybe suggest putting this on hold until we do open-contracting-archive/kingfisher-archive#10 - which I think we want to talk about next anyway. That seems better anyway (if we deactivate the archive server THEN switch to S3, there is a period in the middle with no active archive), and in working out and agreeing on a plan for S3 I suspect we would run into some issues (like the above) that might help resolve this one.

jpmckinney commented Aug 4, 2020

The first 2 are hard to evaluate just from the data

Yes, these would have to be supported by the Scrapyd logs (see #153 (comment) and #135), and in any case rely on either human input or on "good-enough" automatic criteria. I'll add clarification to the "Clean" point, which is about errors in data collection not in the data itself – so it's not an issue of "bad data".

I don't understand why this would be put on hold in favor of open-contracting-archive/kingfisher-archive#10. It's better to take this opportunity to figure out a good process for identifying the data to archive, than to start sending data to S3 without having first figured that out...

Anyway, I think I have all the information I'll get about the existing data, so I can make progress on this issue and #153 (comment)


jpmckinney commented Aug 5, 2020

Cleanup log (will be continuously updated).

Preparation

  • Reduce unnecessary file hierarchy (e.g. ocdsdata_archive/data -> ocdsdata_archive)
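
A minimal sketch of that flattening, assuming the extra data level shown in the example and no name collisions:

    mv ocdsdata_archive/data/* ocdsdata_archive/
    rmdir ocdsdata_archive/data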

Decompress archives

  • Decompress .tar.xz files in ocdsdata_archive (performed earlier by @robredpath)
  • Decompress 2 .tar.xz files in ocdskingfisher-user-data
  • Delete .tar.xz files
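
Sketched as commands (paths assumed), the decompression and cleanup above amount to something like:

    # extract each archive next to itself, then remove the .tar.xz files
    find ocdskingfisher-user-data -name '*.tar.xz' -execdir tar -xJf {} \;
    find ocdskingfisher-user-data -name '*.tar.xz' -delete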

Look for unusual files

find . -type f -not -name 'metadb.sqlite3' -not -name '*.json' -not -name '*.jsonlines'

  • *.sqlite3-journal files were deleted.
  • afghanistan_records has filenames that look like md5 sums, instead of *.json files.
  • There were some archive files containing data from existing directories. I decompressed those files and recursive diff'ed with those directories (no differences). So I deleted them.
    • ocdskingfisher-user-data/uk_contracts_finder/uk_data.tar.gz (2018-10-15-11-56-08)
    • ocdskingfisher-user-data/uruguay/uruguay.tar.gz (2018-10-01-15-00-06)
  • digiwhist_* directories contain data.jsonlines and data.jsonlines-temp.tar.gz pairs. I decompressed one and used cmp; the decompressed file was identical to data.jsonlines. I ran find . -name '*.jsonlines-temp.tar.gz' -exec sh -c 'echo {}; tar tvf {}' \; to check that the archive files contained a single file in the first few cases. I ran find . -name '*.jsonlines-temp.tar.gz' -exec sh -c 'ls $(echo {} | cut -d- -f1-6)' \; to check that all archive files had a decompressed pair. Then, I deleted the archive files.

In ocdsdata_archive/australia/2018-06-19-15-05-38:

  • There's both a metadb.sqlite3 and a metadb.sqlite3REALONE with the same size in bytes, and the same number of rows in filestatus. I dumped both (sqlite3 metadb.sqlite3 .dump > a, and likewise for b) and compared them (diff -U0 a b | less). The only differences are that the "REALONE" has different datetimes, and the other one has start/finished times and fetch_success for the session.
  • This directory, uniquely, has *.json-fixed files, whose first part (up to -fixed) always matches another file. About half the .json files have a *.json-fixed pair. I ran find . -name 'packages-*-fixed' -exec sh -c 'diff -q {} $(echo {} | cut -c1-48)' \; and find . -name 'type-*-fixed' -exec sh -c 'diff -q {} $(echo {} | cut -d- -f1-5)' \;. I checked the single pair of files reported in the output, and the difference is that fixed files have NULL bytes (\u0000) removed.
  • So, I deleted metadb.sqlite3 and renamed metadb.sqlite3REALONE to metadb.sqlite3. And I deleted all *-fixed files (find . -name '*.json-fixed' -exec rm -f {} \; because the argument list using bash expansion was too long), since we can handle NULL bytes in Kingfisher Process.

Similarly, ocdskingfisher-user-data/australia/2019-01-24-16-58-13 has metadb.sqlite3 and metadb.sqlite3.complete. The complete file has a single extra filestatus row. The fetch start/finish times are two minutes earlier, and the fetch_status is 0 instead of 1. Since it has an extra row, I deleted metadb.sqlite3 and renamed metadb.sqlite3.complete to metadb.sqlite3.
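
For the record, that comparison followed the same dump-and-diff pattern as for the australia directory above (a sketch):

    cd ocdskingfisher-user-data/australia/2019-01-24-16-58-13
    sqlite3 metadb.sqlite3 .dump > a
    sqlite3 metadb.sqlite3.complete .dump > b
    diff -U0 a b | less
    # keep the copy with the extra filestatus row
    rm metadb.sqlite3 a b
    mv metadb.sqlite3.complete metadb.sqlite3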

Remove duplicate directories

I stored the output of find . -mindepth 2 -type d | sort for each of the three top-level directories, and ran comm -12 a b on each pair of files (see the sketch after the results below):

  • chile_compra/2019-01-14-10-35-57 is an empty directory in ocdskfs-old-data, so it was deleted.
  • paraguay_hacienda/2018-09-03-22-29-55 has fewer files in ocdsdata_archive than in ocdskingfisher-user-data (but no differing files, except for metadb.sqlite3), so it was deleted.
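
A sketch of that comparison for one pair of top-level directories (the other pairs are analogous):

    (cd ocdsdata_archive && find . -mindepth 2 -type d | sort) > a
    (cd ocdskingfisher-user-data && find . -mindepth 2 -type d | sort) > b
    comm -12 a b   # source/date directories present in both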

Merge the directories

Compare the time spans of each directory:

find . -mindepth 2 -type d | cut -d/ -f3 | sort

  • ocdsdata_archive: 2018-06-07-13-17-30 to 2018-08-28-17-04-49
  • ocdskingfisher-user-data: 2018-09-03-22-29-55 to 2019-02-12-16-56-25
  • ocdskfs-old-data: 2019-02-26-12-15-28 to 2019-07-17-14-37-03

The time spans are non-overlapping.

We can merge the directories into one hierarchy:

mkdir merge
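# -a preserves attributes and recurses; -l creates hard links instead of copying file data, so the merge is fast and uses no extra disk space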
cp -al ocdsdata_archive/* ocdskingfisher-user-data/* ocdskfs-old-data/* merge
rm -rf ocdsdata_archive ocdskingfisher-user-data ocdskfs-old-data

jpmckinney commented Aug 5, 2020

Apply criteria

Delete samples

  • Check the sizes of samples:

      find . -type d -name '*_sample' -exec du -h {} \;
    
  • Delete samples:

      find . -type d -name '*_sample' -exec rm -rf {} \;
    
    • australia_sample (632K)
    • mexico_jalisco_sample (1.9M)
    • moldova_sample (51M)
    • paraguay_dncp_sample (204K)
    • paraguay_hacienda_sample (240K)
    • uganda_sample (112K)
  • Check for samples according to metadb.sqlite3 (none found):

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT sample FROM session") -eq 1' \; -print | sort
    

Delete empty directories

  • Find directories without metadb.sqlite3 (none found):

      find -mindepth 2 -type d -exec test ! -e {}/metadb.sqlite3 \; -print
    
  • Find directories with 0 or 1 files:

      find -mindepth 2 -type d -exec sh -c 'test $(ls {} | wc -l) -le 1' \; -print
    
  • Delete directories with only metadb.sqlite3 files (see the sketch after this list):

    • afghanistan_records/2019-02-07-14-32-52
    • chile_compra/2019-03-18-12-46-51
    • honduras_sefin/2018-12-24-14-51-32
    • indonesia_bandung/2019-03-13-14-38-40
    • local_load/2018-11-27-12-12-19
    • local_load/2018-11-27-12-12-58
    • mexico_jalisco/2019-03-27-14-30-24
    • paraguay_dncp/2018-11-20-14-01-38
    • australia_nsw/2019-01-16-16-51-49
    • taiwan/2019-01-15-18-28-20
    • uk_contracts_finder/2018-12-06-12-55-19
    • uruguay/2018-10-01-12-58-58
    • uruguay/2019-01-15-18-06-23
  • Look what's in the smallest directories (looks fine):

      ls $(du */* | sort -n | head | cut -f2 | xargs)
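
The deletion of the metadb.sqlite3-only directories listed above can be scripted along these lines (a sketch; it lists any source/date directory whose only entry is metadb.sqlite3, for review before removal):

    find . -mindepth 2 -type d \
      -exec sh -c '[ "$(ls -A "$1")" = metadb.sqlite3 ]' _ {} \; -print > only_metadb.txt
    # review only_metadb.txt, then:
    xargs rm -r < only_metadb.txt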
    

Find unclean collections

  • Check gather failures (none):

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT NOT(gather_success) FROM session") -eq 1' \; -print | sort
    
  • Check fetch failures (many):

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT NOT(fetch_success) FROM session") -eq 1' \; -print | sort
    
  • Look into specific failures. Some are "COULD NOT SEND TO SERVER! HTTP CODE: 500". Browsing Kingfisher Collect's code from that time shows that this error occurs if the response from Kingfisher Process is an error (so it isn't a data collection problem). Erase it:

      find . -name metadb.sqlite3 -exec sqlite3 {} $'UPDATE filestatus SET fetch_errors = \'[]\', fetch_success = 1 WHERE fetch_success = 0 AND fetch_errors = \'["Exception(\'\'COULD NOT SEND TO SERVER! HTTP CODE: 500\'\',)"]\'' \;
    
      find . -name metadb.sqlite3 -exec sqlite3 {} 'UPDATE session SET fetch_success = NOT EXISTS(SELECT 1 FROM filestatus WHERE fetch_success = 0)' \;
    
  • Look into specific failures, again:

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT EXISTS(SELECT 1 FROM filestatus WHERE fetch_success = 0)") -eq 1' \; -print -exec sqlite3 {} 'SELECT DISTINCT(fetch_errors) FROM filestatus WHERE fetch_success = 0' \;
    
  • A lot are like Request exception (Code 500): 500 Server Error: Internal Server Error for url: <URL>, with unique URLs (actual errors). Hide those to see what the others are (hard to say if all are collection errors, but we'll assume so):

      find . -name metadb.sqlite3 -exec sh -c "test \$(sqlite3 {} \"SELECT EXISTS(SELECT 1 FROM filestatus WHERE fetch_success = 0 AND fetch_errors NOT LIKE '[\\\"Request exception (Code %')\") -eq 1" \; -print -exec sqlite3 {} "SELECT DISTINCT(fetch_errors) FROM filestatus WHERE fetch_success = 0 AND fetch_errors NOT LIKE '[\"Request exception (Code %'" \;
    
    • ./australia/2018-06-19-15-05-38/metadb.sqlite3
      • ["IntegrityError('(sqlite3.IntegrityError) UNIQUE constraint failed: filestatus.filename',)"]
    • ./armenia/2018-06-19-14-21-06/metadb.sqlite3
      • ["('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))"]
    • ./moldova/2019-01-16-16-42-03/metadb.sqlite3
      • ["must be str, not bytes"]
    • ./australia_nsw/2019-01-16-16-52-07/metadb.sqlite3
      • ["must be str, not bytes"]
    • ./ukraine/2018-11-07-16-48-03/metadb.sqlite3
      • ["must be str, not bytes"]
    • ./colombia/2019-02-26-12-15-28/metadb.sqlite3
      • ["JSONDecodeError('Expecting value: line 1 column 2172 (char 2171)',)"]
    • ./paraguay_dncp/2019-01-29-10-51-51/metadb.sqlite3
      • ["Connection error", "Request exception (Code 401): 401 Client Error: Unauthorized for url: https://www.contrataciones.gov.py:443/datos/api/v2/doc/ocds/record-package/328498"]
      • ["Connection error", "Request exception (Code 401): 401 Client Error: Unauthorized for url: https://www.contrataciones.gov.py:443/datos/api/v2/doc/ocds/record-package/348237"]
  • So, we have 18 unclean collections out of about 170:

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT NOT(fetch_success) FROM session") -eq 1' \; -print | sort
    
  • We can see the proportion of files with errors:

      find . -name metadb.sqlite3 -exec sh -c 'test $(sqlite3 {} "SELECT EXISTS(SELECT 1 FROM filestatus WHERE fetch_success = 0)") -eq 1' \; -print -exec sqlite3 {} "SELECT COUNT(CASE WHEN fetch_success = 0 THEN 1 END), COUNT(*) FROM filestatus" \;
    

Apply criteria

Before starting the sheet, I had already started deleting some based on a less robust process:

mexico_administracion_publica_federal

  • keep 2018-06-14-13-58-17 (3.1G)
  • delete 2018-08-15-20-51-44 (3.2G)

uk_contracts_finder

  • keep 2018-06-07-13-17-30 (615M)
  • delete 2018-08-13-20-44-23 (662M)
  • delete 2018-08-15-18-45-48 (664M)

jpmckinney commented Aug 6, 2020

Done! See spreadsheet for determinations of what was kept/deleted.

Renamed the directories to match the new style (this prints a lot of warnings, but it succeeds):

find . -mindepth 2 -type d -exec sh -c 'var={}; dir=${var#*/}; mv {} $(echo $dir | tr "/" "_" | sed "s/\([0-9]\+\)-\([0-9]\+\)-\([0-9]\+\)-\([0-9]\+\)-\([0-9]\+\)-\([0-9]\+\)/\1\2\3-\4\5\6/" | tr "-" "_"); rmdir ${dir%%/*}' \;
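
For illustration, the command above moves each source/date directory up to the top level under a flattened name, and the warnings come from the trailing rmdir, which fails (harmlessly) on each source directory until its last crawl directory has been moved out:

    # example of the rename performed by the command above:
    #   ./afghanistan_records/2019-02-07-14-32-52  ->  afghanistan_records_20190207_143252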
