Kingfisher Archive: Disk clean up #150
I've reviewed the contents of /home/archive, as I believe I created those directories. It's hard to say to what degree we 'need' the data in those directories. I don't see any sign that it's in regular use as part of anyone's job right now; however, the purpose of the archive was to provide some degree of an historical record of data, in case we ever wanted to do change-over-time analysis, or suspected that records had been altered.

The reason that /home/archive/ocdsdata_archive is so much smaller is that I compressed the directories massively, so that the data wasn't lost but took up a trivially small amount of space. The archive server isn't just a backup server, after all.

@jpmckinney I'm happy to remove any of the data in /home/archive if you're happy that we will never want to do that kind of analysis. Otherwise, I'd be happy to compress all the data there in the same way, and leave an appropriate README to explain its purpose. It's likely to be small enough to handle on a personal machine once compressed - in the order of 10GB or so - so it could be handled manually when we migrate away from the archive server in time.

I don't think we need any of the other files, either, but I'll check this with others this week.
I think that's true as well. But, again, I'll check with others this week.
@robredpath Can you explain the differences (e.g. in terms of provenance or otherwise) between ocdsdata_archive, ocdskfs-old-data and ocdskingfisher-user-data? I also see some overlap between ocdskingfisher-user-data and the actively-updated
Late Nov 2019 was the server crash when all our hard disks went. So I think these are backups related to that and can now be just removed.
I'm not sure what these are. There was an attempt to set up replication at one point that was dropped; these could be related to that. But in any case, I think we can say remove them.
These are all from the first version of Kingfisher, before the rewrite that separated scrape and process. (The metadb.sqlite files on disk are the giveaway.) The first one contains compressed files; the others do not.
Just because both have afghanistan_records doesn't mean there is overlap. The same source may have been run multiple times. Looking at the dates for each afghanistan_records, one runs 2018-10-03 to 2019-02-07 and the other runs 2019-02-07 to 2020-01-23. The ones on 2019-02-07 have different times, oddly.
Let me comment on that shortly
To build on Rob's comment,
Because of this I would suggest this isn't the place to make any decisions - I'd suggest
Removing specific publishers entirely, removing old data sets, or removing some data sets when we have many for the same publisher are all issues that will apply to our new system too, and they're also issues where the analysts can help us clarify user needs - so I'd move any such considerations to a future conversation around our new system.
We've had those discussions: #153, #154. The data retention policy is here: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-process.html#data-retention-policy

Now, to apply the policy, I need to know more about the provenance of:
Okay, but how are they different? Are they different in terms of provenance (one old directory versus another)? In terms of time (one copied over at one time versus another)? Can nothing be inferred from the directory names?
So, I have:
It looks broadly like the directories follow on from each other - I think, therefore, that we've got enough to meaningfully archive this data. We know the name of the source and the date of creation for each directory of data. Some of it will certainly require different handling because it came from a different system, but that's something we can document.

@jpmckinney @odscjames shall we, then, run through each of these directories, and:
I wouldn't remember enough to be able to answer with certainty, I'm sorry.
Thanks, that's good to read.

In terms of the 3 clauses (Clean, Complete, Periodic): the first 2 are hard to evaluate just from the data. In terms of Clean, some errors may be things the analysts note as "bad data" in their report while still continuing to review the rest of the data. Other errors may mean that analysts write a whole collection off. We can't know just from what we have on disk. This is true of current runs we save, too. Complete is also hard to tell, though we can at least take out the sample ones easily.

I'd maybe suggest putting this on hold until we do open-contracting-archive/kingfisher-archive#10 - which I think we want to talk about next anyway. That seems better anyway (if we deactivate the archive server THEN switch to S3, there is a period in the middle with no active archive), and in working out and agreeing on a plan for S3 I suspect we would get into some issues (like the above) that might help resolve this one.
Yes, these would have to be supported by the Scrapyd logs (see #153 (comment) and #135), and in any case rely on either human input or on "good-enough" automatic criteria. I'll add clarification to the "Clean" point, which is about errors in data collection, not in the data itself - so it's not an issue of "bad data".

I don't understand why this would be put on hold in favor of open-contracting-archive/kingfisher-archive#10. It's better to take this opportunity to figure out a good process for identifying the data to archive than to start sending data to S3 without having first figured that out...

Anyway, I think I have all the information I'll get about the existing data, so I can make progress on this issue and #153 (comment)
Cleanup log (will be continuously updated).

Preparation
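(The preparation steps themselves aren't listed here. Purely as a sketch - not the commands actually run - a first pass to see how much data each directory holds could be:)

```bash
# Sketch only: measure the size of each top-level directory under /home/archive.
du -sh /home/archive/*/ | sort -h
```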
Decompress archives
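(The decompression commands aren't shown here. A minimal sketch, assuming the archives are .tar.gz files - the format is an assumption, and this is not the exact command used:)

```bash
# Sketch only: extract each .tar.gz next to itself, then remove the archive file.
find /home/archive -name '*.tar.gz' -print0 |
while IFS= read -r -d '' archive; do
  tar -xzf "$archive" -C "$(dirname "$archive")" && rm "$archive"
done
```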
Look for unusual files
In
Similarly,

Remove duplicate directories

I stored the output of
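(The comparison itself isn't reproduced here. As a hedged sketch of how a suspected duplicate between two trees can be confirmed - the source name below is hypothetical:)

```bash
# Sketch only, hypothetical source name: report whether two directory trees differ anywhere.
diff -rq ocdskfs-old-data/some_source ocdskingfisher-user-data/some_source \
  && echo "identical" || echo "different"
```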
Merge the directories

Compare the time spans of each directory:
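(The listing isn't reproduced here. One way to get the span of each directory from file modification times, as a sketch only:)

```bash
# Sketch only: print the oldest and newest file modification date in each directory.
for d in ocdsdata_archive ocdskingfisher-user-data ocdskfs-old-data; do
  printf '%s: ' "$d"
  find "$d" -type f -printf '%TY-%Tm-%Td\n' | sort | sed -n '1p;$p' | paste -sd ' ' -
done
```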
The time spans are non-overlapping. We can merge the directories into one hierarchy:

mkdir merge
cp -al ocdsdata_archive/* ocdskingfisher-user-data/* ocdskfs-old-data/* merge
rm -rf ocdsdata_archive ocdskingfisher-user-data ocdskfs-old-data
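(As a note on that command: cp -al copies by creating hard links rather than duplicating file contents, so the merge takes almost no extra disk space, and the rm -rf of the three source directories afterwards leaves the linked data intact under merge/.)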
Apply criteria

Delete samples
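(The exact command isn't shown here. A sketch, assuming sample collections are directories whose names end in _sample - the naming convention is an assumption:)

```bash
# Sketch only: remove directories whose names end in _sample.
find merge -maxdepth 2 -type d -name '*_sample' -prune -exec rm -rf {} +
```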
Delete empty directories
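(As a sketch, the usual GNU find idiom for this step:)

```bash
# Sketch only: delete empty directories (GNU find's -delete implies depth-first traversal,
# so nested empty directories are removed bottom-up in one pass).
find merge -type d -empty -delete
```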
Find unclean collections
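(How these were identified isn't recorded here; the discussion earlier points at the Scrapyd logs. As a sketch only - the log location and format are assumptions, and logs may not exist for the oldest data:)

```bash
# Sketch only: flag crawls whose Scrapy logs contain ERROR lines, for manual review.
# The log path below is an assumption.
for log in /home/archive/scrapyd-logs/*/*.log; do
  errors=$(grep -c ' ERROR' "$log")
  [ "$errors" -gt 0 ] && echo "$log: $errors error lines"
done
```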
Apply criteria
Before starting the sheet, I had already started deleting some based on a less robust process:

mexico_administracion_publica_federal
uk_contracts_finder
Done! See spreadsheet for determinations of what was kept/deleted. Renamed the directories to match the new style (this makes a lot of warnings, but it succeeds):
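(The rename command itself isn't shown here. Purely as a hypothetical sketch - the old and new naming patterns below are both assumptions, and this is not the command that was run:)

```bash
# Hypothetical sketch only: move old-style directories like some_source-2019-02-07
# into a per-source layout like some_source/20190207_000000. Both patterns are assumed.
for d in merge/*-????-??-??; do
  [ -d "$d" ] || continue
  base=${d##*/}                 # some_source-2019-02-07
  source=${base%-????-??-??}    # some_source
  date=${base#"$source"-}       # 2019-02-07
  mkdir -p "merge/$source"
  mv "$d" "merge/$source/${date//-/}_000000"
done
```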
Do we still need any of these?
I assume the analysis and ocdskfp home directories only exist from when Kingfisher was deployed to this server, and can be deleted.