Include md5sum in JSON or as other output #302

Closed
aboffin opened this issue Jan 4, 2024 · 3 comments

aboffin commented Jan 4, 2024

Hi,

Thank you for your team's commendable work on datasets, which finally provides a single, comprehensive way to download data from NCBI. Previously one had to resort to a multitude of EUtils/Perl/Python scripts that output something almost, but not quite, entirely unlike what we wanted. However, reliability seems to be an issue, as with other tools.

Is there a way to check the integrity of the downloads? In the typical example that is given, this information does not exist:

./datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
unzip human_GRCh38_dataset.zip -d GRCh38
./datasets rehydrate --directory GRCh38

cd GRCh38/ncbi_dataset/data
grep md5 *json
# outputs nothing

I am perplexed that such a simple integrity mechanism as a checksum was not provided, considering that networks do fail and partial downloads may lead to confusion at best, and incorrect results at worst, when such genomes are used for further analyses.

I see that issue #206 raised the same question but it was closed without any definitive answer regarding md5sum.

Contributor

olearyna commented Jan 5, 2024

Hi aboffin,

Thanks for highlighting this issue. I understand this is an important feature. The NCBI Datasets team is actively exploring the implementation of a checksum mechanism. I'll leave this issue open until it is addressed.

All the best,
Nuala

Nuala A. O'Leary, PhD
Product Owner, NCBI Datasets
National Center for Biotechnology Information, NLM, NIH, DHHS

@ericcox1
Collaborator

Hi @aboffin,

We have added MD5 checksum files to our data packages, including dehydrated packages, as of October 2024.

For your example, you could download the data and then validate the downloaded files as follows:

# Download a dehydrated data package
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip

# Unzip the downloaded package
unzip human_GRCh38_dataset.zip -d GRCh38

# Rehydrate the package to download the genomic sequence file
datasets rehydrate --directory GRCh38

# Change your working directory to the directory containing the extracted archive
cd GRCh38

# Use the Linux tool md5sum to calculate the checksums for each file and compare them to the MD5 hash values in md5sum.txt
md5sum -c md5sum.txt
ncbi_dataset/data/assembly_data_report.jsonl: OK
ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna: OK
ncbi_dataset/data/dataset_catalog.json: OK

The text "OK" at the end of each line of output indicates that the calculated MD5 hash values match the hash values included in the file md5sum.txt.
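For a package with many files, it can be easier to see only the failures. A minimal sketch, assuming GNU coreutils `md5sum` (whose `--quiet` option suppresses the per-file "OK" lines); the function name `report_checksum_failures` is hypothetical and not part of datasets:

```shell
# Print only the entries of md5sum.txt that fail verification.
# The body runs in a subshell, so the caller's working directory is unchanged.
# The exit status is that of md5sum: zero only if every file verified.
report_checksum_failures() (
    cd "$1" || exit 1
    md5sum -c --quiet md5sum.txt
)
```

Calling `report_checksum_failures GRCh38` should print nothing when every file matches, and one `FAILED` line per mismatching file otherwise.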

We have more information about this in our documentation: User-initiated validation using the MD5 checksum file

Thanks again for opening this issue.

Best,
Eric

@fgvieira

@ericcox1

Thanks for including the MD5 checksums, but do they also work with --gzip outputs?
In my case, the md5sum.txt file has checksums, but only for the uncompressed files.
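One possible workaround is to decompress each file to a pipe and compare the hash of the uncompressed stream against md5sum.txt. This is only a sketch: it assumes that each gzipped file sits at the path listed in md5sum.txt with a `.gz` suffix appended, which is a guess about the --gzip layout rather than documented behavior, and the function name `verify_gzipped` is hypothetical:

```shell
# Check gzipped files against checksums recorded for the uncompressed data.
# Usage: verify_gzipped DIR, where DIR contains md5sum.txt.
# The body runs in a subshell, so "exit" only sets the function's status.
verify_gzipped() (
    status=0
    while read -r expected path; do
        path="${path#\*}"               # drop md5sum's binary-mode marker, if any
        if [ -f "$1/$path.gz" ]; then   # assumption: compressed copy is PATH.gz
            actual=$(gzip -dc "$1/$path.gz" | md5sum | cut -d' ' -f1)
            if [ "$actual" = "$expected" ]; then
                echo "$path.gz: OK"
            else
                echo "$path.gz: FAILED"
                status=1
            fi
        fi
    done < "$1/md5sum.txt"
    exit "$status"
)
```

Entries without a matching `.gz` file are skipped silently, so they can still be checked with `md5sum -c` as before.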

Ideally, datasets would verify the checksum automatically after each download and retry if it did not match (perhaps enabled by a command-line option, since it can take a bit of time). When downloading a lot of genomes, sometimes 5 or 6 fail. If I then run rehydrate again, it does not do anything (I guess because the files are already there), so I have to delete the failed files manually and run rehydrate again.
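Until something like that exists, the manual step can be scripted. A rough sketch, assuming GNU `md5sum` output of the form `path: FAILED` and file names without newlines or backslashes; the function name `remove_failed_files` is hypothetical:

```shell
# Delete every file whose checksum does not match md5sum.txt, so that a
# later "datasets rehydrate" run fetches it again.
# Usage: remove_failed_files DIR, where DIR contains md5sum.txt.
remove_failed_files() (
    cd "$1" || exit 1
    # LC_ALL=C keeps md5sum's status words in English so "FAILED" can be matched.
    # Missing files are reported as "FAILED open or read" and are skipped here.
    LC_ALL=C md5sum -c md5sum.txt 2>/dev/null |
        sed -n 's/: FAILED$//p' |
        while IFS= read -r path; do
            echo "removing $path"
            rm -f -- "$path"
        done
)
```

With the package from the example above, this would be `remove_failed_files GRCh38` followed by `datasets rehydrate --directory GRCh38`, repeated until `md5sum -c md5sum.txt` reports no failures.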
