
ideas for improving retrieval through NCBI REST API #216

Closed
bluegenes opened this issue Mar 5, 2025 · 5 comments

@bluegenes
Collaborator

bluegenes commented Mar 5, 2025

After its outage last week, the NCBI REST API now discourages individual genome download queries and is blocking IP addresses (see #215). Instead, NCBI prefers that we first download a dehydrated file containing direct download links for all accessions, then fetch from those links. This is straightforward with the ncbi datasets tool, but it adds an extra step: we would need to run datasets with all accessions first, extract the fetch links, and then use them within directsketch. An ideal solution would be to get the dehydrated file/fetch links directly from the REST API, since that is what the datasets tool uses anyway.

When using datasets, the fetch links are all saved to a file. Here's an example link: https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bUNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEADqLH0Y for GCA_000175535.1.

After a little exploring, I can get a fetch.txt file inside the zipfile directly from the API like so:

curl -X GET "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCA_000175535.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=DATA_REPORT_ONLY" --output nd.zip

where the critical part seems to be include_annotation_type=SEQUENCE_REPORT, which creates the fetch.txt. The other annotation types are required to ensure those files get fetch links.
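To make the query structure explicit, here is a minimal sketch of building that GET URL programmatically. The endpoint path and parameter names are taken from the curl example above; treat the exact parameter set as an assumption that may change with API versions.

```python
from urllib.parse import urlencode

# Base endpoint from the curl example above (datasets REST API v2).
BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession"

def dehydrated_download_url(accession: str) -> str:
    # Repeated include_annotation_type keys are passed as a list of pairs;
    # SEQUENCE_REPORT is the one that appears to trigger fetch.txt creation.
    params = [
        ("include_annotation_type", "GENOME_FASTA"),
        ("include_annotation_type", "PROT_FASTA"),
        ("include_annotation_type", "SEQUENCE_REPORT"),
        ("hydrated", "DATA_REPORT_ONLY"),
    ]
    return f"{BASE}/{accession}/download?{urlencode(params)}"

url = dehydrated_download_url("GCA_000175535.1")
```

The returned URL can be fetched with curl exactly as shown above.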

After unzipping nd.zip:

├── README.md
├── md5sum.txt
├── ncbi_dataset
│   ├── data
│   │   ├── assembly_data_report.jsonl
│   │   └── dataset_catalog.json
│   └── fetch.txt
└── nd.zip

and the fetch.txt file looks like this:

https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bUNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEADqLH0Y       0       data/GCA_000175535.1/GCA_000175535.1_ASM17553v1_genomic.fna
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/QXNzZW1ibHlEYXRhc2V0SW50ZXJuYWwuR2V0U2VxdWVuY2VSZXBvcnQ/eNrj4nd3dow3MDAwNDc1NTbVMwQAHicDbQ        0       data/GCA_000175535.1/sequence_report.jsonl

Here, the protein FASTA doesn't exist, so we didn't get a fetch link for it. Note, however, that there is a protein FASTA for the GCF version of this accession (see https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/175/535/GCF_000175535.1_ASM17553v1/).
If we make the request with the GCF version of the accession, the fetch file is complete:

https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bTNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEAD5uH1U       0       data/GCF_000175535.1/GCF_000175535.1_ASM17553v1_genomic.fna
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bTNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMUXFOWXpGbm6aUlJuqlVxkwWjACAD_nH2c   0       data/GCF_000175535.1/protein.faa
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/QXNzZW1ibHlEYXRhc2V0SW50ZXJuYWwuR2V0U2VxdWVuY2VSZXBvcnQ/eNrj4nd3dos3MDAwNDc1NTbVMwQAHmgDcg        0       data/GCF_000175535.1/sequence_report.jsonl

This matches existing behavior; see #129.

There is also a POST for genome/download that can use include_annotation_type=SEQUENCE_REPORT, so that seems like an option for downloading many fetch links at once.

I would hope they would again allow 10 simultaneous downloads of genome data using these direct fetch links, but I'm not certain about that.

An approach to try:

  • use POST for genome/download to get fetch links and information on the genomic and protein FASTA files for all accessions. This should also give us md5sums.
  • use download_with_retry to fetch the FASTA files from the fetch links, limiting to n simultaneous downloads. Investigate the limit.
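The second step above amounts to concurrency-limited downloading. A minimal sketch using an asyncio semaphore to cap simultaneous downloads; fetch_one is a hypothetical stand-in for the real download_with_retry call and only simulates network work here:

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Placeholder for the actual HTTP download (download_with_retry).
    await asyncio.sleep(0.01)
    return url

async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    sem = asyncio.Semaphore(limit)  # cap on simultaneous downloads

    async def guarded(url: str) -> str:
        async with sem:
            return await fetch_one(url)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(u) for u in urls))

results = asyncio.run(fetch_all([f"link-{i}" for i in range(25)], limit=10))
```

Tuning `limit` (10? 30?) is exactly the open question about NCBI's rate limits.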
@bluegenes
Collaborator Author

Note that when trying test downloads for the above, downloading the dehydrated file failed about 2/3 of the time. I wasn't using an API key, but it was also only a single dehydrated-file download. We need to assess failure rates with this approach to see whether it would be better to switch to EBI downloads.

@ccbaumler

According to NCBI personnel,

  • Large downloads from NCBI should be submitted through the --inputfile arg. This is easy, because that is what we are already passing into directsketch.
  • This method using datasets allowed ~300,000 accessions to be downloaded in ~6 min. (And then it broke...)
  • When following up with the rehydrate command on the unzipped directory, you can set --max-workers (which defaults to 10) up to the max limit of 30. From the conversation with the maintainer: "No, rate-limits are not applied to rehydration." I set my API key in the command anyway, but you can use all the workers without worrying about rate limits.

See ncbi/datasets#455 for more details and the commands I used.

@bluegenes
Collaborator Author

bluegenes commented Mar 10, 2025

Actually, given that rate limits are not applied to rehydration, and given the md5sum issues encountered in #222, it might be good to bring back the original ftp_path download method. This was previously slow because I was limiting to 3 simultaneous downloads, but that limit seems unnecessary...

#222 would be better because we can get all md5sums at once, IF we can properly download md5sums for gzipped files.

@bluegenes
Collaborator Author

bluegenes commented Mar 12, 2025

For testing/context, here is the CLI version of the POST command to get md5sum.txt and data/fetch.txt:

curl -X POST https://api.ncbi.nlm.nih.gov/datasets/v2/genome/download \
    -H "Content-Type: application/json" \
    -d '{
        "accessions": ["GCF_000175535.1", "GCF_000175536.1"],
        "include_annotation_type": ["GENOME_FASTA","PROT_FASTA"],
        "hydrated": "DATA_REPORT_ONLY",
        "api_key": ###
    }' -v --output nd.zip

(api_key optional)

I'm not experiencing any failures when testing the POST command. However, the md5sum.txt contains md5sums for uncompressed FASTA files, and fetch.txt links download gzipped files by default.
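The mismatch described above means the downloaded .gz must be decompressed before its checksum can be compared against md5sum.txt. A small illustration (sample FASTA bytes are made up for the example):

```python
import gzip
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# md5sum.txt lists checksums for the *uncompressed* FASTA, while the
# fetch links serve gzipped files, so the raw download won't match.
fasta = b">seq1\nACGTACGT\n"
gz = gzip.compress(fasta)

checksum_of_download = md5_hex(gz)                    # what we'd hash naively
checksum_expected = md5_hex(fasta)                    # what md5sum.txt records
checksum_after_decompress = md5_hex(gzip.decompress(gz))
```

Verification therefore needs either a decompress-then-hash step, or a separate source of checksums for the gzipped files.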

@bluegenes
Collaborator Author

handled by #222
