
ideas for improving retrieval through NCBI REST API #216

Closed
bluegenes opened this issue Mar 5, 2025 · 5 comments

@bluegenes
Collaborator

bluegenes commented Mar 5, 2025

After its outage last week, the NCBI REST API now discourages individual genome download queries and is blocking IP addresses (see #215). Instead, NCBI prefers that we first download a dehydrated file containing direct download links for all accessions, then fetch from those links. This is straightforward with the ncbi datasets tool, but it adds an extra step: we would need to run datasets with all accessions first, extract the fetch links, and then use them within directsketch. An ideal solution would be to get the dehydrated file/fetch links directly from the REST API, since that is what the datasets tool uses anyway.

When using datasets, the fetch links are all saved to a file. Here's an example link: https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bUNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEADqLH0Y for GCA_000175535.1.

After a little exploring, I can get a fetch.txt file inside the zipfile directly from the API like so:

curl -X GET "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCA_000175535.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=DATA_REPORT_ONLY" --output nd.zip

where the critical part seems to be include_annotation_type=SEQUENCE_REPORT, which creates the fetch.txt. The other annotation types are required to ensure those files get fetch links.
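To make the query structure explicit, here is a minimal sketch of building that GET URL programmatically. The endpoint path and parameter names are taken from the curl example above; treat the exact parameter set as an assumption that may change with API versions.

```python
from urllib.parse import urlencode

# Base endpoint from the curl example above (datasets REST API v2).
BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession"

def dehydrated_download_url(accession: str) -> str:
    # Repeated include_annotation_type keys are passed as a list of pairs;
    # SEQUENCE_REPORT is the one that appears to trigger fetch.txt creation.
    params = [
        ("include_annotation_type", "GENOME_FASTA"),
        ("include_annotation_type", "PROT_FASTA"),
        ("include_annotation_type", "SEQUENCE_REPORT"),
        ("hydrated", "DATA_REPORT_ONLY"),
    ]
    return f"{BASE}/{accession}/download?{urlencode(params)}"

url = dehydrated_download_url("GCA_000175535.1")
```

The returned URL can be fetched with curl exactly as shown above.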

After unzipping nd.zip:

├── README.md
├── md5sum.txt
├── ncbi_dataset
│   ├── data
│   │   ├── assembly_data_report.jsonl
│   │   └── dataset_catalog.json
│   └── fetch.txt
└── nd.zip

and the fetch.txt file looks like this:

https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bUNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEADqLH0Y       0       data/GCA_000175535.1/GCA_000175535.1_ASM17553v1_genomic.fna
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/QXNzZW1ibHlEYXRhc2V0SW50ZXJuYWwuR2V0U2VxdWVuY2VSZXBvcnQ/eNrj4nd3dow3MDAwNDc1NTbVMwQAHicDbQ        0       data/GCA_000175535.1/sequence_report.jsonl

Here, the protein FASTA doesn't exist, so we didn't get a fetch link for it. Note, however, that there is a protein FASTA for the GCF version of this accession (see https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/175/535/GCF_000175535.1_ASM17553v1/).
If we make the request with the GCF version of the accession, the fetch file is complete:

https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bTNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEAD5uH1U       0       data/GCF_000175535.1/GCF_000175535.1_ASM17553v1_genomic.fna
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bTNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMUXFOWXpGbm6aUlJuqlVxkwWjACAD_nH2c   0       data/GCF_000175535.1/protein.faa
https://api.ncbi.nlm.nih.gov/datasets/fetch_h/QXNzZW1ibHlEYXRhc2V0SW50ZXJuYWwuR2V0U2VxdWVuY2VSZXBvcnQ/eNrj4nd3dos3MDAwNDc1NTbVMwQAHmgDcg        0       data/GCF_000175535.1/sequence_report.jsonl

This matches existing behavior; see #129.

There is also a POST for genome/download that can use include_annotation_type=SEQUENCE_REPORT, so that seems like an option for downloading many fetch links at once.

I would hope they would again allow 10 simultaneous downloads of genome data using these direct fetch links, but I'm not certain about that.

An approach to try:

  • use POST for genome/download to get fetch links and information on the genomic and protein FASTA files for all accessions. This should also give us md5sums.
  • use download_with_retry to fetch the FASTA files from the fetch links, limiting to n simultaneous downloads. Investigate the limit.
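The second step above amounts to concurrency-limited downloading. A minimal sketch using an asyncio semaphore to cap simultaneous downloads; fetch_one is a hypothetical stand-in for the real download_with_retry call and only simulates network work here:

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Placeholder for the actual HTTP download (download_with_retry).
    await asyncio.sleep(0.01)
    return url

async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    sem = asyncio.Semaphore(limit)  # cap on simultaneous downloads

    async def guarded(url: str) -> str:
        async with sem:
            return await fetch_one(url)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(u) for u in urls))

results = asyncio.run(fetch_all([f"link-{i}" for i in range(25)], limit=10))
```

Tuning `limit` (10? 30?) is exactly the open question about NCBI's rate limits.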
@bluegenes
Collaborator Author

Note that when trying test downloads for the above, downloading the dehydrated file failed about 2/3 of the time. I wasn't using an API key, but it was also only a single dehydrated-file download. We need to assess failure rates with this approach to see whether it would be better to switch to EBI downloads.

@ccbaumler

According to NCBI personnel,

  • Large downloads from NCBI should be submitted through the --inputfile arg. This is easy, because that is what we are already passing into directsketch.
  • This method using datasets allowed ~300,000 accessions to be downloaded in ~6 min. (And then it broke...)
  • When following up with the rehydrate command on the unzipped directory, you can set --max-workers (which defaults to 10) up to the max limit of 30. From the conversation with the maintainer: "No, rate-limits are not applied to rehydration." I set my API key in the command anyway, but you can use all the workers without worrying about rate limits.

See ncbi/datasets#455 for more details and the commands I used.

@bluegenes
Collaborator Author

bluegenes commented Mar 10, 2025

Actually, given that rate limits are not applied to rehydration, and given the md5sum issues encountered in #222, it might be good to bring back the original ftp_path download method. This was previously slow because I was limiting to 3 simultaneous downloads, but that limit seems unnecessary...

#222 would be better because we can get all md5sums at once, IF we can properly download md5sums for gzipped files.

@bluegenes
Collaborator Author

bluegenes commented Mar 12, 2025

For testing/context, here is the CLI version of the POST command to get md5sum.txt and data/fetch.txt:

curl -X POST https://api.ncbi.nlm.nih.gov/datasets/v2/genome/download \
    -H "Content-Type: application/json" \
    -d '{
        "accessions": ["GCF_000175535.1", "GCF_000175536.1"],
        "include_annotation_type": ["GENOME_FASTA","PROT_FASTA"],
        "hydrated": "DATA_REPORT_ONLY",
        "api_key": ###
    }' -v --output nd.zip

(api_key optional)

I'm not experiencing any failures when testing the POST command. However, the md5sum.txt contains md5sums for uncompressed FASTA files, and fetch.txt links download gzipped files by default.
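The mismatch described above means the downloaded .gz must be decompressed before its checksum can be compared against md5sum.txt. A small illustration (sample FASTA bytes are made up for the example):

```python
import gzip
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# md5sum.txt lists checksums for the *uncompressed* FASTA, while the
# fetch links serve gzipped files, so the raw download won't match.
fasta = b">seq1\nACGTACGT\n"
gz = gzip.compress(fasta)

checksum_of_download = md5_hex(gz)                    # what we'd hash naively
checksum_expected = md5_hex(fasta)                    # what md5sum.txt records
checksum_after_decompress = md5_hex(gzip.decompress(gz))
```

Verification therefore needs either a decompress-then-hash step, or a separate source of checksums for the gzipped files.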

@bluegenes
Collaborator Author

handled by #222
