ideas for improving retrieval through NCBI REST API #216
Comments
Note that in trying the test downloads for the above, downloading the dehydrated file was failing about 2/3 of the time. I wasn't using an API key, but it was also a single dehydrated file download. We need to assess failures with the above approach to see whether it would be better to switch to EBI downloads.
According to NCBI personnel,
See the issue for more details and the commands that I used: ncbi/datasets#455
Actually, given that rate limits are not applied to rehydration, and given the md5sum issues encountered in #222, this approach might be better because we can get all md5sums at once, IF we can properly download md5sums for gzipped files.
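One way to pin down the gzipped-vs-decompressed md5sum ambiguity is to compute both digests locally and see which one matches NCBI's published checksum. A minimal sketch (helper names are mine, not from directsketch):

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """md5 of the file's raw bytes (i.e. the .gz as stored on disk)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def md5_of_decompressed(path, chunk_size=1 << 20):
    """md5 of the decompressed contents of a .gz file."""
    h = hashlib.md5()
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing both values against the published md5sum would tell us which form NCBI hashes (note the raw-file digest is not stable across re-compressions, since gzip embeds an mtime).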
For testing/context, here is the CLI version of the POST command to get
(api_key optional) I'm not experiencing any failures when testing the POST command. However, the
handled by #222 |
After going down last week, the NCBI REST API is now discouraging individual genome download queries and blocking IP addresses (see #215). Instead, they would prefer that we first download a dehydrated file containing direct download links for all accessions, then fetch from those links. This is straightforward using the ncbi datasets tool, but it adds an extra step: we would need to run that tool with all accessions first, extract the fetch links, and then use them within directsketch. An ideal solution would be to get the dehydrated file/fetch links directly from the REST API, since that is what the datasets tool is using anyway.
When using `datasets`, the fetch links are all saved to a file. Here's an example link for GCA_000175535.1: https://api.ncbi.nlm.nih.gov/datasets/fetch_h/R2V0UmVtb3RlRGF0YWZpbGU/eNqTyuXKzktOytROKymw0tdPT83Lz00t1k_MydF3d3bUNzAw0Dc0N9U3NTYF8eOBfCAXyNMzjHcM9gWzywzxSMWDTcxM1kvLS9RLrzJgtGAEADqLH0Y

After a little exploring, I can get a `fetch.txt` file inside of the zipfile directly from the API. The critical part seems to be `include_annotation_type=SEQUENCE_REPORT` to create the `fetch.txt`; the other annotation types are required to ensure those files have fetch links. After unzipping
`nd.zip`, the zipfile contains the `fetch.txt` with the direct download links.

There is also a `POST` for genome/download that can use `include_annotation_type=SEQUENCE_REPORT`, so that seems like an option for downloading many fetch links at once. I would hope they would again allow 10 simultaneous downloads of genome data using these direct fetch links, but I'm also not certain about that.
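For reference, the batched request body could be assembled like this. This is a hedged sketch: the endpoint URL and the field names (`hydrated`, `include_annotation_type`) are my reading of the datasets REST docs and should be double-checked against the live service:

```python
import json

# Assumed endpoint -- not verified against the live service:
NCBI_DOWNLOAD_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/download"


def build_download_payload(accessions):
    """Build the JSON body for one batched genome/download POST.

    SEQUENCE_REPORT in include_annotation_type is what makes the returned
    zip contain fetch.txt; "hydrated": "DATA_REPORT_ONLY" asks for the
    dehydrated (links-only) package. Both field names are assumptions.
    """
    return {
        "accessions": list(accessions),
        "include_annotation_type": ["SEQUENCE_REPORT"],
        "hydrated": "DATA_REPORT_ONLY",
    }


# Sending it would look roughly like (requires network; API key optional):
#   req = urllib.request.Request(
#       NCBI_DOWNLOAD_URL,
#       data=json.dumps(build_download_payload(accs)).encode(),
#       headers={"Content-Type": "application/json"},
#   )
```

One payload per batch of accessions would replace many per-genome GET queries, which is exactly what NCBI is asking for here.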
An approach to try: use `download_with_retry` to fetch the fasta files from the fetch links, limiting to `n` simultaneous downloads. Investigate the limit.
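The approach above could be prototyped as follows. This is a sketch under assumptions: the tab-separated url/size/path layout of `fetch.txt` is assumed (the real layout should be checked against an unzipped package), and `fetch` is a placeholder for the actual HTTP download:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_fetch_txt(text):
    """Parse fetch.txt, assuming tab-separated lines: <url>\\t<size>\\t<local path>."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            url, size, path = line.split("\t")
            entries.append((url, int(size), path))
    return entries


def download_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)


def download_all(fetch, urls, n=3, retries=3, backoff=1.0):
    """Fetch every url, limiting to n simultaneous downloads."""
    results = {}
    with ThreadPoolExecutor(max_workers=n) as pool:
        futs = {pool.submit(download_with_retry, fetch, u, retries, backoff): u
                for u in urls}
        for fut in as_completed(futs):
            results[futs[fut]] = fut.result()
    return results
```

Starting with a small `n` (e.g. 3) and raising it while watching for 429s/blocks would be one way to find the limit NCBI tolerates on the fetch links.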