
Error for viral-neighbors DB download #87

Open · jzrapp opened this issue Jan 20, 2022 · 5 comments

jzrapp commented Jan 20, 2022

Hi, I'm trying to download the viral-neighbors DB from NCBI, but the download halts with the message below. Could you help with that?

krakenuniq-download --db $DBDIR viral-neighbors  --threads 10
Downloading viral neighbors.
$DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.00 GB]
$DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.24 GB]
Reading names file ...
Downloading {"header":{"type":"esearch sequences into DBDIR/library/viral/Neighbors.
Use of uninitialized value $querykey in concatenation (.) or string at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download line 570.
Use of uninitialized value $webenv in concatenation (.) or string at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download line 570.
query_key=&webenv=
Argument "{"header":{"type":"esearch" isn't numeric in numeric lt (<) at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download

Thanks a lot!

clescoat commented Apr 29, 2022

TL;DR: I attached the modified script krakenuniq-download.

The problem seems to be with the format of the JSON file returned by NCBI esearch. The krakenuniq-download script needs three pieces of information from it: the number of returned hits ($n_res), and the $querykey and $webenv parameters used to download the results with NCBI efetch.
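For context, these are the standard E-utilities history parameters: esearch is called with usehistory=y, and the returned query_key/WebEnv pair is then passed to efetch together with retstart/retmax to page through the results. The efetch request looks roughly like this (illustrative URL, not copied from the script):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=MCID_...&retstart=0&retmax=10000&rettype=fasta&retmode=text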

The grep command used to get the three parameters doesn't work, hence:
$n_res ends up as "{"header":{"type":"esearch", and
$querykey and $webenv stay uninitialized.

So, I modified the script to get the right information. Here are the modified lines of code. I have absolutely no experience with Perl, so my hack may seem clumsy; if you have a more elegant solution, please correct it!

my $n_res=`grep -Po '"count":"[0-9]+' $esearch_file | head -1 | grep -o '[[:digit:]]*'`; 
chomp $n_res;

print STDERR "Downloading $n_res sequences into $nbr_dir.\n"; #not modified
return if (!defined $n_res || $n_res eq 0); #not modified

# Step 2: Download FASTAs (or other formats), 10k at a time
my $querykey=`grep -Po '"querykey":"[[:digit:]]*"' $esearch_file | grep -o "[[:digit:]]*"`;
chomp $querykey;

my $webenv =`grep -Po '(?<=webenv":")(.*)(?=","idlist)' $esearch_file`;
chomp $webenv;

The esearch_file looks like this:

{"header":{"type":"esearch","version":"0.3"},"esearchresult":{"count":"191228","retmax":"1","retstart":"0","querykey":"1","webenv":"MCID_62698c86771ad064f637512f","idlist":["2131322834"],"translationset":[{"from":"viruses[organism]","to":""Viruses"[Organism]"},{"from":"cellular organisms[orgn]","to":""cellular organisms"[Organism]"}],"translationstack":[{"term":""Viruses"[Organism]","field":"Organism","count":"8537507","explode":"Y"},{"term":""cellular organisms"[Organism]","field":"Organism","count":"440430254","explode":"Y"},"NOT",{"term":"wgs[prop]","field":"prop","count":"114077876","explode":"N"},"NOT",{"term":"gbdiv syn[prop]","field":"prop","count":"565473","explode":"N"},"NOT",{"term":"nuccore genome samespecies[filter]","field":"filter","count":"233336","explode":"N"},{"term":"complete[title]","field":"title","count":"9205767","explode":"N"},"AND",{"term":"unverified[title]","field":"title","count":"425833","explode":"N"},"NOT","GROUP","AND"],"querytranslation":""Viruses"[Organism] NOT "cellular organisms"[Organism] NOT wgs[prop] NOT gbdiv syn[prop] AND (nuccore genome samespecies[filter] AND complete[title] NOT unverified[title])","warninglist":{"phrasesignored":["and"],"quotedphrasesnotfound":[],"outputmessages":[]}}}

Here are my logs with this correction:

(env) $ krakenuniq-download --threads 10 --db DBDIR refseq/viral/Any viral-neighbors

Downloading assembly summary file for viral genomes, and filtering to assembly level Any.
WARNING: DBDIR/library/viral already exists - potentially overwriting files.
 Downloading viral genomes:  11854/11854 ...   Found 11854 files. Skipped download of 11854 files that already existed.
Downloading viral neighbors.
Not fetching https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz - file DBDIR/taxonomy/nucl_gb.accession2taxid.gz exists.
DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.03 GB]
DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.32 GB]
Reading names file ...
Downloading 191228 sequences into DBDIR/library/viral/Neighbors.
query_key=1&webenv=MCID_626a95838302021f8e7861e6
  Downloading sequences 1 to 10000 of 191228 ...
 done 
  Downloading sequences 10001 to 20000 of 191228 ...
 done 
  Downloading sequences 30001 to 40000 of 191228 ...
 done 
  Downloading sequences 70001 to 80000 of 191228 ...
 done 
  Downloading sequences 150001 to 160000 of 191228 ...
 done 
(env)$ 

But wait, there's more! We are downloading FASTA files in 10k batches, but I didn't understand why we jumped from batch 10k -> 20k straight to 30k -> 40k etc., so I modified the following line as well:

Before (adds 10000, 20000, 40000, 80000, ... at the end of each loop, so the offset doubles and batches get skipped):
$retstart += $curr_retmax;

After (adds 10000 at the end of each loop):
$retstart += $retmax;
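With that change, the paging loop behaves like this (simplified sketch of the logic, not the literal script; $retmax is the fixed batch size of 10000):

my $retmax   = 10000;
my $retstart = 0;
while ($retstart < $n_res) {
    # The last batch may be partial.
    my $curr_retmax = $retstart + $retmax > $n_res ? $n_res - $retstart : $retmax;
    print STDERR "  Downloading sequences ", $retstart + 1, " to ", $retstart + $curr_retmax, " of $n_res ...\n";
    # ... efetch call with query_key=$querykey, WebEnv=$webenv,
    #     retstart=$retstart, retmax=$curr_retmax goes here ...
    $retstart += $retmax;   # fixed step, so no batch is skipped
}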

And the resulting logs:

Downloading assembly summary file for viral genomes, and filtering to assembly level Any.
WARNING: DBDIR/library/viral already exists - potentially overwriting files.
 Downloading viral genomes:  11854/11854 ...   Found 11854 files. Skipped download of 11854 files that already existed.
Downloading viral neighbors.
Not fetching https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz - file DBDIR/taxonomy/nucl_gb.accession2taxid.gz exists.
DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.03 GB]
DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.32 GB]
Reading names file ...
DBDIR/library/viral/Neighbors/esearch_res.jsonDownloading 191228 sequences into DBDIR/library/viral/Neighbors.
query_key=1&webenv=MCID_626a95838302021f8e7861e6
  Downloading sequences 1 to 10000 of 191228 ...
 done 
  Downloading sequences 10001 to 20000 of 191228 ...
 done 
  Downloading sequences 20001 to 30000 of 191228 ...
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=taxonomy&id=AC_000192
AC_000192: 11144 [/tmp/VL2eJBOTdq]
 done 
  Downloading sequences 30001 to 40000 of 191228 ...
 done 
  Downloading sequences 40001 to 50000 of 191228 ...
 done 
  Downloading sequences 50001 to 60000 of 191228 ...
 done 
  Downloading sequences 60001 to 70000 of 191228 ...
 done 
  Downloading sequences 70001 to 80000 of 191228 ...
 done 
  Downloading sequences 80001 to 90000 of 191228 ...
 done 
  Downloading sequences 90001 to 100000 of 191228 ...
 done 
  Downloading sequences 100001 to 110000 of 191228 ...
 done 
  Downloading sequences 110001 to 120000 of 191228 ...
 done 
  Downloading sequences 120001 to 130000 of 191228 ...
 done 
  Downloading sequences 130001 to 140000 of 191228 ...
 done 
  Downloading sequences 140001 to 150000 of 191228 ...
 done 
  Downloading sequences 150001 to 160000 of 191228 ...
 done 
  Downloading sequences 160001 to 170000 of 191228 ...
 done 
  Downloading sequences 170001 to 180000 of 191228 ...
 done 
  Downloading sequences 180001 to 190000 of 191228 ...
 done 
  Downloading sequences 190001 to 191228 of 191228 ...
 done 

krakenuniq-download.zip

@jsonProgram

Thanks clescoat!

luigra commented Aug 26, 2022

You saved my day!!!

Thanks a lot, @clescoat!

Why don't you push this important upgrade to the current version of the script, or ask @fbreitwieser or @DerrickWood to do it?

@salzberg (Collaborator)

Thanks for this help, @clescoat!
Just FYI to everyone on this thread: Florian Breitwieser left my lab several years ago and no longer maintains KrakenUniq, but we've had many recent improvements led by Christopher Pockrandt and Aleksey Zimin. Aleksey is still an active member of the lab (a senior scientist); Christopher has moved back to Germany but continues to help maintain the code. The recent major upgrade (allowing the use of small-memory computers despite a very large DB) was developed by Christopher.


@alekseyzimin (Collaborator)

I will integrate the update now. Thank you, @clescoat
