A question on the flag '--inputfile' in the command 'datasets summary genome taxon' : taxid with no genome #450

Open
DongHRLZU opened this issue Feb 9, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@DongHRLZU

Hello,

  I am new to the ncbi-datasets tools.
  Recently I have been passing a list of taxids through the '--inputfile' option of the command 'datasets summary genome taxon' to get genome reports. The problem is that I cannot know in advance whether each taxonomic ID in my list has an existing genome assembly. As a result, the program always stops as soon as it reaches a taxonomic ID with no known genome on NCBI.
  Is there an option to skip the taxonomic IDs that have no corresponding genome, so that my workflow still gets the genome report, and to write those IDs to stderr at the same time? I did not find anything about this in the help manual.
  For now I have had to give up the '--inputfile' option and use a loop instead.
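
  For readers who want to try the loop approach, here is a minimal sketch. The helper name summarize_each is hypothetical, and the sketch assumes that datasets exits with a non-zero status when a taxid has no genome data, which this thread does not confirm.

```shell
#!/bin/sh
# Run a command once per taxid; report taxids for which it fails on stderr.
# Usage: summarize_each LIST_FILE COMMAND [ARGS...]
# (summarize_each is a hypothetical helper, not part of the datasets CLI.)
summarize_each() {
    list=$1
    shift
    while IFS= read -r taxid || [ -n "$taxid" ]; do
        [ -n "$taxid" ] || continue            # skip blank lines
        # </dev/null keeps the command from consuming the rest of the list
        if ! "$@" "$taxid" </dev/null; then
            printf 'skipped taxid: %s\n' "$taxid" >&2
        fi
    done < "$list"
}

# Example call, assuming a non-zero exit status for taxids without genomes:
# summarize_each tax.list datasets summary genome taxon > genome_reports.json
```

  The output is one report per successful taxid, written back to back, so it may need to be merged before further processing.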

  Looking forward to your reply and help.
@ericcox1 ericcox1 added the enhancement New feature or request label Feb 10, 2025
@ericcox1
Collaborator

Hi @DongHRLZU,

Thanks for creating this issue. As you point out above, the current behavior is to abort if datasets encounters a taxid without genome data. Based on your feedback, we are going to change this behavior so it will still return genome data even if it encounters a taxid without genome data. This could take a little while due to competing priorities.

In the meantime, you may be interested in using the counts in the taxonomy data report to check whether a particular taxid has genome data.

Given a list, tax.list, where 9606 and 10090 have genome data and 105513 does not, you can check the taxonomy data report to see whether genome data is available, and filter out taxids without genome data:

# Given a taxid list with a mixture of taxids with and without genome data
cat tax.list
9606
105513
10090

# Use datasets to check the taxonomy data report for genome assembly counts, then filter out taxids without genomes
datasets summary taxonomy taxon --inputfile tax.list | \
jq -r '.reports[].taxonomy | (if .counts[]?.type=="COUNT_TYPE_ASSEMBLY" then .tax_id else empty end)'
10090
9606
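
To also get the filtered-out taxids on stderr, as asked in the original question, the filter above could be wrapped along these lines. This is only a sketch: the function names report_skipped and summarize_with_genomes are hypothetical, and it assumes that jq and a POSIX grep are available.

```shell
#!/bin/sh
# Print to stderr every taxid in ALL_LIST that is missing from KEPT_LIST.
# Usage: report_skipped ALL_LIST KEPT_LIST   (hypothetical helper)
report_skipped() {
    # -v invert, -x whole-line match, -F fixed strings, -f patterns from file
    grep -vxFf "$2" "$1" >&2
    return 0    # grep exits 1 when nothing was skipped; that is not an error
}

# Filter a taxid list with the taxonomy counts, report skipped taxids,
# then request genome reports only for the taxids that have assemblies.
# Usage: summarize_with_genomes LIST_FILE    (hypothetical helper)
summarize_with_genomes() {
    kept=$(mktemp) || return 1
    datasets summary taxonomy taxon --inputfile "$1" |
        jq -r '.reports[].taxonomy | (if .counts[]?.type=="COUNT_TYPE_ASSEMBLY" then .tax_id else empty end)' > "$kept"
    report_skipped "$1" "$kept"
    status=0
    if [ -s "$kept" ]; then    # only call datasets if at least one taxid is left
        datasets summary genome taxon --inputfile "$kept"
        status=$?
    fi
    rm -f "$kept"
    return "$status"
}

# Example call:
# summarize_with_genomes tax.list > genome_reports.json 2> skipped.list
```

With the tax.list above, report_skipped would write 105513 to stderr.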

Best,
Eric

@DongHRLZU
Author

Well, I didn't realize that I could first check the genome assembly counts with datasets summary taxonomy. Still, I think it would be more convenient and straightforward if the output of datasets summary genome showed the genome availability for each taxid. Anyway, thank you for taking my question seriously.
