adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

ctb · 2020-04-28T17:57:11Z

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

The basic idea here is that we combine taxonomy with search and gather on signatures/SBTs/LCAs, by connecting identifiers with lineages -- do dynamically what sourmash lca index does once.

This falls under use cases of taxonomic filtering and reporting.

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

It might be possible to add taxonomy/lineage-aware selector functionality on signatures and databases, e.g. "run a gather on this database, but ignore all results that are not Archaea."

Should also look at sourmash lca gather to see about collapsing results taxonomically...

Thoughts welcome!

related conversations

the ZipStorage work in [MRG] add ZipStorage, support loading tree from storage #648 makes it plausible to package SBTs up together with a lineage.
related conversation in this comment in #581

The text was updated successfully, but these errors were encountered:

luizirber · 2020-04-29T05:34:14Z

(Sorry, wrong button)

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

Another example: this is the script I'm using for thesis work and 2020-cami: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py

It's using pandas (for summarizing at different taxonomic levels) and the taxonomy package for parsing taxonomies, and I avoided bringing this into sourmash because they are heavy-ish deps, but having a sourmash taxonomy subcommand for working with similar tasks (and bringing the lineage parsers necessary for LCA databases, for example) seems a good idea.

(Converting from an SBT to an LCA index like in https://github.com/luizirber/2020-cami/blob/f7f34d3903cd7d2b6bc2ac0471f7d53a42aa86b2/rules/build_indices.smk#L64L66 without depending on external code would be pretty cool. I guess the revindex part is already doable today?)

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

Totally agree, and I think it also hurts functionality. Deposited genomes in genbank or refseq don't change (but they might get new versions), but the accession to taxid mapping CAN change (see: recent Lactobacillus genera being split).

It is so hard to deal with the NCBI taxonomy because it changes and doesn't have stable downloads for a specific date, so while generating taxinfo and putting it into SBTs has the benefit of capturing the taxonomy at one point in time (and being very convenient to use), but the drawback of potentially being outdated.

the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.

Yup, but would also like to see some option to override it somehow (re: outdated taxonomies)

ctb · 2020-04-29T14:38:14Z

OK, a few more thoughts and some summarization --

splitting LCA/SBT hash storage from lineage storage as per Implement more/better revindex functionality on top of LCA databases. #581 is a good idea. this would let us separate issues of hash storage (DNA vs protein vs dayhoff vs hp, skip-mers, abundances, and so on) out from lineage manipulation.
adding taxonomy utilities as a separate submodule == good. functionality could include:
- importing NCBI and Greengenes/GTDB taxonomies; ref NCBI import utilities, link in some of the various convenience scripts for dealing with genbank, etc. to the docs #397. Maybe also LIN / genomeRxiv relevant.
- taxonomic database exploration (basic keyword search, "how many lineages", etc)
- taxonomic summarization of search and gather output
- taxonomic filtering/collapse of search and gather output (discussed in Layer phylogenetic options on to sourmash lca gather? #583 and add taxonomic level limit filters to sourmash lca summarize #817)
wherever we include taxonomy, allow overriding it to provided updated identifier -> lineage functionality. this would be a benefit of the separated lineage/taxonomy database proposed in Implement more/better revindex functionality on top of LCA databases. #581
sourmash taxonomy code could support common functionality for contamination detection/removal research projects such as sourmash-oddify, charcoal, and long read assembly decontam. Basically, we're talking about making nicer public APIs per sourmash.lca current functionality, e.g. Add some lineage manipulation functionality to lca_utils #458. Seems on the edge of scope but our genomeRxiv funding will need this, perhaps.
more/better "human-focused output" per Provide a utility for human investigation of taxonomic results #265 is a nice goal, hard to know how to interact with NCBI and GTDB unless we can find a perma-URL identifier scheme. Presumably we'd want to add lots of live web page tests or something.

Out of scope --

identifier/source schemas, per provide a mechanism for tracking the source of genomes in SBTs #230

ctb · 2020-04-29T15:31:28Z

ref taxonkit https://twitter.com/byuhobbes/status/1255513435057643532

ctb · 2020-04-29T17:15:05Z

ref anvio, https://twitter.com/merenbey/status/1255542732505956352

dkoslicki · 2020-04-29T20:58:57Z

ETE3 is a great way for filtering/manipulating/etc. NCBI taxonomy that I've used FWIW

ctb · 2020-05-03T14:21:05Z

another piece of taxonomy related functionality - newick output, ref #915

ctb · 2020-05-04T22:16:42Z

random thoughts --

there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
check to make sure what output krona and other programs (metacoder, etc.) take in, make sure we output that.

luizirber · 2020-05-04T22:25:38Z

there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.

That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)

ctb · 2020-05-04T22:28:24Z

On Mon, May 04, 2020 at 03:25:51PM -0700, Luiz Irber wrote: > there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does _after_ a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful. That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)

yep - we have plenty of examples of this (sourmash lca summarize, for example). What I'm suggesting here most specifically is that we provide scripts that do this _after_ gather is run, and separate out the "search for genomes" from the "summarize taxonomy" functionality in a way that currently is only done in ad hoc scripts in other projects. I have a prototype of this myself, over in charcoal.

luizirber · 2020-05-05T01:55:56Z

yep - we have plenty of examples of this (sourmash lca summarize, for example). What I'm suggesting here most specifically is that we provide scripts that do this after gather is run, and separate out the "search for genomes" from the "summarize taxonomy" functionality in a way that currently is only done in ad hoc scripts in other projects. I have a prototype of this myself, over in charcoal.

But... that's what I described 🤣

This is exactly what happens in 2020-cami: I run gather first, and then use the gather output CSV to summarize taxonomy (taking a pre-calculated accession-to-taxid file for a specific database, because it takes a long time to process the full acc2taxid every time).
Here is the order they are called in the sourmash biobox:
https://github.com/sourmash-bio/bioboxes/blob/bf27b93d86fccc34540bf709996abc3076c00122/task_sourmash.pl#L21

ctb · 2020-05-05T13:53:04Z

On Mon, May 04, 2020 at 06:56:09PM -0700, Luiz Irber wrote: But... that's what I described :rofl:

fine, fine.

ctb · 2020-05-07T14:45:00Z

I've started playing around with separating revindex and lineage information, over in dib-lab/charcoal, just_taxonomy.py.

I created a LineageDB class that just manages the lineages, and then refactored a few functions to use that instead of the internal lineage handling in LCA_Database.

A few observations from this strategy -

we need some new functions on LCA_Database - in particular, get_idents_for_hashval - to provide intermediaries. That can be added without a new major version of sourmash.
it's not immediately clear to me how to handle loading/saving of lineage DBs to JSON, and if/how to encapsulate LineageDB within LCA_Database, but I have some thoughts. Will play.

ctb · 2020-05-23T14:20:01Z

here, from silva, is a nice example of multiple taxonomies being available in a database.

With SILVA release 138 the Opens external link in new windowGenome Taxonomy Database (GTDB) has been adopted (Parks 2018). As a consequence of our efforts the following groups were prone to significant adaptations: Archaea, Enterobacterales, Deltaproteobacteria, Firmicutes, Clostridia. Betaproteobacteriales (formerly known as Betaproteobacteria) is now Burkholderiales, an order of Gammaproteobacteria. Epsilonproteobacteria vanishes within a new phylum Campilobacterota. Tenericutes are gone, they are now all part of Bacilli, inside Firmicutes.

Additionally, every sequence in the SILVA datasets carries the EMBL-EBI/ENA taxonomy assignment. Where available, the RDP and GTDB taxonomies are added for comparison. In releases <138 also the greengenes taxonomy was added.

ctb · 2020-05-29T13:07:19Z

mmseqs2 does nice stuff with multiple taxonomies, it looks like - search for "taxonomy" on https://github.com/soedinglab/mmseqs2/wiki

ctb · 2020-07-04T15:04:32Z

note gather-to-tax.py

luizirber · 2020-08-07T00:04:57Z

I was talking with @bluegenes and after checking @erikyoung85 work in #1131 and the LineageDB from charcoal that @ctb mentioned, I would like to propose a few things:

Create TaxInfo, which will be pretty close to what LineageDB is doing in charcoal.
I think we need to keep lid_to_lineage and ident_to_lid in the saved files, and reconstruct/cache other useful derivations (lineage_to_lid, the inverse of lid_to_lineage, is one example).
SBT and LCA indices grab info from TaxInfo by ident. They are free to use other mappings they need (ident_to_name, idx_to_lid in the LCA case, for example).
Query TaxInfo by ident or by lid. Ideally just by ident, to avoid consistency problems if an ident -> lid mapping changes.
TaxInfo objects should be optional in an index.

On the Rust side it would be something like

type LineagePairs = BTreeMap<String, String>;

pub struct TaxInfo {
    lid_to_lineage: HashMap<u32, LineagePairs>
    ident_to_lid: HashMap<String, u32>
}

impl TaxInfo {
  fn get_lineage(&self, ident: &str) -> Option<LineagePairs> {}
}

And I think it can be serialized/deserialized with some processing to the current format.

On the Python side it will be similar to LineageDB.

Feature-wise, I think it is also important to be able to override a TaxInfo in the command line (or provide a different one for Index in Python/Rust). Especially since we "precompute" some desired ranks for LCA today, and they might not match other use cases (the CAMI profile format allows providing your own rank, for example). Critically, the important info in the TaxInfo class is ident, and as long as the idents are the same then two different TaxInfo objects can be interchangeable. This is a very lax definition, but I think it also fits well with other taxonomies (since, fundamentally, LineagePairs can be any list of (rank, value) that the user desires.

luizirber · 2020-08-07T15:17:49Z

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?

$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}

Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

ctb · 2020-08-10T20:31:45Z

re #969 (comment), I am ok with doing the Rust API so that that is easy, but I would suggest getting the current functionality oxidized and merged first before a split. Otherwise you end up with too many moving parts.

ctb · 2020-08-10T20:59:36Z

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?
$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}
Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

yes, lineage is optional in LCA databases.

ctb · 2021-04-23T17:29:09Z

random aside - splitting revindex and taxonomy would mean that we can update taxonomy without updating revindex, which currently is impossible in LCA databases.

ctb · 2021-04-23T17:30:08Z

(an interesting short research paper might be to compare and contrast gather and LCA approaches on the same database directly)

(and also GTDB and NCBI taxonomy result comparisons by finding the same k-mers, then applying different taxonomies)

ctb · 2022-01-26T15:00:05Z

In #1808, I'm adding SqliteIndex, which could also support LCA-style functionality as well as storing taxonomy information more generally, all in the same file; revisiting #1651, this is just the taxonomy table defined in tax/tax_utils.py.

luizirber closed this as completed Apr 29, 2020

luizirber reopened this Apr 29, 2020

luizirber mentioned this issue May 5, 2020

check genome_updater for genbank/refseq downloads sourmash-bio/databases#5

Closed

This was referenced May 8, 2020

how might we distribute "diff" or patch databases? #985

Open

configuration file and user-provided lineages dib-lab/charcoal#24

Closed

ctb mentioned this issue May 23, 2020

factor out LineageDB from charcoal / add support functions to sourmash dib-lab/charcoal#96

Open

ctb mentioned this issue Jun 20, 2020

Layer phylogenetic options on to sourmash lca gather? #583

Closed

ctb mentioned this issue Jul 4, 2020

support exporting lineage spreadsheet from LCA databases #1080

Open

ctb mentioned this issue Jul 13, 2020

implement a sourmash classify command? #1099

Closed

luizirber mentioned this issue Jul 17, 2020

Building genbank/refseq databases from assembly_summary.txt sourmash-bio/databases#7

Closed

ctb mentioned this issue Jul 17, 2020

build a streamlined Genbank/RefSeq based on GTDB 25k genomes sourmash-bio/databases#8

Closed

luizirber mentioned this issue Aug 22, 2020

Oxidize lca_db #1131

Closed

5 tasks

ctb mentioned this issue Feb 27, 2021

[MRG] implement a simple ZipFileLinearIndex class #1349

Merged

5 tasks

ctb mentioned this issue Apr 23, 2021

roadmap thoughts for 4.1 and 4.2 #1481

Closed

4 tasks

taylorreiter mentioned this issue May 11, 2021

moving toward sourmash taxonomy for taxonomy reporting and manipulation from sourmash gather results #1515

Closed

ctb added the revisit_me An issue that needs attention and clarification label Jun 26, 2021

ctb mentioned this issue Aug 13, 2022

translating between taxonomies - maybe a sourmash tax translate? #2201

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

ctb commented Apr 28, 2020

luizirber commented Apr 29, 2020

ctb commented Apr 29, 2020 •

edited

Loading

ctb commented Apr 29, 2020

ctb commented Apr 29, 2020

dkoslicki commented Apr 29, 2020

ctb commented May 3, 2020

ctb commented May 4, 2020

luizirber commented May 4, 2020

ctb commented May 4, 2020 via email

luizirber commented May 5, 2020

ctb commented May 5, 2020 via email

ctb commented May 7, 2020

ctb commented May 23, 2020

ctb commented May 29, 2020

ctb commented Jul 4, 2020

luizirber commented Aug 7, 2020 •

edited

Loading

luizirber commented Aug 7, 2020 •

edited

Loading

ctb commented Aug 10, 2020

ctb commented Aug 10, 2020

ctb commented Apr 23, 2021

ctb commented Apr 23, 2021

ctb commented Jan 26, 2022

adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

Comments

ctb commented Apr 28, 2020

related conversations

luizirber commented Apr 29, 2020

ctb commented Apr 29, 2020 • edited Loading

ctb commented Apr 29, 2020

ctb commented Apr 29, 2020

dkoslicki commented Apr 29, 2020

ctb commented May 3, 2020

ctb commented May 4, 2020

luizirber commented May 4, 2020

ctb commented May 4, 2020 via email

luizirber commented May 5, 2020

ctb commented May 5, 2020 via email

ctb commented May 7, 2020

ctb commented May 23, 2020

ctb commented May 29, 2020

ctb commented Jul 4, 2020

luizirber commented Aug 7, 2020 • edited Loading

luizirber commented Aug 7, 2020 • edited Loading

ctb commented Aug 10, 2020

ctb commented Aug 10, 2020

ctb commented Apr 23, 2021

ctb commented Apr 23, 2021

ctb commented Jan 26, 2022

ctb commented Apr 29, 2020 •

edited

Loading

luizirber commented Aug 7, 2020 •

edited

Loading

luizirber commented Aug 7, 2020 •

edited

Loading