`fastmultigather` produces different results when searching against a normal zip-type database and a `Rocksdb` inverted index database #483

JiaweiShen1116 · 2024-10-17T23:11:13Z

Hi,

I was comparing the performance of fastmultigather when searching against a normal zip-type database and against a Rocksdb inverted index database. While running fastmultigather with Rocksdb database was much faster than with normal zip database, I noticed that running with Rocksdb database generated slightly fewer matches in my samples. More specifically, while most of the matches agreed between two runs, the numbers of matches whose unique_intersect_bp values are low are different. (e.g., I set --bp_threshold at 5000 for fastmultigather, then searching against a Rocksdb database produced less matches whose unique_intersect_bp values are 5000 than searching against a normal zip database).

I wonder why this happened. Was it due to the different searching algorithms?

Please let me know what further information I can provide for this case to better investigate this issue. Thank you.

The text was updated successfully, but these errors were encountered:

ctb · 2024-10-18T00:09:22Z

hi @JiaweiShen1116, I had to dive deeply into this when working on validating gather across the various implementations we have over in https://github.com/ctb/2024-debug-gather-difference. It turns out to be remarkably difficult 😂 . See #331 (comment) for an example.

The short version is that there may be multiple perfectly valid and equivalently "good" sets of gather matches to a particular metagenome; this because the order in which gather matches are chosen can be somewhat arbitrary, but then this arbitrary choice affects the downstream results 😅 .

I'm traveling right now, but I'll give you some code to validate the different sets of gather results when I get back. I am reasonably confident that both sets of results are equally correct (and should give the same taxonomic results, for example) but it would be good to have a way to check!

JiaweiShen1116 · 2024-10-18T19:18:08Z

@ctb Thanks for the explanation. I do have another question regarding the gather outputs. Currently we still use the base version of sourmash in our pipelines, and we have some matches filters built upon the values in the potential_false_negative column of the gather output. However, this column is not present in the fastmultigather output.

Can you give a brief explanation of what this potential_false_negative column is aiming to imply about the matches? Is this column going to be added back to the fastmultigather output in future updates, or this column is deprecated because it doesn't contain as "important" information as expected? Thanks.

ctb · 2024-10-18T19:22:15Z

@bluegenes mentioned to me that the potential_false_negative column is not a relevant column in practice, but I do not have a clear memory of why. I thought we'd documented it somewhere. In any case, it is deprecated and not suggested - we won't be adding it into branchwater.

JiaweiShen1116 · 2024-10-18T22:51:10Z

@ctb Thanks for explaining. Upon further testing, I noticed that searching against Rocksdb database with fastmultigather will always return only one match whose unique_intersect_bp value equals the --bp-threshold parameter value in the command line in each sample.

For reference, here are some of the gather output csv files of the same sample searched against Rocksdb database or standard zip database and at two different --bp-threshold values.
zr17780_1.gather.k51.5000.rocksdb.csv
zr17780_1.gather.k51.5000.standard.csv
zr17780_1.gather.k51.10000.rocksdb.csv
zr17780_1.gather.k51.10000.standard.csv

It seems to be more of a systematic issue with searching against Rocksdb database with fastmultigather . Please let me know if you have encountered similar issues like this before and if there's a possible explanation for this. Thanks.

ctb · 2024-10-24T16:06:00Z

sorry, still haven't gotten to writing up details evaluation stuff - hope to do so soon!

in the meantime, two thoughts -

first, the output of the summarize-gather.py script from here should be essentially identical for the gather results from different implementations. This is guaranteed by the algorithm - so if they are different, this would be one big indication opf a bug! Can you try running it? lmk if you need help.

second, one approach to validation (that I will go into in more detail soon, promise) is to use one set of gather output as a picklist for sourmash gather. Basically you would take the fastmultigather CSV and do something like sourmash gather --picklist fmg.csv::prefetch ... and you should get the same results from sourmash gather that you got from fmg, if they are correct. Vice versa, you can take the sourmash gather results and run sourmash sig cat pygather.csv::gather <database> -o gather_subset.sig.zip and then run fastmultigather against that - you should see identical results. These are ways of forcing the different gather results to use the identical set of results as a database; if they don't work, there's something rotten in the state of Denmark (i.e. a likely bug).

last, you can check out our validation script/repo here, https://github.com/ctb/2024-debug-gather-difference/blob/main/do-compare-gather.py, where we cross-compare results from different gather implementations on a specific data set. There are Reasons why this needs to be done carefully so if it doesn't just work, it's not you, it's my instructions 😭

JiaweiShen1116 · 2024-10-29T19:34:25Z

@ctb I have run the summarize-gather.py on both sets of gather results using standard zip database or Rocksdb database and attached the summary results here. You can see that the gather results are indeed different between two sets, with the one searched against the Rocksdb database having smaller number of matches across all samples. As a result, it could be likely that there is some bug in the fastmultigather when searching against the Rocksdb database.

zr17780.all.standard.summary.k51.10000.csv
zr17780.all.rocksdb.summary.k51.10000.csv

However, when I run the do-compare-gather.py on all samples between two sets of gather results, it shows that the two gather csv files contain identical core columns for all samples, despite the fact that the gather summary results shown above are clearly different. Could you check it and let me know why this happened? Thank you.

ctb · 2024-11-04T16:08:05Z

dipping my toes into this - I can confirm that fastmultigather against rocksdb will return equal results just fine. So that's not intrinsically a problem in the code. More to do here ;)

ctb · 2024-11-10T03:45:46Z

hi @JiaweiShen1116 I finally tracked down at least one problem that could be causing what you see 😰 - fastmultigather is not respecting --scaled properly when it is set on the command line and differs from the scaled value in the sketches being compared. So I am wondering if maybe you were specifying scaled explicitly on the command line?

See #505 for a more detailed bug description.

JiaweiShen1116 · 2024-11-11T19:33:06Z

hi @JiaweiShen1116 I finally tracked down at least one problem that could be causing what you see 😰 - fastmultigather is not respecting --scaled properly when it is set on the command line and differs from the scaled value in the sketches being compared. So I am wondering if maybe you were specifying scaled explicitly on the command line?

See #505 for a more detailed bug description.

Hi @ctb. Yes, I can confirm that I was specifying the scaled parameter explicitly when running fastmultigather.

ctb · 2024-11-11T19:36:39Z

fantastic - then that at least gives me hope that it was #505 that caused the problem, and not some shiny new bug 😆 . I'll ping you when I release a new version with a fix!

ctb · 2024-11-12T00:29:59Z

whoops, did not mean to close that!

I've just released sourmash_plugin_branchwater v0.9.10, which contains the fix to #505 that's in #504. If you get a chance to try it out, please let me know!

Note that gather works on the unweighted k-mers, so the ref_f_unweighted (as output by summarize-gather.py) should sum to something identical independent of the gather implementation. I have noticed that the RocksDB implementation returns different numbers for the ref_f_weighted sometimes.

ctb mentioned this issue Nov 11, 2024

MRG: fix sketch downsampling across the plugin, incl fastmultigather, manysearch, and multisearch #504

Merged

ctb closed this as completed in #504 Nov 11, 2024

ctb reopened this Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`fastmultigather` produces different results when searching against a normal zip-type database and a `Rocksdb` inverted index database #483

`fastmultigather` produces different results when searching against a normal zip-type database and a `Rocksdb` inverted index database #483

JiaweiShen1116 commented Oct 17, 2024

ctb commented Oct 18, 2024

JiaweiShen1116 commented Oct 18, 2024

ctb commented Oct 18, 2024

JiaweiShen1116 commented Oct 18, 2024

ctb commented Oct 24, 2024

JiaweiShen1116 commented Oct 29, 2024

ctb commented Nov 4, 2024

ctb commented Nov 10, 2024

JiaweiShen1116 commented Nov 11, 2024

ctb commented Nov 11, 2024

ctb commented Nov 12, 2024

fastmultigather produces different results when searching against a normal zip-type database and a Rocksdb inverted index database #483

fastmultigather produces different results when searching against a normal zip-type database and a Rocksdb inverted index database #483

Comments

JiaweiShen1116 commented Oct 17, 2024

ctb commented Oct 18, 2024

JiaweiShen1116 commented Oct 18, 2024

ctb commented Oct 18, 2024

JiaweiShen1116 commented Oct 18, 2024

ctb commented Oct 24, 2024

JiaweiShen1116 commented Oct 29, 2024

ctb commented Nov 4, 2024

ctb commented Nov 10, 2024

JiaweiShen1116 commented Nov 11, 2024

ctb commented Nov 11, 2024

ctb commented Nov 12, 2024

`fastmultigather` produces different results when searching against a normal zip-type database and a `Rocksdb` inverted index database #483

`fastmultigather` produces different results when searching against a normal zip-type database and a `Rocksdb` inverted index database #483