-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: use correct denominator in f_unique_to_query #3138
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## latest #3138 +/- ##
==========================================
+ Coverage 86.62% 86.65% +0.02%
==========================================
Files 136 136
Lines 15818 15818
Branches 2713 2713
==========================================
+ Hits 13703 13707 +4
+ Misses 1805 1801 -4
Partials 310 310
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
this change looks good. There is a test for the I didn't originally realize - the revindex tests are here: sourmash/src/core/src/index/revindex/mod.rs Line 446 in fe4790f
that would be a good spot to add tests |
I did not know this: you have to specifically run the tests with (Our github actions does the right thing, I think. I just was struggling to get |
This confused me at first too + I regularly need to remind myself of cargo testing syntax. Do we have any sourmash rust dev docs yet? |
the code in |
note that when I started adding tests, I got problematic results per #3139 / sourmash-bio/sourmash_plugin_branchwater#322. The current tests do check the denominator issue in #3137 so I think we should merge it and leave #3139 to another PR. Let me know if you disagree @bluegenes ! |
Ready for review & (assuming tests pass) merge! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
…d by plugins (#3193) This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where RocksDB-based `gather` was incorrectly subtracting hashes multiple times from the counter in situations of high redundancy. For example, consider this Venn diagram of the 3-way intersection between three sketches: ![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a) When a metagenome contains the union of all three of these sketches, the broken implementation would subtract the `10` at the center multiple times. This was caused by removing hashes from the matches, rather than the intersection, each pass through the counter. Of note, this made RocksDB-based `fastmultigather` return incorrect results, ref sourmash-bio/sourmash_plugin_branchwater#322; first discovered in #3138 (comment). The PR fixes this, and adds a more complete pair of tests, based on `test_gather_metagenome_num_results` in the Python tests for sourmash. This PR also adjusts the hash function string for DNA sketches in Rust to be uppercase `DNA` rather than lowercase `dna`, ref sourmash-bio/sourmash_plugin_directsketch#49 And remember, it's not *just* the destination - it's the friends you make along the way, like `env_logger`. * Fixes #3139 * Fixes sourmash-bio/sourmash_plugin_directsketch#49 For consideration: Right now we are calculating the intersection twice, once in `disk_revindex.rs` and once in `calculate_gather_stats` in `revindex/mod.rs`. This is unnecessary, of course. But the function signature for `calculate_gather_stats` would need to change to take the intersection as an argument. We could: * keep calculating it twice, just for simplicity; * change `calculate_gather_stats` to take an _optional_ intersection, and calculate it if not provided; * change `calculate_gather_stats` to require an intersection. TODO: - [x] add at least one explicit test for the moltype fix Other notes: * there is a discrepancy between the Python (`sourmash gather`) and Rust (`sourmash_plugin_branchwater` results) calculations for `remaining_bp`. It seems to me like the Python one is definitely wrong; not yet sure about Rust. Viz #3194.
#3199) ## [0.14.0] - 2024-06-10 MSRV: 1.65 Changes/additions: * fix cargo fmt for updated `disk_revindex.rs` code (#3197) * fix RocksDB-based gather & other rust-based infelicities revealed by plugins (#3193) * use correct denominator in f_unique_to_query (#3138) * fix clippy warnings about max_value (#3146) * allow get/set record.filename (#3121) Updates: * Bump statrs from 0.16.0 to 0.16.1 (#3186) * Bump serde from 1.0.202 to 1.0.203 (#3175) * Bump ouroboros from 0.18.3 to 0.18.4 (#3176) * Bump itertools from 0.12.1 to 0.13.0 (#3166) * Bump camino from 1.1.6 to 1.1.7 (#3169) * Bump serde from 1.0.201 to 1.0.202 (#3168) * Bump serde_json from 1.0.116 to 1.0.117 (#3159) * Bump serde from 1.0.200 to 1.0.201 (#3160) * Bump roaring from 0.10.3 to 0.10.4 (#3142) * Bump histogram from 0.10.0 to 0.10.1 (#3141) * Bump num-iter from 0.1.44 to 0.1.45 (#3140) * Bump serde from 1.0.199 to 1.0.200 (#3144) * Bump serde from 1.0.198 to 1.0.199 (#3130) * Bump serde_json from 1.0.115 to 1.0.116 (#3124) * Bump serde from 1.0.197 to 1.0.198 (#3122) * Bump histogram from 0.9.1 to 0.10.0 (#3109) * Bump enum_dispatch from 0.3.12 to 0.3.13 (#3102) * Bump serde_json from 1.0.114 to 1.0.115 (#3101) * Bump rayon from 1.9.0 to 1.10.0 (#3098)
Minor new features: * add `--set-name` to `sig intersect` and `sig subtract` (#3162) * upgrade `sig overlap` and `sig subtract` to load more than JSON signatures (#3153) * force continue past `tax genome` classification errors (#3100) Bug fixes: * fix `remaining_bp` output from sourmash gather (#3195) * fix RocksDB-based gather & other rust-based infelicities revealed by plugins (#3193, #3197) * use correct denominator in f_unique_to_query (#3138) Cleanup and documentation updates: * update JOSS for sourmash v4 (#3114, #3203, #3209) * fix links to taxonomy spreadsheets (#3119) * fix description of `f_unique_weighted` (#3164) Developer updates: * transition internal signature loading functions (#3161) * allow get/set record.filename (#3121) * round a number that is losing precision in 15th place in `test_distance_utpy` (#3126) * disable ppc64le wheel building (#3127) * prepare to remove `sourmash compute` for sourmash v5.0 (#3103) * add rustup target x86_64-apple-darwin (#3148) * mv `.cargo/config` to `config.toml` (#3147) * fix clippy warnings about max_value (#3146) * bump to v4.8.9-dev (#3135) * update src/core/CHANGELOG.md for sourmash-rs core release r0.14.0 (#3199) Dependabot updates: * Bump DeterminateSystems/nix-installer-action from 11 to 12 (#3184) * Bump DeterminateSystems/magic-nix-cache-action from 6 to 7 (#3185) * Bump statrs from 0.16.0 to 0.16.1 (#3186) * Bump serde from 1.0.202 to 1.0.203 (#3175) * Bump ouroboros from 0.18.3 to 0.18.4 (#3176) * Bump itertools from 0.12.1 to 0.13.0 (#3166) * Bump camino from 1.1.6 to 1.1.7 (#3169) * Bump serde from 1.0.201 to 1.0.202 (#3168) * Bump thiserror from 1.0.60 to 1.0.61 (#3167) * Bump pypa/cibuildwheel from 2.18.0 to 2.18.1 (#3165) * Bump DeterminateSystems/magic-nix-cache-action from 4 to 6 (#3157) * Bump DeterminateSystems/nix-installer-action from 10 to 11 (#3156) * Bump pypa/cibuildwheel from 2.17.0 to 2.18.0 (#3155) * Bump serde_json from 1.0.116 to 1.0.117 (#3159) * Bump thiserror from 1.0.59 to 1.0.60 (#3158) * Bump serde from 1.0.200 to 1.0.201 (#3160) * Bump roaring from 0.10.3 to 0.10.4 (#3142) * Bump histogram from 0.10.0 to 0.10.1 (#3141) * Bump getrandom from 0.2.14 to 0.2.15 (#3143) * Bump num-iter from 0.1.44 to 0.1.45 (#3140) * Bump jinja2 from 3.1.3 to 3.1.4 (#3145) * Bump serde from 1.0.199 to 1.0.200 (#3144) * Bump serde from 1.0.198 to 1.0.199 (#3130) * Bump conda-incubator/setup-miniconda from 3.0.3 to 3.0.4 (#3131) * Update pytest requirement from <8.2.0,>=6.2.4 to >=6.2.4,<8.3.0 (#3132) * Bump myst-parser from 2.0.0 to 3.0.1 (#3133) * Bump thiserror from 1.0.58 to 1.0.59 (#3123) * Bump serde_json from 1.0.115 to 1.0.116 (#3124) * Bump serde from 1.0.197 to 1.0.198 (#3122) * Update docutils requirement from <0.21,>=0.17.1 to >=0.17.1,<0.22 (#3116)
This PR fixes an issue introduced in #2943 where we introduced a subtly broken calculation that uses the current size of the query metagenome as the denominator for the
f_unique_to_query
calculation.Fixes #3137
This PR also adds some commented-out test code that demonstrates #3139 / sourmash-bio/sourmash_plugin_branchwater#322. That's something I haven't been able to debug, so I'd suggest fixing that independently - I'd rather fix a problem now, rather than waiting until we can fix multiple problems at some later indeterminate time :).
Notes
linear.rs
? or just rename things per Erroneousf_unique_to_query
calculation in Rust code. #3137?