Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to AllSome/Split sequence bloom trees? #545

Open
olgabot opened this issue Sep 17, 2018 · 2 comments
Open

Upgrade to AllSome/Split sequence bloom trees? #545

olgabot opened this issue Sep 17, 2018 · 2 comments

Comments

@olgabot
Copy link
Collaborator

olgabot commented Sep 17, 2018

Are there any plans to upgrade the backend SBT code to use AllSome or Split sequence bloom trees, as they're faster in both creating and searching the indices?

They've been implemented, but the licenses isn't compatible with the BSD.

I see there's some interest in using Rust for the backend, and this might be a nice excuse to play around in the language. I'm happy to help where I can!

@phiweger
Copy link

The SBT data structure seems less optimal given recent work by Zamin Iqbal's group on a Bitsliced Genomic Signature Index (BIGSI). They report that they need 4 orders of magnitude less space for the index. And you can incrementally add data to the index. Also there is Mantis (Rob Patro's group) which seems to outperform SBTs as well (not sure here). Given that these things are efficiently implemented, I wonder whether sourmash could use these implementations instead of SBT, via C bindings for example (@luizirber).

From the BIGSI article on why SBTs are suboptimal for microbiology:

The first scalable search engine for raw sequence was developed in 2016, the Sequence Bloom Tree (SBT) (13), a k-mer (fixed-length DNA word) index, developed to enable detection of specific transcripts in RNA-seq data. However, SBT and its recent successors (14-16) shared with previous similar methods (Cortex, vari, McCortex) (17-20) a scaling dependence on the total number of k-mers in the union of datasets indexed. For species where there is considerable k-mer sharing between datasets (e.g. human), this works well. However, bacteria are fundamentally different: even within a species there can be enormous diversity due to horizontal transfer of DNA (Supplementary Figure 1), and we will show that this diversity renders previous methods unable to scale.

@luizirber
Copy link
Member

@olgabot, hmm, now I see the motivation for your question =]

After the LCA index was implemented we figured it has a pretty similar API to the SBT index, and with the interested in other indexes it seems it's a good time to make a Index interface/base class and support more options.

About Rust: I'm glad you see my long term plan =P
I started implementing the regular SBT in sourmash-rust, but it involves more effort than the previous goals (moving the C++ core and signature loading/saving). The biggest hurdle is internal nodes being khmer Nodegraphs, but we can also sidestep the problem by implementing new Indexes in Rust and exposing to Python instead of porting everything.

I'll put up some scaffold and open a PR this week, and we can work together to implement it.

@phiweger dirty truth: the Nodegraphs we use are too small, but once they saturate they keep working fine (but with longer search times, because we get more false positives). You still get precise results (because we go down to the leaves, and false positives don't change results), but it could be more efficient.

I'm not opposed to use BIGSI or Mantis, but a benefit for current SBT code is that it's very easy to add more datasets. BIGSI and Mantis are also tackling a harder problem because they take every k-mer into account, and we don't (@rob-p @Phelimb is this a fair assessment of BIGSI and mantis?). But I would focus in fixing the low hanging fruit before going too crazy into implementing everything under the sun (I still need to finish my PhD too! =P)

@luizirber luizirber mentioned this issue Oct 22, 2020
10 tasks
ctb pushed a commit that referenced this issue Oct 28, 2022
Updates the requirements on
[pytest-cov](https://github.com/pytest-dev/pytest-cov) to permit the
latest version.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's
changelog</a>.</em></p>
<blockquote>
<h2>4.0.0 (2022-09-28)</h2>
<p><strong>Note that this release drops support for
multiprocessing.</strong></p>
<ul>
<li>
<p><code>--cov-fail-under</code> no longer causes <code>pytest
--collect-only</code> to fail
Contributed by Zac Hatfield-Dodds in
<code>[#511](pytest-dev/pytest-cov#511)
&lt;https://github.com/pytest-dev/pytest-cov/pull/511&gt;</code>_.</p>
</li>
<li>
<p>Dropped support for multiprocessing (mostly because <code>issue 82408
&lt;https://github.com/python/cpython/issues/82408&gt;</code>_). This
feature was
mostly working but very broken in certain scenarios and made the test
suite very flaky and slow.</p>
<p>There is builtin multiprocessing support in coverage and you can
migrate to that. All you need is this in your
<code>.coveragerc</code>::</p>
<p>[run]
concurrency = multiprocessing
parallel = true
sigterm = true</p>
</li>
<li>
<p>Fixed deprecation in <code>setup.py</code> by trying to import
setuptools before distutils.
Contributed by Ben Greiner in
<code>[#545](pytest-dev/pytest-cov#545)
&lt;https://github.com/pytest-dev/pytest-cov/pull/545&gt;</code>_.</p>
</li>
<li>
<p>Removed undesirable new lines that were displayed while reporting was
disabled.
Contributed by Delgan in
<code>[#540](pytest-dev/pytest-cov#540)
&lt;https://github.com/pytest-dev/pytest-cov/pull/540&gt;</code>_.</p>
</li>
<li>
<p>Documentation fixes.
Contributed by Andre Brisco in
<code>[#543](pytest-dev/pytest-cov#543)
&lt;https://github.com/pytest-dev/pytest-cov/pull/543&gt;</code>_
and Colin O'Dell in
<code>[#525](pytest-dev/pytest-cov#525)
&lt;https://github.com/pytest-dev/pytest-cov/pull/525&gt;</code>_.</p>
</li>
<li>
<p>Added support for LCOV output format via
<code>--cov-report=lcov</code>. Only works with coverage 6.3+.
Contributed by Christian Fetzer in
<code>[#536](pytest-dev/pytest-cov#536)
&lt;https://github.com/pytest-dev/pytest-cov/issues/536&gt;</code>_.</p>
</li>
<li>
<p>Modernized pytest hook implementation.
Contributed by Bruno Oliveira in
<code>[#549](pytest-dev/pytest-cov#549)
&lt;https://github.com/pytest-dev/pytest-cov/pull/549&gt;</code>_
and Ronny Pfannschmidt in
<code>[#550](pytest-dev/pytest-cov#550)
&lt;https://github.com/pytest-dev/pytest-cov/pull/550&gt;</code>_.</p>
</li>
</ul>
<h2>3.0.0 (2021-10-04)</h2>
<p><strong>Note that this release drops support for Python 2.7 and
Python 3.5.</strong></p>
<ul>
<li>Added support for Python 3.10 and updated various test dependencies.
Contributed by Hugo van Kemenade in
<code>[#500](pytest-dev/pytest-cov#500)
&lt;https://github.com/pytest-dev/pytest-cov/pull/500&gt;</code>_.</li>
<li>Switched from Travis CI to GitHub Actions. Contributed by Hugo van
Kemenade in
<code>[#494](pytest-dev/pytest-cov#494)
&lt;https://github.com/pytest-dev/pytest-cov/pull/494&gt;</code>_ and
<code>[#495](pytest-dev/pytest-cov#495)
&lt;https://github.com/pytest-dev/pytest-cov/pull/495&gt;</code>_.</li>
<li>Add a <code>--cov-reset</code> CLI option.
Contributed by Danilo Šegan in
<code>[#459](pytest-dev/pytest-cov#459)
&lt;https://github.com/pytest-dev/pytest-cov/pull/459&gt;</code>_.</li>
<li>Improved validation of <code>--cov-fail-under</code> CLI option.
Contributed by ... Ronny Pfannschmidt's desire for skark in
<code>[#480](pytest-dev/pytest-cov#480)
&lt;https://github.com/pytest-dev/pytest-cov/pull/480&gt;</code>_.</li>
<li>Dropped Python 2.7 support.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/28db055bebbf3ee016a2144c8b69dd7b80b48cc5"><code>28db055</code></a>
Bump version: 3.0.0 → 4.0.0</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/57e9354a86f658556fe6f15f07625c4b9a9ddf53"><code>57e9354</code></a>
Really update the changelog.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/56b810b91c9ae15d1462633c6a8a1b522ebf8e65"><code>56b810b</code></a>
Update chagelog.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/f7fced579e36b72b57e14768026467e4c4511a40"><code>f7fced5</code></a>
Add support for LCOV output</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/1211d3134bb74abb7b00c3c2209091aaab440417"><code>1211d31</code></a>
Fix flake8 error</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/b077753f5d9d200815fe500d0ef23e306784e65b"><code>b077753</code></a>
Use modern approach to specify hook options</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/00713b3fec90cb8c98a9e4bfb3212e574c08e67b"><code>00713b3</code></a>
removed incorrect docs on <code>data_file</code>.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/b3dda36fddd3ca75689bb3645cd320aa8392aaf3"><code>b3dda36</code></a>
Improve workflow with a collecting status check. (<a
href="https://github-redirect.dependabot.com/pytest-dev/pytest-cov/issues/548">#548</a>)</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/218419f665229d61356f1eea3ddc8e18aa21f87c"><code>218419f</code></a>
Prevent undesirable new lines to be displayed when report is
disabled</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/60b73ec673c60942a3cf052ee8a1fdc442840558"><code>60b73ec</code></a>
migrate build command from distutils to setuptools</li>
<li>Additional commits viewable in <a
href="https://github.com/pytest-dev/pytest-cov/compare/v2.12.0...v4.0.0">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants