[MRG] update JOSS paper for v4 #1361

Merged
merged 37 commits into from
Aug 16, 2023
Changes from 34 commits
Commits
2088689
update JOSS paper for v4
ctb Mar 4, 2021
eb9a57d
more author update
ctb Mar 4, 2021
bae3cce
update text
ctb Mar 4, 2021
b7a598a
Merge branch 'latest' into update/joss
ctb Mar 4, 2021
6473dfc
Merge branch 'latest' of github.com:dib-lab/sourmash into update/joss
ctb Apr 21, 2021
739b4d4
minor text update
ctb Apr 21, 2021
1c0c5c5
update citations
ctb Apr 21, 2021
06ba978
Update paper.md
ctb May 6, 2021
b37db0f
Update paper.md
ctb May 6, 2021
9a2a243
Merge branch 'latest' of github.com:dib-lab/sourmash into update/joss
ctb May 6, 2021
84bdade
Merge branch 'update/joss' of github.com:dib-lab/sourmash into update…
ctb May 6, 2021
ba2d618
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Jan 3, 2023
e27adc0
add joss github action
ctb Jan 3, 2023
0350d13
[MRG] update JOSS for sourmash 4.4 (#2006)
bluegenes Jan 23, 2023
b0d2385
Merge branch 'latest' into update/joss
bluegenes Jan 23, 2023
874c88b
authorship order
bluegenes Jan 23, 2023
9a21646
add auths; reorder to alphabetical except first/last
bluegenes Jan 23, 2023
9aa5d3b
spacing
bluegenes Jan 23, 2023
4a1f3f4
alpha ordering
bluegenes Jan 23, 2023
297c773
comment @ char
bluegenes Jan 23, 2023
4126bd2
ask instead
bluegenes Jan 23, 2023
3151a4c
rm unnecessary punctuation
bluegenes Jan 24, 2023
b5f39f9
Remove Katrin Leinweber due to insignificant contribution (#2473)
katrinleinweber Mar 9, 2023
b1424cc
Merge branch 'update/joss' of github.com:sourmash-bio/sourmash into u…
bluegenes Mar 9, 2023
5ade425
upd orcid
bluegenes Mar 27, 2023
38c0bb0
upd authorship
bluegenes Mar 27, 2023
738c4c8
rm empty affil
bluegenes Mar 30, 2023
447b39c
add corr, eq info
bluegenes Mar 30, 2023
98ec946
index affiliations
bluegenes Mar 30, 2023
60d2cd8
fix no affil auths
bluegenes Mar 31, 2023
1069e14
fix lg orcid
bluegenes Mar 31, 2023
4375f19
fix idnt
bluegenes Mar 31, 2023
d847591
upd affil
bluegenes Apr 1, 2023
6d6b7b6
add swamidass afill
bluegenes Apr 4, 2023
a334dde
institution!
bluegenes Apr 11, 2023
88dcbd9
add dhs statement from @standage
bluegenes May 4, 2023
d7e6cdd
Merge branch 'latest' into update/joss
ctb Aug 16, 2023
23 changes: 23 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,23 @@
on: [push]

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper.md
      - name: Upload
        uses: actions/upload-artifact@v1
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper.pdf
61 changes: 60 additions & 1 deletion paper.bib
@@ -1,4 +1,4 @@
@article{ondov2015fast,
@article{Ondov:2015,
title={Fast genome and metagenome distance estimation using MinHash},
author={Ondov, Brian D and Treangen, Todd J and Mallonee, Adam B and Bergman, Nicholas H and Koren, Sergey and Phillippy, Adam M},
journal={bioRxiv},
@@ -8,3 +8,62 @@ @article{ondov2015fast
doi={10.1101/029827},
url={https://doi.org/10.1101/029827}
}

@article{Brown:2016,
doi = {10.21105/joss.00027},
url = {https://doi.org/10.21105/joss.00027},
year = {2016},
publisher = {The Open Journal},
volume = {1},
number = {5},
pages = {27},
author = {C. Titus Brown and Luiz Irber},
title = {sourmash: a library for MinHash sketching of DNA},
journal = {Journal of Open Source Software}
}

@article{Pierce:2019,
doi = {10.12688/f1000research.19675.1},
url = {https://doi.org/10.12688/f1000research.19675.1},
year = {2019},
month = jul,
publisher = {F1000 Research Ltd},
volume = {8},
pages = {1006},
author = {N. Tessa Pierce and Luiz Irber and Taylor Reiter and Phillip Brooks and C. Titus Brown},
title = {Large-scale sequence comparisons with sourmash},
journal = {F1000Research}
}
@article{gather,
title={Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers},
author={Irber, Luiz Carlos and Brooks, Phillip T and Reiter, Taylor E and Pierce-Ward, N Tessa and Hera, Mahmudur Rahman and Koslicki, David and Brown, C Titus},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}

@article{branchwater,
title={Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search},
author={Irber, Luiz Carlos and Pierce-Ward, N Tessa and Brown, C Titus},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}

@article{koslicki2019improving,
title={Improving minhash via the containment index with applications to metagenomic analysis},
author={Koslicki, David and Zabeti, Hooman},
journal={Applied Mathematics and Computation},
volume={354},
pages={206--215},
year={2019},
publisher={Elsevier}
}

@article{hera2022debiasing,
title={Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances},
author={Hera, Mahmudur Rahman and Pierce-Ward, N Tessa and Koslicki, David},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}
153 changes: 138 additions & 15 deletions paper.md
@@ -1,32 +1,155 @@
---
title: 'sourmash: a library for MinHash sketching of DNA'
title: 'sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets'
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    affiliation: University of California, Davis
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    affiliation: University of California, Davis
date: 13 Sep 2016
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California, Davis
    index: 1
  - name: Woods Hole Oceanographic Institute
Contributor

Looks good -- but should be Institution

Contributor

😂 I should have caught this - happens with Scripps all the time. Will fix!


    index: 2
  - name: Chan-Zuckerberg Biohub
    index: 3
  - name: Pennsylvania State University
    index: 4
  - name: MPI for Evolutionary Biology
    index: 5
  - name: Swedish Defence Research Agency (FOI)
    index: 6
  - name: National Bioforensic Analysis Center
    index: 7
  - name: Washington University in St Louis
    index: 8
  - name: No affiliation
    index: 9

date: 27 Mar 2023
bibliography: paper.bib
---

# Summary

sourmash is a toolbox for creating, comparing, and manipulating MinHash
sketches of genomic data.
sourmash is a command line tool and Python library for sketching
collections of DNA, RNA, and amino acid k-mers for biological sequence
search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between datasets of different sizes [@gather], including petabase-scale database search [@branchwater]. From release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.

FracMinHash sketching is a lossy compression approach that represents
data sets using a "fractional" sketch containing $1/S$ of the original
k-mers. Like other sequence sketching techniques (e.g. MinHash, [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets and is accurate even for data sets of very different sizes.
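
The FracMinHash idea described above can be illustrated with a minimal pure-Python sketch. This is not sourmash's implementation: the helper names (`frac_sketch`, `hash_kmer`, `jaccard`, `containment`) are hypothetical, BLAKE2b stands in for whatever hash function sourmash actually uses, and reverse-complement (canonical) k-mer handling is omitted. The sketch keeps only k-mer hashes that fall below `2**64 / scaled`, which retains roughly a `1/scaled` fraction of the distinct k-mers (the $1/S$ of the text).

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(seq, k=21, scaled=10):
    """Keep only hashes below MAX_HASH / scaled (about 1/scaled of k-mers)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: shared hashes over all hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


# Toy data: a random "genome" and a fragment that is a quarter of its length.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
fragment = genome[:5000]

big = frac_sketch(genome)
small = frac_sketch(fragment)

print("sketch sizes:", len(big), len(small))
print("Jaccard:", jaccard(small, big))
print("containment of fragment in genome:", containment(small, big))
print("containment of genome in fragment:", containment(big, small))
```

Because every k-mer of the fragment also occurs in the genome, and the same hash threshold is applied to both, the fragment's sketch is a subset of the genome's sketch. The containment of the fragment in the genome is therefore exactly 1, while the Jaccard similarity is pulled down by the size difference; this is the sense in which containment remains informative for data sets of very different sizes.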

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data sets
of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these to estimated Average Nucleotide Identity (ANI)
values, which can provide improved biological context to sketch comparisons [@hera2022debiasing].
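
As a rough illustration of how a containment value can be related to ANI, assume a simple model in which each base differs independently with probability $1 - \mathrm{ANI}$. A k-mer is then shared with probability $\mathrm{ANI}^k$, so containment $C \approx \mathrm{ANI}^k$ and $\mathrm{ANI} \approx C^{1/k}$. The sketch below implements only this point estimate under that assumption; `containment_to_ani` is a hypothetical helper, not a sourmash function, and it omits the debiasing and confidence intervals developed in [@hera2022debiasing].

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from k-mer containment, assuming independent
    per-base mutations so that containment is approximately ANI ** ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


# Under the model, 95% identity at k=31 gives containment 0.95 ** 31,
# and the estimator maps that containment back to about 0.95.
simulated_containment = 0.95 ** 31
print(simulated_containment)
print(containment_to_ani(simulated_containment, 31))
```

Even a modest drop in identity produces a large drop in containment at typical k-mer sizes, which is why converting to ANI can make sketch comparisons easier to interpret.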

# Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data
sets are readily available in biology, and the field needs lightweight
computational methods for searching and summarizing the content of
both public and private collections. sourmash provides a flexible set
of programmatic functionality for this purpose, together with a robust
and well-tested command-line interface. It has been used in well over 200
publications (based on citations of @Brown:2016 and @Pierce:2019) and it continues
to expand in functionality.

MinHash sketches provide a lightweight way to store "signatures" of
large DNA or RNA sequence collections, and then compare or search them
using a Jaccard index. MinHash sketches can be used to identify samples,
find similar samples, identify data sets with shared sequences, and
build phylogenetic trees [@ondov2015fast].
# Acknowledgements

sourmash provides a command line script, a Python library, and a CPython
module for MinHash sketches.
This work is funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB].

# References
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -13,6 +13,7 @@ version = "4.6.1"

authors = [
{ name="Luiz Irber", orcid="0000-0003-4371-9659" },
{ name="N. Tessa Pierce-Ward", orcid="0000-0002-2942-5331" },
{ name="Mohamed Abuelanin", orcid="0000-0002-3419-4785" },
{ name="Harriet Alexander", orcid="0000-0003-1308-8008" },
{ name="Abhishek Anant", orcid="0000-0002-5751-2010" },
@@ -33,7 +34,6 @@ authors = [
{ name="Marisa Lim", orcid="0000-0003-2097-8818" },
{ name="Ricky Lim", orcid="0000-0003-1313-7076" },
{ name="Ivan Ogasawara", orcid="0000-0001-5049-4289" },
{ name="N. Tessa Pierce", orcid="0000-0002-2942-5331" },
{ name="Taylor Reiter", orcid="0000-0002-7388-421X" },
{ name="Camille Scott", orcid="0000-0001-8822-8779" },
{ name="Andreas Sjödin", orcid="0000-0001-5350-4219" },