WIP: draft PR merging Jason's changes to main. #20

Draft
wants to merge 34 commits into base: main

Commits
8280ac1
added From<&Kmer> for CanonicalKmer
theJasonFan Mar 7, 2023
df32c7b
derive serde::{Serialize, Deserialize} to pub structs, derive Copy fo…
theJasonFan Mar 9, 2023
63f53c4
Remove copy for SeqVecSlice
theJasonFan Mar 9, 2023
1aa4bb4
remove Eq, PartialEq derivation from SeqVecSlice
theJasonFan Mar 9, 2023
4c40e9f
add set_chars to seqvector, fixed module use errors in minimizers
theJasonFan Mar 20, 2023
2be15d8
cargo fmt
theJasonFan Mar 20, 2023
b965b99
new and with_len for SeqVector
theJasonFan Mar 20, 2023
320eadf
fix push_chars and set_chars for u64 aligned push and set for seqvector
theJasonFan Mar 20, 2023
26b84b5
fix length check SeqVector::set_chars
theJasonFan Mar 20, 2023
e3bf165
remove dbg!() statements
theJasonFan Mar 22, 2023
c1db195
num_bits for seqvectors
theJasonFan Jul 3, 2023
73d753b
wip: canonical minimizer iterator
theJasonFan Jul 3, 2023
1b580d4
wip: canonical minimizers
theJasonFan Jul 3, 2023
cbfdfbc
use simpler serde compat from simple-sds fork
theJasonFan Jul 3, 2023
6c71294
canonical minimizer iterator
theJasonFan Jul 3, 2023
348509b
todo: test get canonical minimizer
theJasonFan Jul 3, 2023
a53bec1
wip: test canonical minimizer getter
theJasonFan Jul 3, 2023
bcb20ae
Kmer::canonical_minimizer()
theJasonFan Jul 4, 2023
15a3125
simplify canonical minimizer comp
theJasonFan Jul 4, 2023
e4b2e03
fix impls of canonical minimizers, s.t. mini(g*) = mini(min(g, g'))
theJasonFan Jul 14, 2023
8cf4596
getter for mappedminimizer
theJasonFan Jul 14, 2023
9d14d22
expose manipulation of super k-mer positions
theJasonFan Jul 14, 2023
042134a
fix super kmer iter
theJasonFan Jul 14, 2023
6d271b3
fixed canonical super-k-mer enumeration
theJasonFan Jul 15, 2023
ba88f3c
impl Iterator for CanonicalKmerIterator
theJasonFan Jul 21, 2023
7cbfa94
Merge pull request #17 from COMBINE-lab/main
rob-p Aug 16, 2023
b756de0
remove canonical super-k-mers
theJasonFan Aug 16, 2023
c810bed
make mapped minimizer position pub
theJasonFan Aug 16, 2023
75d0d8f
feat: failable encoding and kmer conversion from bytes
theJasonFan Aug 22, 2023
5582d13
feat: kmer! macro
theJasonFan Aug 23, 2023
d945c29
add: jf contrib to readme
theJasonFan Aug 23, 2023
c79b590
merge upstream/dev
theJasonFan Aug 24, 2023
1514f97
Merge pull request #18 from theJasonFan/main
rob-p Nov 5, 2023
c2b6e16
point to our own sds
rob-p Nov 5, 2023
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
/target
Cargo.lock
*.bk
.vscode
4 changes: 2 additions & 2 deletions Cargo.toml
@@ -10,7 +10,7 @@ edition = "2021"
bit_field = "0.10"
serde = { version = "1.0", features = ["derive"] }
num = "0.4.0"
-simple-sds = {git = "https://github.com/thejasonfan/simple-sds", branch = "serde_compat", optional = true }
+simple-sds = {git = "https://github.com/COMBINE-lab/simple-sds", branch = "simpler_serde_compat", optional = true }

[features]
seq-vector = ["dep:simple-sds"]
@@ -23,4 +23,4 @@ quickcheck_macros = "1"

[[bench]]
name = "simple_benchmark"
-harness = false
+harness = false
1 change: 1 addition & 0 deletions README.md
@@ -13,6 +13,7 @@ potentially interested parties via [twitter](https://twitter.com/nomad421/status
* [Pierre Marijon](https://github.com/natir)
* [Luiz Irber](https://github.com/luizirber)
* [Rob Patro](https://github.com/rob-p)
* [Jason Fan](https://github.com/theJasonFan)

## Minimum supported Rust version

4 changes: 3 additions & 1 deletion src/kmer.rs
@@ -2,13 +2,15 @@

/* crate use */
use bit_field::BitArray;
use serde::{Deserialize, Serialize};
use std::u32;

Check failure on line 6 in src/kmer.rs (GitHub Actions / Lints): importing legacy numeric constants

/* project use */
use crate::encoding;

/// Struct to store and use kmer
-#[derive(Debug)]
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
#[serde(bound = "[P; B]: Serialize + for<'a> Deserialize<'a>")]
pub struct Kmer<P, const K: usize, const B: usize> {
array: [P; B],
}
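
The failed Lints check on `use std::u32;` reads like Clippy's `legacy_numeric_constants` lint. A minimal sketch of the usual remedy, assuming the import was only there to reach constants such as `MAX`: the associated constants on the primitive type need no import at all. The helper `low_bits_mask` is hypothetical and not part of this PR.

```rust
// No `use std::u32;` is needed here: `u32::MAX` and `u32::BITS` are
// associated constants defined on the primitive type itself.

/// Hypothetical helper (not from the PR): a mask selecting the low `n` bits of a `u32`.
pub fn low_bits_mask(n: u32) -> u32 {
    if n >= u32::BITS {
        // Shifting a u32 by 32 or more would overflow, so saturate to all ones.
        u32::MAX
    } else {
        (1u32 << n) - 1
    }
}
```

If `src/kmer.rs` uses the module only for such constants, dropping the import and writing `u32::MAX` directly should satisfy the check.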
14 changes: 12 additions & 2 deletions src/naive_impl/canonical_kmer.rs
@@ -4,14 +4,14 @@ use super::Kmer;
use serde::{Deserialize, Serialize};
use std::convert::From;

-#[derive(Serialize, Deserialize, Eq, PartialEq, Debug, Clone, Copy)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum MatchType {
NoMatch,
IdentityMatch,
TwinMatch,
}

-#[derive(Eq, PartialEq, Default, Debug, Clone, Ord, PartialOrd)]
+#[derive(Eq, PartialEq, Default, Debug, Clone, Ord, PartialOrd, Serialize, Deserialize)]
pub struct CanonicalKmer {
fw: Kmer,
rc: Kmer,
@@ -171,6 +171,16 @@ impl From<Kmer> for CanonicalKmer {
}
}

impl From<&Kmer> for CanonicalKmer {
#[inline]
fn from(km: &Kmer) -> Self {
Self {
rc: km.to_reverse_complement(),
fw: km.clone(),
}
}
}
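
The new `From<&Kmer>` impl lets callers build a `CanonicalKmer` without giving up ownership of the k-mer. A self-contained sketch of the same pattern follows. The `Kmer` here is a toy stand-in (2 bits per base, first base in the lowest bits), and `from_codes`, `fw()` and `rc()` are hypothetical helpers added only for illustration; the crate's real types may differ.

```rust
/// Toy stand-in for the crate's `Kmer`: 2 bits per base, first base in the lowest bits.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Kmer {
    data: u64,
    k: u8,
}

impl Kmer {
    /// Hypothetical constructor from 2-bit codes (A=0, C=1, G=2, T=3), at most 32 of them.
    pub fn from_codes(codes: &[u64]) -> Self {
        let mut data = 0u64;
        for &c in codes.iter().rev() {
            data = (data << 2) | (c & 3);
        }
        Kmer { data, k: codes.len() as u8 }
    }

    pub fn to_reverse_complement(&self) -> Self {
        let mut fw = self.data;
        let mut rc = 0u64;
        for _ in 0..self.k {
            // Pop the next forward base, complement it (3 - code), and push it
            // onto `rc` so that the last forward base ends up in the lowest bits.
            rc = (rc << 2) | (3 - (fw & 3));
            fw >>= 2;
        }
        Kmer { data: rc, k: self.k }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CanonicalKmer {
    fw: Kmer,
    rc: Kmer,
}

impl CanonicalKmer {
    pub fn fw(&self) -> &Kmer {
        &self.fw
    }
    pub fn rc(&self) -> &Kmer {
        &self.rc
    }
}

impl From<&Kmer> for CanonicalKmer {
    fn from(km: &Kmer) -> Self {
        // Borrowing conversion: the caller keeps `km`, at the cost of one clone.
        Self { rc: km.to_reverse_complement(), fw: km.clone() }
    }
}

impl From<Kmer> for CanonicalKmer {
    fn from(km: Kmer) -> Self {
        // Owning conversion: compute the reverse complement first, then move `km`.
        Self { rc: km.to_reverse_complement(), fw: km }
    }
}
```

With both impls in place, `CanonicalKmer::from(&km)` and `CanonicalKmer::from(km)` produce equal values; the borrowed form just pays for a clone.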

impl From<String> for CanonicalKmer {
fn from(s: String) -> Self {
let fk = Kmer::from(s);
49 changes: 46 additions & 3 deletions src/naive_impl/canonical_kmer_iterator.rs
@@ -6,17 +6,32 @@

use super::prelude::*;
use super::CanonicalKmer;
use serde::{Deserialize, Serialize};

// holds what is essentially a pair of
// km: the canonical k-mer on the read
// pos: the offset on the read where this k-mer starts
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct CanonicalKmerPos {
pub km: CanonicalKmer,
-pub pos: i32,
+pub(super) pos: i32, // FIXME: change to use Wrapping<...>
}

impl CanonicalKmerPos {
-fn new(k: u8) -> Self {
+pub fn new(km: CanonicalKmer, pos: usize) -> Self {
Self {
km,
pos: pos as i32,
}
}

pub fn pos(&self) -> usize {
self.pos as usize
}
}
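
The constructor and getter above convert between `usize` and `i32` with `as`, which silently truncates large positions and turns the `-1` sentinel into `usize::MAX`. One possible alternative, sketched here under the assumption that callers can handle an `Option`, is to use the checked `TryFrom` conversions from the standard library. The `Pos` type and its methods are hypothetical and not part of this PR.

```rust
use std::convert::TryFrom;

/// Hypothetical position wrapper; `-1` marks "no position yet", as in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Pos(i32);

impl Pos {
    pub const BLANK: Pos = Pos(-1);

    /// Returns `None` if `pos` does not fit in an `i32`.
    pub fn new(pos: usize) -> Option<Self> {
        i32::try_from(pos).ok().map(Pos)
    }

    /// Returns `None` for the blank sentinel (or any other negative value).
    pub fn get(self) -> Option<usize> {
        usize::try_from(self.0).ok()
    }
}
```

Whether the extra `Option` is worth it depends on how hot this path is; the `as` casts in the diff are cheaper but rely on callers never reading a blank position.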

impl CanonicalKmerPos {
fn blank_of_size(k: u8) -> Self {
Self {
km: CanonicalKmer::blank_of_size(k),
pos: -1i32,
@@ -29,6 +44,7 @@ impl CanonicalKmerPos {
// It is capable of iterating over this sequence (skipping invalid
// k-mers, e.g. k-mers containing `N`), and producing
// a `CanonicalKmerPos` struct for all valid k-mers in `seq`.
#[derive(Debug, Clone)]
pub struct CanonicalKmerIterator<'a> {
seq: &'a [u8],
value_pair: CanonicalKmerPos,
@@ -72,7 +88,7 @@ impl<'slice> CanonicalKmerIterator<'slice> {
pub fn from_u8_slice(s: &'slice [u8], k: u8) -> CanonicalKmerIterator {
let mut r = Self {
seq: s,
-value_pair: CanonicalKmerPos::new(k),
+value_pair: CanonicalKmerPos::blank_of_size(k),
invalid: false,
last_invalid: -1i32,
k: k as i32,
@@ -116,6 +132,20 @@ impl<'slice> CanonicalKmerIterator<'slice> {
}
}

impl Iterator for CanonicalKmerIterator<'_> {
type Item = CanonicalKmerPos;

fn next(&mut self) -> Option<Self::Item> {
if self.exhausted() {
None
} else {
let item = Some(self.get().clone());
self.inc();
item
}
}
}
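
The `Iterator` impl above wraps the existing `exhausted()` / `get()` / `inc()` cursor protocol so the type works with `for` loops and adapters such as `collect()`. The same adaptation can be sketched on a toy cursor. `ValidWindowCursor` is hypothetical: it yields `(position, window)` pairs for every length-`k` window made only of A/C/G/T and does no canonicalization, unlike the crate's iterator.

```rust
/// Toy cursor (not the crate's type): walks the length-`k` windows of `seq`
/// and stops only on windows made purely of A/C/G/T (either case).
#[derive(Debug, Clone)]
pub struct ValidWindowCursor<'a> {
    seq: &'a [u8],
    k: usize,
    pos: usize,
}

impl<'a> ValidWindowCursor<'a> {
    pub fn new(seq: &'a [u8], k: usize) -> Self {
        let mut c = ValidWindowCursor { seq, k, pos: 0 };
        c.skip_invalid();
        c
    }

    fn is_valid(&self) -> bool {
        self.seq[self.pos..self.pos + self.k]
            .iter()
            .all(|b| matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
    }

    fn skip_invalid(&mut self) {
        while !self.exhausted() && !self.is_valid() {
            self.pos += 1;
        }
    }

    pub fn exhausted(&self) -> bool {
        self.k == 0 || self.pos + self.k > self.seq.len()
    }

    /// Current (position, window); only meaningful while `!exhausted()`.
    pub fn get(&self) -> (usize, &'a [u8]) {
        (self.pos, &self.seq[self.pos..self.pos + self.k])
    }

    pub fn inc(&mut self) {
        self.pos += 1;
        self.skip_invalid();
    }
}

impl<'a> Iterator for ValidWindowCursor<'a> {
    type Item = (usize, &'a [u8]);

    fn next(&mut self) -> Option<Self::Item> {
        if self.exhausted() {
            None
        } else {
            // Read the current item before advancing, mirroring the impl in the diff.
            let item = self.get();
            self.inc();
            Some(item)
        }
    }
}
```

On the read `NAAANTTT` with `k = 3`, this toy cursor yields windows at positions 1 and 5, the same positions the PR's `test_iter` expects.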

#[cfg(test)]
mod tests {
use super::*;
@@ -204,4 +234,17 @@ mod tests {
ck_iter.inc();
assert!(ck_iter.exhausted());
}

#[test]
fn test_iter() {
let r = b"NAAANTTT";
let k = 3;
let ck_iter = CanonicalKmerIterator::from_u8_slice(r, k);
let kms: Vec<CanonicalKmerPos> = ck_iter.collect();
let kws = vec![
CanonicalKmerPos::new(CanonicalKmer::from("AAA"), 1),
CanonicalKmerPos::new(CanonicalKmer::from("TTT"), 5),
];
assert_eq!(kms, kws)
}
}
93 changes: 93 additions & 0 deletions src/naive_impl/checked.rs
@@ -0,0 +1,93 @@
use super::{Base, Kmer};

/// This module can only return one error, the EncodeError
#[derive(Debug, PartialEq, Eq)]
pub struct EncodeError;
type Result<T> = std::result::Result<T, EncodeError>;

impl Kmer {
/// Fallible conversion from byte slice to Kmer
pub fn from_bytes_checked(s: &[u8]) -> Result<Self> {
if s.len() > 32 {
panic!("kmers longer than 32 bases not supported");
}

let k = s.len() as u8;

let mut w = 0_u64;
// read sequence "left to right" from "lower to higher" order bits
for c in s.iter().rev() {
w <<= 2;
w |= encode_binary_checked(*c as char)?;
}
let data = w;

Ok(Kmer { data, k })
}
}

/// Fallible encoding
pub fn encode_binary_checked(c: char) -> Result<Base> {
let code = CODES[c as usize];
if code >= 0 {
Ok(code as Base)
} else {
Err(EncodeError)
}
}
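
The checked conversion packs bases two bits at a time, iterating the input in reverse so that the first base lands in the lowest bits, and propagates the first encoding failure with `?`. A standalone sketch of that flow is below. It uses a `match` instead of the lookup table and, unlike the diff, reports over-long input as an error rather than panicking; `encode_base_checked` and `pack_checked` are hypothetical names.

```rust
/// Single error type, mirroring the unit-struct error in the diff.
#[derive(Debug, PartialEq, Eq)]
pub struct EncodeError;

/// Hypothetical per-base encoder: A/C/G/T (either case) map to 0..=3.
pub fn encode_base_checked(b: u8) -> Result<u64, EncodeError> {
    match b.to_ascii_uppercase() {
        b'A' => Ok(0),
        b'C' => Ok(1),
        b'G' => Ok(2),
        b'T' => Ok(3),
        _ => Err(EncodeError),
    }
}

/// Packs up to 32 bases into a `u64` and returns `(packed word, k)`.
pub fn pack_checked(s: &[u8]) -> Result<(u64, u8), EncodeError> {
    if s.len() > 32 {
        // Assumption of this sketch: treat over-long input as an error, not a panic.
        return Err(EncodeError);
    }
    let mut w = 0u64;
    // Reverse iteration puts the first base of `s` in the lowest two bits.
    for &b in s.iter().rev() {
        w = (w << 2) | encode_base_checked(b)?;
    }
    Ok((w, s.len() as u8))
}
```

Taking bytes rather than `char` also sidesteps indexing a 256-entry table with a value above 255, which `encode_binary_checked(c: char)` would hit for non-Latin-1 characters.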

// see Kmer.hpp
const R: i32 = -1;
const I: i32 = -2;
const O: i32 = -3;
const A: i32 = 0;
const C: i32 = 1;
const G: i32 = 2;
const T: i32 = 3;

const CODES: [i32; 256] = [
O, O, O, O, O, O, O, O, O, O, I, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
O, O, O, O, O, O, O, O, O, O, O, O, O, R, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
O, A, R, C, R, O, O, G, R, O, O, R, O, R, R, O, O, O, R, R, T, O, R, R, R, R, O, O, O, O, O, O,
O, A, R, C, R, O, O, G, R, O, O, R, O, R, R, O, O, O, R, R, T, O, R, R, R, R, O, O, O, O, O, O,
O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,
];
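
The 256-entry table maps every byte value to either a 2-bit code or a negative marker, so a lookup is one array index plus a sign check. As a sketch of an alternative way to write such a table, it can be generated by a `const fn` instead of being typed out. This simplified version collapses the diff's three negative classes (`R`, `I`, `O`) into one `OTHER` marker, and all names in it are hypothetical.

```rust
/// Marker for every byte that is not A/C/G/T (simplification of R/I/O in the diff).
pub const OTHER: i8 = -3;

/// Builds the lookup table at compile time.
const fn build_codes() -> [i8; 256] {
    let mut table = [OTHER; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'T' as usize] = 3;
    table[b'a' as usize] = 0;
    table[b'c' as usize] = 1;
    table[b'g' as usize] = 2;
    table[b't' as usize] = 3;
    table
}

pub const CODES: [i8; 256] = build_codes();

/// Table lookup keyed by `u8`, so the index is always in bounds.
pub fn encode_checked(b: u8) -> Option<u8> {
    let code = CODES[b as usize];
    if code >= 0 {
        Some(code as u8)
    } else {
        None
    }
}
```

A generated table is easier to audit than a hand-typed grid, though the explicit grid in the diff keeps the correspondence with the original `Kmer.hpp` visible.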

#[cfg(test)]
mod test {
use super::*;

#[test]
fn checked_nuc_encoding() {
let allowed = ['A', 'C', 'G', 'T', 'a', 'c', 'g', 't'];
let expected_err = Err(EncodeError);
for c in 0..(char::MAX as u8) {
let c = c as char;
if !allowed.contains(&c) {
assert_eq!(expected_err, encode_binary_checked(c));
}
}

let nucs: Vec<Result<u64>> = allowed.iter().map(|c| encode_binary_checked(*c)).collect();
let codes: Vec<Result<u64>> = [0, 1, 2, 3, 0, 1, 2, 3].iter().map(|i| Ok(*i)).collect();
assert_eq!(nucs, codes);
}

#[test]
fn checked_kmer_encoding() {
let bytes = b"ANa";
let kw = Kmer::from_bytes_checked(bytes);

assert_eq!(kw, Err(EncodeError));

let bytes = b"acgt";
let km = Kmer::from(bytes);
let kw = Kmer::from_bytes_checked(bytes).unwrap();

assert_eq!(km, kw);
assert_eq!(kw.len(), 4);
}
}