Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Commit

Permalink
Point to a real S3 object, still need to solve 'failed to fill whole …
Browse files Browse the repository at this point in the history
…buffer' though, I suspect it might be the Cursor? Reading https://rust-lang.github.io/rfcs/0980-read-exact.html, /cc @zaeleus
  • Loading branch information
brainstorm authored Mar 26, 2021
1 parent a1a9e47 commit 9e7a200
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,12 +38,12 @@ async fn stream_s3_object() -> Result<Cursor<Vec<u8>>, Error> {
let mut s3_obj_buffer = Cursor::new(Vec::new());
let aws = Storage {
region: Region::ApSoutheast2,
credentials: Credentials::from_instance_metadata()?,
bucket: "umccr-primary-data-dev".to_string(),
credentials: Credentials::default()?,
bucket: "umccr-research-dev".to_string(),
};

let bucket = Bucket::new(&aws.bucket, aws.region, aws.credentials)?;
bucket.get_object_stream("sample-file.bam", &mut s3_obj_buffer).await?;
bucket.get_object_stream("/htsget/htsnexus_test_NA12878.bam", &mut s3_obj_buffer).await?;
return Ok(s3_obj_buffer);
}

Expand Down

7 comments on commit 9e7a200

@zaeleus
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need to solve 'failed to fill whole buffer' though

The cursor position is moved, in this case, to the end when writing to the buffer; so there's nothing for the reader to consume. Try resetting the position to the start before reading: s3_obj_buffer.set_position(0).

@brainstorm
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch, thanks, that did the trick:

diff --git a/src/main.rs b/src/main.rs
index ce6843a..0700257 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -8,7 +8,6 @@ use s3::creds::Credentials;
 use s3::region::Region;
 
 use noodles_bam as bam;
-use noodles_sam as sam;
 
 struct Storage {
     region: Region,
@@ -18,7 +17,7 @@ struct Storage {
 
 #[tokio::main]
 async fn main() -> Result<(), Box<dyn Error>> {
-    s3_read_bam_header().await?;
+    dbg!(s3_read_bam_header().await?);
     Ok(())
 }
 
@@ -39,6 +38,8 @@ async fn stream_s3_object() -> Result<Cursor<Vec<u8>>, Box<dyn Error>> {
 
     let bucket = Bucket::new(&aws.bucket, aws.region, aws.credentials)?;
     bucket.get_object_stream("/htsget/htsnexus_test_NA12878.bam", &mut s3_obj_buffer).await?;
+    // Rewind buffer Cursor after writing, so that next reader can consume data 
+    s3_obj_buffer.set_position(0);
     return Ok(s3_obj_buffer);
 }
 
@@ -46,19 +47,8 @@ async fn stream_s3_object() -> Result<Cursor<Vec<u8>>, Box<dyn Error>> {
 async fn read_bam_header(bam_bytes: Cursor<Vec<u8>>) -> Result<Value, Box<dyn Error>> {
     let mut reader = bam::Reader::new(bam_bytes);
     let header = reader.read_header()?;
+    //let refseqs = reader.read_reference_sequences().map(|x| x[1].unwrap().name);
 
-    if header.is_empty() {
-        let reference_sequences = reader.read_reference_sequences()?;
-        let mut builder = sam::Header::builder();
-
-        for reference_sequence in reference_sequences {
-            builder = builder.add_reference_sequence(reference_sequence);
-        }
-
-        Ok(json!({ "header": builder.build().to_string(),
-                   "message": "success" }))
-    } else {
-        Ok(json!({ "header": header,
-                   "message": "success" }))
-    }
+    Ok(json!({ "header": header,
+               "message": "success" }))
 }

@brainstorm
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The io::Result<String> type of read_header is a bit rough for pretty-printing. I'm thinking about PR-ing Noodles so that read_header it returns a struct instead and so that can be easily consumed by SerDe?

Is that a change you'd welcome? ;)

/cc @victorskl

@zaeleus
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bam::Reader::read_header intentionally returns a raw string. The implementation of std::str::FromStr for sam::Header can be used to parse the header:

let raw_header = reader.read_header()?;
let header: sam::Header = raw_header.parse()?;
eprintln!("{:#?}", header);
Output
Header {
    header: Some(
        Header {
            version: Version {
                major: 1,
                minor: 6,
            },
            sort_order: None,
            group_order: None,
            subsort_order: None,
            fields: {},
        },
    ),
    reference_sequences: {
        "sq0": ReferenceSequence {
            name: "sq0",
            len: 8,
            alternative_locus: None,
            alternative_names: None,
            assembly_id: None,
            description: None,
            md5_checksum: None,
            species: None,
            molecule_topology: None,
            uri: None,
            fields: {},
        },
        "sq1": ReferenceSequence {
            name: "sq1",
            len: 13,
            alternative_locus: None,
            alternative_names: None,
            assembly_id: None,
            description: None,
            md5_checksum: None,
            species: None,
            molecule_topology: None,
            uri: None,
            fields: {},
        },
        "sq2": ReferenceSequence {
            name: "sq2",
            len: 21,
            alternative_locus: None,
            alternative_names: None,
            assembly_id: None,
            description: None,
            md5_checksum: None,
            species: None,
            molecule_topology: None,
            uri: None,
            fields: {},
        },
    },
    read_groups: {},
    programs: {
        "noodles-bam": Program {
            id: "noodles-bam",
            name: None,
            command_line: None,
            previous_id: None,
            description: None,
            version: None,
            fields: {},
        },
    },
    comments: [
        "an example BAM written by noodles-bam",
    ],
}

Serialialization is done via std::fmt::Display, e.g.,

println!("{}", header);
Output
@HD     VN:1.6
@SQ     SN:sq0  LN:8
@SQ     SN:sq1  LN:13
@SQ     SN:sq2  LN:21
@PG     ID:noodles-bam
@CO     an example BAM written by noodles-bam

@brainstorm
Copy link
Owner Author

@brainstorm brainstorm commented on 9e7a200 Mar 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zaeleus, I understand that, I went through a similar std::fmt::Display situation with YaSerDe... but my question was more w.r.t serde::ser::Serialize, namely, if I wanted to serialize the SAM header to JSON, like this:

--git a/src/main.rs b/src/main.rs
index 41d3bc8..67dfbdc 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -46,7 +47,10 @@ async fn stream_s3_object() -> Result<Cursor<Vec<u8>>, Box<dyn Error>> {
 /// Reads BAM S3 object header
 async fn read_bam_header(bam_bytes: Cursor<Vec<u8>>) -> Result<Value, Box<dyn Error>> {
     let mut reader = bam::Reader::new(bam_bytes);
-    let header = reader.read_header()?;
+    let raw_header = reader.read_header()?;
+    let header: sam::Header = raw_header.parse()?;

     Ok(json!({ "header": header,
                "message": "success" }))

I'm met with the following non implemented trait (understandably):

error[E0277]: the trait bound `noodles_sam::Header: serde::ser::Serialize` is not satisfied
   --> src/main.rs:55:8
    |
55  |       Ok(json!({ "header": header,
    |  ________^
56  | |                "message": "success" }))
    | |______________________________________^ the trait `serde::ser::Serialize` is not implemented for `noodles_sam::Header`
    |
   ::: /Users/rvalls/.cargo/registry/src/github.com-1ecc6299db9ec823/serde_json-1.0.64/src/value/mod.rs:965:8
    |
965 |       T: Serialize,
    |          --------- required by this bound in `to_value`
    |
    = note: required because of the requirements on the impl of `serde::ser::Serialize` for `&noodles_sam::Header`
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

For more information about this error, try `rustc --explain E0277`.
error: could not compile `s3-bamheader`

My guess is that you'll like this to be a future third party crate impl (NoodlesSerDe?) instead of bolted into your crate(s)?

I'm just trying to get a pulse on how lean vs feature-rich you want Noodles to be or become. Since you are the maintainer, I will totally respect your philosophy and thinking from now on in either way since I have an intuition about the tradeoffs, I just want to see where you err at the moment.

Or perhaps there's already a way to do this? If that's the case I'm learning as I go through your codebase, so bear with me. Also, there's no rush at all to reply on a weekend, btw ;)

/cc @victorskl

@zaeleus
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're asking. No, there is currently no implementation of this in noodles. You would have to write your own serializer for each header type. This itself isn't hard but may be tedious, given the number of optional fields in each type.

This seems too niche to include as a standard feature. Is there an application or spec that reads SAM headers as something serialized other than tab-delimited text?

@brainstorm
Copy link
Owner Author

@brainstorm brainstorm commented on 9e7a200 Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too niche to include as a standard feature.

Agreed. I've been discussing with @chris-zen on how hard would it be to annotate the structs for SerDe in a third party crate. Not urgent at all, just curiosity level at this point in time.

Is there an application or spec that reads SAM headers as something serialized other than tab-delimited text?

Indeed, other groups have used Apache Parquet in the past, such as ADAM, Presto/Athena UDFs and also other formats such as ORC, which could be used by Spark bioinformatics tooling.

In a less "big data" mindset, simple examples that spit out JSON can be fairly amenable for small WASM prototypes, genome visualizers and whatnot. Hope that makes sense?

Please sign in to comment.