htsget: specify how to request unmapped (unplaced) reads #320

mlin · 2018-06-01T14:59:34Z

...by setting referenceName=*. This is kind of the obvious approach following SAM where unmapped reads have RNAME=*, but other ideas are welcome!

mlin · 2018-06-01T15:00:12Z

@saupchurch interest

lbergelson · 2018-06-01T15:39:47Z

Is it worth talking about requesting a subset of unmapped reads here? Also, guarantees about the ordering of unmapped reads?

magicDGS · 2018-06-04T07:19:59Z

htsget.md

@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma
 `referenceName` 
 _optional_
 </td><td>
-The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
+The reference sequence name, for example "chr1", "1", or "chrX". If "*", unmapped (unplaced) reads are returned. If unspecified, all reads (mapped and unmapped) are returned.


I think that for correctly formatting in markdown you should scape the * (using \*)

magicDGS · 2018-06-04T07:20:15Z

htsget.md

@@ -175,7 +175,7 @@ _optional 32-bit unsigned integer_
 </td><td>
 The start position of the range on the reference, 0-based, inclusive. 

-The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`).
+The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is unspecified or * (see `referenceName`).


same as before

Quotation marks around * here, and below too.

magicDGS · 2018-06-04T07:20:27Z

htsget.md

@@ -185,7 +185,7 @@ _optional 32-bit unsigned integer_
 </td><td>
 The end position of the range on the reference, 0-based exclusive.

-The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is not specified (see `referenceName`).
+The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is unspecified or * (see `referenceName`).


jmarshall · 2018-06-05T10:44:09Z

I'm not sure I'm loving referenceName=* — this is a textual URL and we have plenty of space to invent a new key like whatIwantis=unmapped (to be used instead of referenceName/start/end), but I haven't thought about it long enough to have a specific idea.

In any case you should say what you're using unplaced to mean, maybe by adapting the footnote in the SAM spec.

jmarshall · 2018-06-05T11:01:30Z

Having thought about it a little longer: both this and #311 (header-only) are wanting to add something to be used instead of the existing referenceName/start/end query parameters. Perhaps we can come up with a common parameter that would suit both of these use cases?

For example, class=unplaced and class=headers. Other name suggestions are welcome; I tried to think of something generic and similar to the existing orthogonal to referenceName/start/end parameters (namely fields and tags).

saupchurch · 2018-06-14T17:47:06Z

I'm not sure of the value of subsets of unmapped reads. The genesis of this request comes from fulfilling an RNA-seq use case. For users who wish to re-align a dataset they need to regenerate the full set of initial reads from the BAM/CRAM file (as the fastq is not in scope of htsget). As long as referenceName=* or equivalent returns everything we will be able to support this use case.

lbergelson · 2018-06-14T19:43:41Z

The use case I was imagining was using htsget to serve a distributed computation. Since the unmapped reads can be a significant chunk of the bam, if you want to process them as part of your job you'll want to divide them into multiple shards, otherwise you end up with one extremely long running task dealing with the unmapped reads.

For that purpose it would be very useful to be able to

find out how many unmapped reads there are
request a subset of them at a time

jkbonfield · 2018-06-15T08:51:59Z

This depends on the nature of the index.

For cram it's possible to do random access at the block level and I wrote a tool to do this way back when in the early days of the streaming API: https://groups.google.com/forum/#!searchin/ga4gh-dwg-directory-api-and-streaming/cram_filter|sort:date/ga4gh-dwg-directory-api-and-streaming/oWYF1ASAZ0Y/T82vyqMcDQAJ (see the cram_filter comment).

However I'm not sure this is easily possible in any of the BAI, CSI style indices as they're indexed positions rather than records.

That said, as far as the protocol / language goes, we can define this now and do implementations as and when they mature.

lbergelson · 2018-06-15T21:23:02Z

@jkbonfield That's a good point. The new splitting index can potentially help. If the exact granularity is specified you can figure out what read number the unmapped reads start at by finding the first split before the unmapped, counting to the start of that position, and then using that as a starting point for jumping into the unmapped ones.

An exact "give me this number of reads" might not be necessary either. Something like "I want the first of N chunks of reads, do your best to split them evenly for me." would be sufficient for my use case.

jmarshall · 2018-07-11T14:35:57Z

htsget.md

@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma
 `referenceName` 
 _optional_
 </td><td>
-The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
+The reference sequence name, for example "chr1", "1", or "chrX". If "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced* reads are a subset of unmapped reads; see the [SAM specification](https://github.com/samtools/hts-specs/raw/master/SAMv1.pdf) for details of this concept)


I'd prefer a description in place here rather than a link. But if it's going to be a link, the URL to use is http://samtools.github.io/hts-specs/SAMv1.pdf.

mlin · 2018-07-11T16:31:21Z

Our 'htsnexus' server implements this, example usage from its tests: https://github.com/dnanexus-rnd/htsnexus/blob/master/test/htsnexus_integration.t#L68

The asterisk appears not to be an issue in URLs and can be provided on a bash shell by double-quoting it as in that example.

jeromekelleher · 2018-07-11T16:41:14Z

Seems to work fine in the Python client:

$ htsget http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878 -r '*' -O tmp.bam -vv
2018-07-11 17:34:23,958 handle_ticket_request(url=http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A, headers={})
2018-07-11 17:34:23,961 Starting new HTTP connection (1): htsnexus.rnd.dnanex.us
2018-07-11 17:34:24,157 http://htsnexus.rnd.dnanex.us:80 "GET /v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A HTTP/1.1" 200 None
2018-07-11 17:34:24,160 handle_data_uri(application/octet-stream;base64, length=8850)
2018-07-11 17:34:24,160 handle_http_url(url=https://dl.dnanex.us/F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam, headers={'referer': 'http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A', 'range': 'bytes=140474753596-140786687527'})
2018-07-11 17:34:24,163 Starting new HTTPS connection (1): dl.dnanex.us
2018-07-11 17:34:31,820 https://dl.dnanex.us:443 "GET /F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam HTTP/1.1" 206 311933932

Note that the '*' is URL encoded as referenceName=%2A here. Looking at the downloaded file we get

$ samtools view tmp.bam | cut -f 1,2,3 | head -n 40
HK3T5CCXX160204:8:2215:26545:9607	97	NC_007605
HJYFJCCXX160204:8:2123:12763:31441	73	NC_007605
HJYFJCCXX160204:8:2123:12763:31441	133	NC_007605
HK35MCCXX160204:5:2221:26778:11242	345	NC_007605
HK35NCCXX160204:2:1206:6664:3577	409	NC_007605
HK3T5CCXX160204:6:2203:15666:37348	345	NC_007605
HK3T5CCXX160204:4:2120:5041:37260	409	NC_007605
HK3T5CCXX160204:6:2204:30218:57319	165	NC_007605
HK3T5CCXX160204:6:2204:30218:57319	1113	NC_007605
HJYFJCCXX160204:8:1121:9993:6636	161	NC_007605
HK35MCCXX160204:2:2117:14387:54612	345	NC_007605
HK3T5CCXX160204:2:1106:25296:72983	409	NC_007605
HK3T5CCXX160204:5:2117:11201:31195	97	NC_007605
HK35MCCXX160204:6:1217:10348:49127	147	NC_007605
HK35MCCXX160204:3:1117:17289:55069	163	NC_007605
HK35NCCXX160204:1:2221:21542:59833	409	NC_007605
HJYFJCCXX160204:4:1120:12936:70346	97	NC_007605
HK3T5CCXX160204:2:2114:21937:40899	409	NC_007605
HK35MCCXX160204:5:2108:25702:26290	161	NC_007605
HK3T5CCXX160204:3:2117:8237:46560	147	NC_007605
HJYFJCCXX160204:4:1101:10003:15162	77	*
HJYFJCCXX160204:4:1101:10003:15162	141	*
HJYFJCCXX160204:4:1101:10013:22423	77	*
HJYFJCCXX160204:4:1101:10013:22423	141	*
HJYFJCCXX160204:4:1101:10135:54700	77	*
HJYFJCCXX160204:4:1101:10135:54700	141	*
HJYFJCCXX160204:4:1101:10155:51711	77	*
HJYFJCCXX160204:4:1101:10155:51711	141	*
HJYFJCCXX160204:4:1101:10257:22739	77	*
HJYFJCCXX160204:4:1101:10257:22739	141	*
HJYFJCCXX160204:4:1101:10277:44292	77	*
HJYFJCCXX160204:4:1101:10277:44292	141	*
HJYFJCCXX160204:4:1101:10368:6126	77	*
HJYFJCCXX160204:4:1101:10368:6126	141	*
HJYFJCCXX160204:4:1101:10429:16217	77	*
HJYFJCCXX160204:4:1101:10429:16217	141	*
HJYFJCCXX160204:4:1101:10510:41462	77	*
HJYFJCCXX160204:4:1101:10510:41462	141	*
HJYFJCCXX160204:4:1101:10541:19399	77	*
HJYFJCCXX160204:4:1101:10541:19399	141	*

Which looks like it's working, right? We get a few records for the last reference name because of block overlaps before getting the '*' records.

jmarshall · 2018-07-11T16:43:05Z

I guess if you're typing https://example.org/reads/foo.bam?referenceName=chr1&end=1000 on the command line you're already quoting it to escape the ? and &…

jeromekelleher · 2018-07-11T16:48:07Z

LGTM, I vote to merge. I don't think we need to state that '*' must be URL encoded as %2A --- clients should do that automatically before sending off the request anyway.

jmarshall · 2018-07-11T17:04:40Z

That question of URL-encoding is exactly the sort of thing I was concerned about.

Do the HTTP standards require * data to be URL-encoded in the query-part of an URL? What happens if you don't? Does it behoove htsget to assign such a character special meaning in its specification?

I see * is in RFC 3986 as a sub-delims character, which should be precent-encoded “unless these characters are specifically allowed by the URI scheme to represent data in that component”. It's unclear whether they are specifically allowed in HTTP(S) query parts. Hmmm…

jeromekelleher · 2018-07-11T17:22:25Z

I think the client should be expected to automatically URL encode any referenceName string, regardless. There's nothing preventing a referenceName of "a:b+c" (probably; I'm not sure what the SAM spec allows and doesn't), which would have to be URL encoded to be transferred correctly. There's no reason to treat '*' any differently. Clients should be expected to URL encode referenceName strings as a matter of course, this is just part of using HTTP.

mlin · 2018-07-13T20:36:38Z

@jmarshall I understand your doubts about overloading referenceName and the URL encoding; equally I hope it's obvious the approach has an expedient rationale (reflecting SAM). Could I press you to please declare if we should consider these doubts as blockers? (which I'm sure we'd respect of course)

jmarshall · 2018-07-19T10:40:40Z

Jerome is correct that clients will need to percent-encode caller-supplied referenceNames as a matter of course. I was thinking also of the convenience of doing things manually, e.g. when hand-typing a request for unplaced unmapped reads. However I have now tested a basic FORM GET and note that both Safari and Firefox in fact don't percent-encode * in HTTP query parts. So I withdraw any percent-encoding-related concerns about giving the * character a special meaning in our protocol.

There remain design considerations.

Samtools et al use a chr[:start-end] notation because it is succinct, human-friendly, and quick and easy to type. HTSlib (samtools/htslib@c8bdd76) rather obscurely shoehorned * into that as a way to ask for the unplaced reads at the end of the file, because it could be fitted into the syntax (as in-band signalling, but happily * is reserved as a reference name).

We don't have to be succinct in our URL query parameters, so we have split that out into referenceName/start/end. So we have a choice as to whether to follow the HTSlib practice with referenceName=* (which the uninitiated may think is asking for all references!) or to further split this out as something similarly textual, e.g., referenceUnmapped=only or class=unplaced, thus avoiding in-band signalling.

I am ambivalent on this — there are pros and cons on both sides. It would be good to hear other group members' opinions.

(I have further editorial comments on the proposed text, should the consensus be to go with *.)

jmarshall · 2018-09-04T10:20:00Z

@lbergelson has suggested extending this in the future to be able to request a shard of the unplaced reads (see #320 (comment)). For example, the protocol could pretend that there is always a scaled total of 10,000 unplaced entries, so that start=1000/end=2000 would represent asking for the second 10% shard out of the pool of unplaced reads.

Then, under this PR's proposal, ?referenceName=*&start=1000&end=2000 would request this second 10% shard. Servers determining how to interpret start/end (whether as real coordinates or scaled counts) would need to look at the value of referenceName and see whether it was * or not.

Or under this other idea, ?class=unplaced&start=1000&end=2000 would request the same thing. Servers determining how to interpret start/end would look at which one of referenceName or class is present.

IMHO the in-band signalling inherent in servers needing to look at the value of one parameter to figure out how to interpret other parameters is a bad URI design and will make it harder to extend the protocol in future.

(I'm happy to be outvoted on this, but that would require other group members to actually declare an opinion!)

jmarshall · 2018-09-04T13:40:01Z

Editorial comments on the text as proposed:

Better as “For the reads endpoint” or similar, rather than “Reads formats only” — this is about the type of data, not the particular formats.
The proposal is for the spec to use the concept of unplaced unmapped reads, so I really think it behooves the spec to define the concept itself rather than merely point to the SAM spec. Something like

Unplaced unmapped reads are the records completely without location information, i.e., with RNAME and POS field values of "*" and 0. (See the SAM specification for further details of placed and unplaced unmapped reads.)
The server SHOULD respond with an InvalidInput error if start is specified and a reference is unspecified or "*" (see referenceName).

This is no longer grammatical. Maybe something like
… if start is specified and either no reference is specified or referenceName is "*".

(Recall that using generic reference rather than specific referenceName in this piece of text leaves the door open for alternative ways of specifying references; cf htsget referenceMD5 query parameter #210.)

…d-reads

mlin · 2018-09-16T21:38:18Z

@jmarshall I've incorporated these most recent suggestions. Thanks!

jmarshall · 2018-09-17T08:21:44Z

Thanks. Some notes on how the suggestions have been incorporated:

The term defined in the SAM spec is “unplaced unmapped read”. I think it's worth using that explicit term with both adjectives.
IMHO being specific in saying that the SAM spec has details of “placed and unplaced unmapped reads” is better than details of some “concept”. (Also missing punctuation.)
The comma added to if start is specified and either no reference is specified, or referenceName is "*" adds ambiguity and confusion about which of and and or has precedence.
The previous “no longer grammatical” comment of course applies to the end paragraph as well.

mlin · 2018-10-02T05:52:02Z

LAST CALL on #320 please!

…d-reads

jeromekelleher · 2018-10-02T08:37:06Z

htsget.md

-The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
+The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all records are returned regardless of position.
+
+For the reads endpoint: if "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced unmapped* reads are the subset of unmapped reads completely without location information, i.e., with RNAME and POS field values of "*" and 0 respectively. See the [SAM specification](http://samtools.github.io/hts-specs/SAMv1.pdf) for details of placed and unplaced unmapped reads.)


Sorry if I missed this in the discussion, but why the backslash in "*"?

Aha. Markdown, that's why. My bad; I should get some more coffee I think...

Back in the first round of review, @magicDGS figured * needed escaping to avoid it being interpreted as markdown emphasis/bold/whatever. I've just tested this in GH Pages's kramdown and GFM and both appear happy either way in this context. ¯\_(ツ)_/¯

Interesting @jmarshall; I just realized in my markdown editor. By the way, there is still an unescaped *, which also renders in pure markdown as emphasis...

There is indeed, later in that line. I've now tried commenting out let html_no_rendering=1 in my .vimrc, and see that vim also highlights that wrongly without the backslash. So I'm now in @magicDGS's camp and agree that the lowest common denominator is that we should escape all of these literal "\*" characters 😄

jeromekelleher · 2018-10-02T08:42:14Z

I've already weighed in here, but for wrapping up purposes I'm in favour of the change. I agree with the points that @jmarshall makes, but since reference=* is already embedded in the SAM/BAM/CRAM workflow it seems sensible to keep compatibility with that.

dkj · 2018-10-02T10:34:47Z

I don't think we should demand an error from the server for specifying start or end with unplaced referenceName=* requests #320 (comment) , although the server could decide to return an error. Prefer to stick with error response requirement if no referenceName and start or end specified. (But would rather this were merged as is than blocked on this!)

jmarshall · 2018-11-28T14:49:09Z

@dkj: The purpose of requiring an error in this case now is to enable the protocol to define a meaning for it in the future. (This is the same argument as here, about MAPQ for unmapped reads.)

If particular implementations wanted to experiment with defining it to mean e.g. such sharding, they should do so in a way that's obviously experimental — cf GitHub's preview APIs being enabled by particular Accept keywords.

dkj · 2018-11-28T15:37:14Z

@jmarshall, thanks: the use of Accept headers would address my concern about not being able to move the protocol on, and so I'm content with the error from the server for specifying start or end with unplaced referenceName=*

jmarshall · 2018-11-28T15:40:15Z

I hate to advocate for complexity, but in the interests of being a grown-up protocol… 😄

…d-reads

mlin · 2018-11-28T20:57:03Z

I escaped the last *, thank you for pointing it out! @magicDGS @jmarshall

Any further comments anyone?

jeromekelleher · 2018-11-29T09:03:01Z

LGTM: squash and merge?

jmarshall · 2018-11-29T09:22:31Z

There's whitespace on the blank line before “For the reads endpoint” that should be tidied up while squash/rebasing. All previously noted editorial trivia has been fixed — thanks.

yfarjoun · 2018-12-04T20:50:13Z

so...this thread has grown large...and I'm not sure this is an appropriate place for the following idea....but it might be:

in GATK we use "unmapped" to specific the reads that have no sequence location (so neither they nor their mates are mapped). We can run tools only on the unmapped reads by specifying -L unmapped on the command-line.

This will break is someone has a sequence named "unmapped"....so I propose adding unmapped to the list of un-allowed sequence names, and then this API can also use unmapped to access the unmapped reads.....

jeromekelleher · 2019-01-23T17:07:05Z

Final agreement for this PR to be merged on the call today. Squash and merge?

mlin · 2019-01-23T18:00:07Z

@yfarjoun the idea is relevant to this htsget-specific PR in the way you mention, but it also pertains to the whole SAM spec so I think it should be discussed at top level. The use of the * symbol here (despite the unfortunate potential for confusion in URL encoding discussed above!) is justified by its existing use for that purpose in SAM.

jmarshall · 2019-02-01T14:25:58Z

@mlin: Please delete the branch when it is no longer relevant.

jmarshall · 2019-03-11T11:10:51Z

@mlin: Please delete this branch if it is no longer relevant.

mlin added 2 commits June 1, 2018 07:52

htsget: specify how to request unmapped (unplaced) reads

4511601

Update htsget.md

ae4bacf

mlin added the htsget label Jun 1, 2018

magicDGS reviewed Jun 4, 2018

View reviewed changes

jmarshall mentioned this pull request Jun 5, 2018

Header and data boolean flags for htsget. #311

Closed

mlin added 2 commits July 11, 2018 01:49

escape asterisks

90341e3

Update htsget.md

43c0856

jmarshall reviewed Jul 11, 2018

View reviewed changes

mlin added 3 commits July 11, 2018 09:16

Update htsget.md

a624066

Update htsget.md

68223c7

Update htsget.md

6c5f68b

jmarshall mentioned this pull request Sep 4, 2018

Make "future enhancements" an unordered list #335

Merged

mlin added 2 commits September 16, 2018 14:36

incorporate @jmarshall suggestions

91b27ae

Merge remote-tracking branch 'origin/master' into mlin-htsget-unmappe…

251f197

…d-reads

incorporate @jmarshall suggestions

aef134f

Merge remote-tracking branch 'origin/master' into mlin-htsget-unmappe…

a84b1dd

…d-reads

jeromekelleher reviewed Oct 2, 2018

View reviewed changes

mlin added 2 commits November 28, 2018 08:55

Merge remote-tracking branch 'origin/master' into mlin-htsget-unmappe…

2cfb6cc

…d-reads

escape *

582a65a

fix whitespace

f913444

yfarjoun mentioned this pull request Dec 6, 2018

Disallow commas and other punctuation in RNAME etc #333

Merged

mlin merged commit c773172 into master Jan 23, 2019

mlin deleted the mlin-htsget-unmapped-reads branch March 11, 2019 17:27

htsget: specify how to request unmapped (unplaced) reads #320

htsget: specify how to request unmapped (unplaced) reads #320

Conversation

mlin commented Jun 1, 2018

mlin commented Jun 1, 2018

lbergelson commented Jun 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmarshall commented Jun 5, 2018

jmarshall commented Jun 5, 2018 • edited Loading

saupchurch commented Jun 14, 2018

lbergelson commented Jun 14, 2018 • edited Loading

jkbonfield commented Jun 15, 2018

lbergelson commented Jun 15, 2018 • edited Loading

Choose a reason for hiding this comment

mlin commented Jul 11, 2018

jeromekelleher commented Jul 11, 2018

jmarshall commented Jul 11, 2018

jeromekelleher commented Jul 11, 2018

jmarshall commented Jul 11, 2018

jeromekelleher commented Jul 11, 2018

mlin commented Jul 13, 2018

jmarshall commented Jul 19, 2018 • edited Loading

jmarshall commented Sep 4, 2018

jmarshall commented Sep 4, 2018 • edited Loading

mlin commented Sep 16, 2018

jmarshall commented Sep 17, 2018

mlin commented Oct 2, 2018

Choose a reason for hiding this comment

jeromekelleher Oct 2, 2018 • edited Loading

Choose a reason for hiding this comment

jmarshall Oct 2, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromekelleher commented Oct 2, 2018

dkj commented Oct 2, 2018

jmarshall commented Nov 28, 2018

dkj commented Nov 28, 2018

jmarshall commented Nov 28, 2018

mlin commented Nov 28, 2018

jeromekelleher commented Nov 29, 2018

jmarshall commented Nov 29, 2018

yfarjoun commented Dec 4, 2018 • edited Loading

jeromekelleher commented Jan 23, 2019

mlin commented Jan 23, 2019

jmarshall commented Feb 1, 2019

jmarshall commented Mar 11, 2019

jmarshall commented Jun 5, 2018 •

edited

Loading

lbergelson commented Jun 14, 2018 •

edited

Loading

lbergelson commented Jun 15, 2018 •

edited

Loading

jmarshall commented Jul 19, 2018 •

edited

Loading

jmarshall commented Sep 4, 2018 •

edited

Loading

jeromekelleher Oct 2, 2018 •

edited

Loading

jmarshall Oct 2, 2018 •

edited

Loading

yfarjoun commented Dec 4, 2018 •

edited

Loading