Header and data boolean flags for htsget. #311

jeromekelleher · 2018-05-13T14:39:43Z

This is a rough draft of the proposed change to allow htsget conditionally exclude either header or data body information. This solves two immediate problems:

Clients can request just the header from a particular ID allowing them to find out what reference names are in there. At the moment, the client needs to "know" that a particular BAM has "chr1" in there to make queries.
Clients can request just the data body and skip the headers. This is important for read browsers, as the overhead of repeatedly pulling down the headers is significant.

Issues:

I decided to go with a pair of booleans rather than an enum, as it seemed cleaner. Entirely open to suggestions otherwise here.
I felt I needed to add in a section at the top of the file trying to make it a bit clearer what we mean by 'header' and 'data'. The terminology is unfortunate here as we're using 'headers' repeatedly elsewhere to refer the HTTP headers, and this is definitely confusing. 'Data' is also tremendously vague. Any suggestions for better names would be much appreciated.
I haven't dealt with the case of whether we should make it part of the spec that servers MUST support this feature. I think it's a lot less useful to clients if they have to build in the logic of needing to check whether the server ignored their request to drop the headers of not, so I'm inclined to say we should make this a hard-requirement for conformance to the spec. It doesn't seem like a very hard feature for servers to implement, so this seems OK to me. I'm curious to see what others think though.

cyenyxe · 2018-05-13T19:44:16Z

Does the proposal imply that (BAM, VCF, etc) headers are included in each data block by default, and excluded via a flag? If so, I think it could be easier to separate the header and the data URLs in the htsget response. That way repeated downloads would only occur if the client retrieves the contents of the header URL more than once.

The terminology is unfortunate here as we're using 'headers' repeatedly elsewhere to refer the HTTP headers, and this is definitely confusing. 'Data' is also tremendously vague. Any suggestions for better names would be much appreciated.

Maybe 'metadata' instead of 'headers', and 'body' instead of 'data'? Specifying 'HTTP headers' instead of just 'headers' could help.

jmarshall · 2018-05-14T13:46:02Z

Terminology-wise, within the files we often talk about headers vs. records. So:

HTTP headers
SAM/VCF headers / data headers(?)
reads / variants / SAM/VCF records / alignment records / variant records

jmarshall · 2018-05-14T13:56:16Z

The major problem with the body-only part of this is, as I think we've discussed in the past, a file with no headers is not itself a valid BAM file; similarly for VCF and BCF files.

“Headers-only” is just a BAM/CRAM/BCF file that has no records, which is trivially valid. So that's easy.

The converse is not true, which is why we've punted on it in the past. It may finally be time to have that conversation, but

Maybe we should do ”headers-only” by itself first (perhaps with a syntax like what you've proposed that has an obvious extension to encompass “body-only” too).
There will need to be a lot more detail about what the format of a headerless BAM or BCF file is. The obvious thing to do (for BAM) would be to have a BGZF stream that just starts with the first BAM record, but existing parsers will choke on that and it may or may not be the best approach — for one thing, the content is no longer self-identifying.

It may be that the best place for that detail would be in the respective BAM, CRAM, or BCF specifications rather than in the htsget spec.

jkbonfield · 2018-05-14T14:02:29Z

I feel the potential gain outweighs the purity of data-only not being well formed. The PR already states that normally both will be present for a well formed file, and gives an example where requesting body without header may be an appropriate thing to do (ie we already have the header and it's a sizeable overhead if we have a small query region). So I don't think this is a major issue. It's one we proposed very early on as I recall (I remember it from back when I was part of the htsget group).

I don't think we need any more for your part 2 other than to state that parsers may need to prepend the data stream with some previously downloaded headers. (I say "may" as the alternative is a parser than can save state and restart from just after the headers have been parsed.)

I would add however that we need to be clear which headers are involved. CRAM has multiple tiers - the SAM header, the container header, a compression header and a slice header. It's only the first which makes sense to omit.

jeromekelleher · 2018-05-14T19:53:03Z

@cyenyxe:

Does the proposal imply that (BAM, VCF, etc) headers are included in each data block by default, and excluded via a flag? If so, I think it could be easier to separate the header and the data URLs in the htsget response. That way repeated downloads would only occur if the client retrieves the contents of the header URL more than once.

This is a neat idea... I guess it comes down to whether we want to add a flag to the URL objects to indicate whether payloads are header or data objects. It puts some restrictions on servers though I guess, since they now have to be sure to separate the two. Also, in the case where a client just wants to get the header from some big file, the server is going to spend a lot of effort encoding hundreds of URLS into the JSON response which are just going to be ignored by the client. The client saying 'all I want is the header' up front seems a bit cleaner.

The major problem with the body-only part of this is, as I think we've discussed in the past, a file with no headers is not itself a valid BAM file; similarly for VCF and BCF files.

@jmarshall: This is true, but I think I agree with @jkbonfield here. If the client explicitly asks for body-only, then I think it's fair enough to assume that the client either knows how to consume the raw stream or knows that it should prepend an earlier downloaded header to the stream before passing it on to someone else.

It's a valid point though about whether we should bother with the body-only bit of this right now...

Re terminology: I don't have strong opinions about what the terms should be, and happy to go with what others think.

jeromekelleher · 2018-05-14T19:53:43Z

Pinging @jrobinso: would you mind taking a look at this proposal and seeing if it meets your requirements re fetching only headers or body data? I was mostly thinking about you as the potential user for this, so it'd be great to get your thoughts on it.

jrobinso · 2018-05-15T03:12:47Z

The major need here is to be able to fetch the header before doing a query, since that is the only way to know what sequences are present in the resource. The equivalent of a -H flag in samtools. WRT what to return for CRAM files, really on the "SAM" format type header is required. I haven't tested samtools -H with a cram file but I imagine that is what is returned.

Fetching only the body would be a performance improvement, not strictly required. I would point out that fetching just the "data" via a query is the standard behavior for samtools, you have to specifically ask for the header with a -h flag. I'm sure Picard has the equivalent Again this is a well and long established pattern, nothing new needs to be invented here. WRT what to return, if you decide to implement this part I would return exactly what samtools does, many pipelines and tools are already written that use samtools to do queries.

jrobinso · 2018-05-15T03:14:23Z

To refine and clarify, IGV really just needs the sequence dictionary. I suggested the entire header because that is what samtools does, but I don't use the rest of it. But for the sake of tools built around samtools I think you should do what it (samtools) does.

jeromekelleher · 2018-05-15T07:31:16Z

Thanks @jrobinso, this is very helpful.

jmarshall · 2018-05-16T15:37:01Z

@cyenyxe: That's a rather nice alternative approach.

Another alternative approach for the headers-only request is to follow the lead of our reference-retrieval colleagues, who have separate endpoints for sequences and just the metadata:

GET /sequence/<id>
GET /sequence/<id>/metadata

The exact analogue is more difficult for us because we don't constrain our <id>s — theirs are a single-word checksum whereas ours are likely a full path with / characters in it.

Edited to add: To avoid ambiguities due to our full-path ids, we could have endpoints like

GET /reads-metadata/<id>
GET /variants-metadata/<id>

but this is really no more satisfying than this PR's proposed query parameter approach.

jeromekelleher · 2018-05-16T16:50:43Z

In the meeting on 16/05 we decided to reduce the scope of this back to requesting header data only. I'll update the PR soon.

jeromekelleher · 2018-05-16T17:08:49Z

I've updated this to have a headers_only flag in the GET vars now, removing the headers and data booleans. This is just to show what we intend to do though, as I think we should get a PR in first with the definitions of header and data included, along with the extra flag in the URL objects telling clients what each URL is.

Once that's done, adding this headers_only flag will be a very simple addition.

jmarshall · 2018-06-05T10:55:27Z

Rather than headers_only=1 we could use a more generic parameter with a distinctive value: category=headers or something similar.

Then we would be able to reuse this parameter for other purposes too. For example PR #320 could use this as category=unplaced — though other opinions as to whether it is a good idea to unify these disparate concepts like this would be welcome.

jeromekelleher · 2018-06-06T16:26:07Z

Seems like a good idea to me @jmarshall. The word category doesn't feel quite right to me, but I don't have a better suggestion.

jmarshall · 2018-06-06T16:42:38Z

It doesn't feel quite right to me either, so let's all keep ruminating… 😄

If @cyenyxe's upcoming data-only PR adds to the JSON ticket in the way I'm vaguely anticipating, we could reuse its wording: I think that'll be class=header which might not feel quite right for this either…

jrobinso · 2018-06-07T14:09:51Z

Any of these suggestions would work for me, but I suggest you consider supporting the samtools options, specifically -H for this instance. Many tools are built on samtools and/or pysam. I imagine htsget implementations would be built around these as well.

…

On Wed, Jun 6, 2018 at 12:42 PM, John Marshall ***@***.***> wrote: It doesn't feel quite right to me either, so let's all keep ruminating… 😄 If @cyenyxe <https://github.com/cyenyxe>'s upcoming data-only PR adds to the JSON ticket in the way I'm vaguely anticipating, we could reuse its wording: I think that'll be class=header which might not feel quite right for this either… — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#311 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA49HBsQeAz0SeLt-eW50kYbP3dRqn3aks5t6AaCgaJpZM4T80zC> .

jrobinso · 2018-06-07T14:16:55Z

Sorry I hit send prematurely. My point is that supporting samtools flag equivalents would satisfy the needs of tools built on samtools (or pysam) by definition, and should be trivial to implement for htsget implementations built on samtools or htslib. This is a large number of potential htsget clients. I suppose you could make the same argument for Picard flags, I am just less familiar with those. However, for IGV purposes any option that would return the sequence dictionary in some form would suffice. Any of these suggestions would work for me, but I suggest you consider

…

supporting the samtools options, specifically -H for this instance. Many tools are built on samtools and/or pysam. I imagine htsget implementations would be built around these as well.

jmarshall · 2018-06-07T14:31:01Z

This is about how to specify these things in the underlying HTTP-based protocol.

When samtools is being used as a htsget client, it would be able to internally translate -H being specified into adding &class=header or whatever onto the end of the URL it was requesting and so take advantage of only transmitting the headers. (At the moment it wouldn't do that because it wouldn't know ahead of time that a given URL was an htsget URL, but there could be ways of telling it that — such as command-line option.)

Similarly Picard could translate its options into adding the appropriate bits and pieces onto htsget URLs when htsget is in use.

jrobinso · 2018-06-07T18:43:51Z

I wasn't thinking about samtools or Picard as the client, but other tools built on them that already know, for example, how to process the results returned from "-H". To ease adoption of htsget it would be helpful if it had an equivalent flag that returned the same things. That's all, nothing complicated, and probably not a big deal either way.

…

On Thu, Jun 7, 2018 at 10:31 AM, John Marshall ***@***.***> wrote: This is about how to specify these things in the underlying HTTP-based protocol. When samtools is being used as a htsget client, it would be able to internally translate -H being specified into adding &class=header or whatever onto the end of the URL it was requesting and so take advantage of only transmitting the headers. (At the moment it wouldn't do that because it wouldn't know ahead of time that a given URL was an htsget URL, but there could be ways of telling it that — such as command-line option.) Similarly Picard could translate its options into adding the appropriate bits and pieces onto htsget URLs when htsget is in use. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#311 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA49HOe42jWZlib0HtQTZxPHzAMyZzwTks5t6TkpgaJpZM4T80zC> .

jeromekelleher · 2018-06-13T16:38:17Z

In the interest of simplicity we're closing this PR in favour of #322.

jmarshall added the htsget label May 14, 2018

Header and data boolean flags for htsget.

5358ff0

jeromekelleher force-pushed the htsget-headers branch from e5888af to 5358ff0 Compare May 16, 2018 17:05

jmarshall mentioned this pull request Jun 5, 2018

htsget: specify how to request unmapped (unplaced) reads #320

Merged

jeromekelleher mentioned this pull request Jun 11, 2018

Added class to URLs in the response #322

Merged

jeromekelleher closed this Jun 13, 2018

jeromekelleher deleted the htsget-headers branch June 13, 2018 16:38

This was referenced Jun 12, 2021

htsget should require that returned SAM/VCF headers include @SQ/##contig #578

Open

Is the variants endpoint of htsget protocal supported? igvteam/igv.js#1187

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Header and data boolean flags for htsget. #311

Header and data boolean flags for htsget. #311

jeromekelleher commented May 13, 2018

cyenyxe commented May 13, 2018 •

edited

Loading

jmarshall commented May 14, 2018

jmarshall commented May 14, 2018

jkbonfield commented May 14, 2018

jeromekelleher commented May 14, 2018

jeromekelleher commented May 14, 2018

jrobinso commented May 15, 2018

jrobinso commented May 15, 2018

jeromekelleher commented May 15, 2018

jmarshall commented May 16, 2018 •

edited

Loading

jeromekelleher commented May 16, 2018

jeromekelleher commented May 16, 2018

jmarshall commented Jun 5, 2018

jeromekelleher commented Jun 6, 2018

jmarshall commented Jun 6, 2018

jrobinso commented Jun 7, 2018 via email

jrobinso commented Jun 7, 2018 via email

jmarshall commented Jun 7, 2018

jrobinso commented Jun 7, 2018 via email

jeromekelleher commented Jun 13, 2018

Header and data boolean flags for htsget. #311

Header and data boolean flags for htsget. #311

Conversation

jeromekelleher commented May 13, 2018

cyenyxe commented May 13, 2018 • edited Loading

jmarshall commented May 14, 2018

jmarshall commented May 14, 2018

jkbonfield commented May 14, 2018

jeromekelleher commented May 14, 2018

jeromekelleher commented May 14, 2018

jrobinso commented May 15, 2018

jrobinso commented May 15, 2018

jeromekelleher commented May 15, 2018

jmarshall commented May 16, 2018 • edited Loading

jeromekelleher commented May 16, 2018

jeromekelleher commented May 16, 2018

jmarshall commented Jun 5, 2018

jeromekelleher commented Jun 6, 2018

jmarshall commented Jun 6, 2018

jrobinso commented Jun 7, 2018 via email

jrobinso commented Jun 7, 2018 via email

jmarshall commented Jun 7, 2018

jrobinso commented Jun 7, 2018 via email

jeromekelleher commented Jun 13, 2018

cyenyxe commented May 13, 2018 •

edited

Loading

jmarshall commented May 16, 2018 •

edited

Loading