Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header and data boolean flags for htsget. #311

Closed

Conversation

jeromekelleher
Copy link
Member

This is a rough draft of the proposed change to allow htsget conditionally exclude either header or data body information. This solves two immediate problems:

  1. Clients can request just the header from a particular ID allowing them to find out what reference names are in there. At the moment, the client needs to "know" that a particular BAM has "chr1" in there to make queries.
  2. Clients can request just the data body and skip the headers. This is important for read browsers, as the overhead of repeatedly pulling down the headers is significant.

Issues:

  • I decided to go with a pair of booleans rather than an enum, as it seemed cleaner. Entirely open to suggestions otherwise here.
  • I felt I needed to add in a section at the top of the file trying to make it a bit clearer what we mean by 'header' and 'data'. The terminology is unfortunate here as we're using 'headers' repeatedly elsewhere to refer the HTTP headers, and this is definitely confusing. 'Data' is also tremendously vague. Any suggestions for better names would be much appreciated.
  • I haven't dealt with the case of whether we should make it part of the spec that servers MUST support this feature. I think it's a lot less useful to clients if they have to build in the logic of needing to check whether the server ignored their request to drop the headers of not, so I'm inclined to say we should make this a hard-requirement for conformance to the spec. It doesn't seem like a very hard feature for servers to implement, so this seems OK to me. I'm curious to see what others think though.

@cyenyxe
Copy link
Member

cyenyxe commented May 13, 2018

Does the proposal imply that (BAM, VCF, etc) headers are included in each data block by default, and excluded via a flag? If so, I think it could be easier to separate the header and the data URLs in the htsget response. That way repeated downloads would only occur if the client retrieves the contents of the header URL more than once.

The terminology is unfortunate here as we're using 'headers' repeatedly elsewhere to refer the HTTP headers, and this is definitely confusing. 'Data' is also tremendously vague. Any suggestions for better names would be much appreciated.

Maybe 'metadata' instead of 'headers', and 'body' instead of 'data'? Specifying 'HTTP headers' instead of just 'headers' could help.

@jmarshall
Copy link
Member

Terminology-wise, within the files we often talk about headers vs. records. So:

  • HTTP headers
  • SAM/VCF headers / data headers(?)
  • reads / variants / SAM/VCF records / alignment records / variant records

@jmarshall
Copy link
Member

The major problem with the body-only part of this is, as I think we've discussed in the past, a file with no headers is not itself a valid BAM file; similarly for VCF and BCF files.

“Headers-only” is just a BAM/CRAM/BCF file that has no records, which is trivially valid. So that's easy.

The converse is not true, which is why we've punted on it in the past. It may finally be time to have that conversation, but

  1. Maybe we should do ”headers-only” by itself first (perhaps with a syntax like what you've proposed that has an obvious extension to encompass “body-only” too).

  2. There will need to be a lot more detail about what the format of a headerless BAM or BCF file is. The obvious thing to do (for BAM) would be to have a BGZF stream that just starts with the first BAM record, but existing parsers will choke on that and it may or may not be the best approach — for one thing, the content is no longer self-identifying.

    It may be that the best place for that detail would be in the respective BAM, CRAM, or BCF specifications rather than in the htsget spec.

@jkbonfield
Copy link
Contributor

I feel the potential gain outweighs the purity of data-only not being well formed. The PR already states that normally both will be present for a well formed file, and gives an example where requesting body without header may be an appropriate thing to do (ie we already have the header and it's a sizeable overhead if we have a small query region). So I don't think this is a major issue. It's one we proposed very early on as I recall (I remember it from back when I was part of the htsget group).

I don't think we need any more for your part 2 other than to state that parsers may need to prepend the data stream with some previously downloaded headers. (I say "may" as the alternative is a parser than can save state and restart from just after the headers have been parsed.)

I would add however that we need to be clear which headers are involved. CRAM has multiple tiers - the SAM header, the container header, a compression header and a slice header. It's only the first which makes sense to omit.

@jeromekelleher
Copy link
Member Author

@cyenyxe:

Does the proposal imply that (BAM, VCF, etc) headers are included in each data block by default, and excluded via a flag? If so, I think it could be easier to separate the header and the data URLs in the htsget response. That way repeated downloads would only occur if the client retrieves the contents of the header URL more than once.

This is a neat idea... I guess it comes down to whether we want to add a flag to the URL objects to indicate whether payloads are header or data objects. It puts some restrictions on servers though I guess, since they now have to be sure to separate the two. Also, in the case where a client just wants to get the header from some big file, the server is going to spend a lot of effort encoding hundreds of URLS into the JSON response which are just going to be ignored by the client. The client saying 'all I want is the header' up front seems a bit cleaner.

The major problem with the body-only part of this is, as I think we've discussed in the past, a file with no headers is not itself a valid BAM file; similarly for VCF and BCF files.

@jmarshall: This is true, but I think I agree with @jkbonfield here. If the client explicitly asks for body-only, then I think it's fair enough to assume that the client either knows how to consume the raw stream or knows that it should prepend an earlier downloaded header to the stream before passing it on to someone else.

It's a valid point though about whether we should bother with the body-only bit of this right now...

Re terminology: I don't have strong opinions about what the terms should be, and happy to go with what others think.

@jeromekelleher
Copy link
Member Author

Pinging @jrobinso: would you mind taking a look at this proposal and seeing if it meets your requirements re fetching only headers or body data? I was mostly thinking about you as the potential user for this, so it'd be great to get your thoughts on it.

@jrobinso
Copy link

The major need here is to be able to fetch the header before doing a query, since that is the only way to know what sequences are present in the resource. The equivalent of a -H flag in samtools. WRT what to return for CRAM files, really on the "SAM" format type header is required. I haven't tested samtools -H with a cram file but I imagine that is what is returned.

Fetching only the body would be a performance improvement, not strictly required. I would point out that fetching just the "data" via a query is the standard behavior for samtools, you have to specifically ask for the header with a -h flag. I'm sure Picard has the equivalent Again this is a well and long established pattern, nothing new needs to be invented here. WRT what to return, if you decide to implement this part I would return exactly what samtools does, many pipelines and tools are already written that use samtools to do queries.

@jrobinso
Copy link

To refine and clarify, IGV really just needs the sequence dictionary. I suggested the entire header because that is what samtools does, but I don't use the rest of it. But for the sake of tools built around samtools I think you should do what it (samtools) does.

@jeromekelleher
Copy link
Member Author

Thanks @jrobinso, this is very helpful.

@jmarshall
Copy link
Member

jmarshall commented May 16, 2018

@cyenyxe: That's a rather nice alternative approach.

Another alternative approach for the headers-only request is to follow the lead of our reference-retrieval colleagues, who have separate endpoints for sequences and just the metadata:

GET /sequence/<id>
GET /sequence/<id>/metadata

The exact analogue is more difficult for us because we don't constrain our <id>s — theirs are a single-word checksum whereas ours are likely a full path with / characters in it.

Edited to add: To avoid ambiguities due to our full-path ids, we could have endpoints like

GET /reads-metadata/<id>
GET /variants-metadata/<id>

but this is really no more satisfying than this PR's proposed query parameter approach.

@jeromekelleher
Copy link
Member Author

In the meeting on 16/05 we decided to reduce the scope of this back to requesting header data only. I'll update the PR soon.

@jeromekelleher
Copy link
Member Author

I've updated this to have a headers_only flag in the GET vars now, removing the headers and data booleans. This is just to show what we intend to do though, as I think we should get a PR in first with the definitions of header and data included, along with the extra flag in the URL objects telling clients what each URL is.

Once that's done, adding this headers_only flag will be a very simple addition.

@jmarshall
Copy link
Member

Rather than headers_only=1 we could use a more generic parameter with a distinctive value: category=headers or something similar.

Then we would be able to reuse this parameter for other purposes too. For example PR #320 could use this as category=unplaced — though other opinions as to whether it is a good idea to unify these disparate concepts like this would be welcome.

@jeromekelleher
Copy link
Member Author

Seems like a good idea to me @jmarshall. The word category doesn't feel quite right to me, but I don't have a better suggestion.

@jmarshall
Copy link
Member

It doesn't feel quite right to me either, so let's all keep ruminating… 😄

If @cyenyxe's upcoming data-only PR adds to the JSON ticket in the way I'm vaguely anticipating, we could reuse its wording: I think that'll be class=header which might not feel quite right for this either…

@jrobinso
Copy link

jrobinso commented Jun 7, 2018 via email

@jrobinso
Copy link

jrobinso commented Jun 7, 2018 via email

@jmarshall
Copy link
Member

This is about how to specify these things in the underlying HTTP-based protocol.

When samtools is being used as a htsget client, it would be able to internally translate -H being specified into adding &class=header or whatever onto the end of the URL it was requesting and so take advantage of only transmitting the headers. (At the moment it wouldn't do that because it wouldn't know ahead of time that a given URL was an htsget URL, but there could be ways of telling it that — such as command-line option.)

Similarly Picard could translate its options into adding the appropriate bits and pieces onto htsget URLs when htsget is in use.

@jrobinso
Copy link

jrobinso commented Jun 7, 2018 via email

@jeromekelleher
Copy link
Member Author

In the interest of simplicity we're closing this PR in favour of #322.

@jeromekelleher jeromekelleher deleted the htsget-headers branch June 13, 2018 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants