Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htsget should require that returned SAM/VCF headers include @SQ/##contig #578

Open
brainstorm opened this issue Jun 12, 2021 · 14 comments
Open
Labels

Comments

@brainstorm
Copy link
Contributor

See igvteam/igv.js#1187 for context.

TL;DR:

The basic problem is there is no way to discover what referenceNames are present in a dataset, and calling the service with a wrong reference name is a 400 error! This is a very nasty thing to do to clients. As everyone knows there are 2 common reference naming conventions, "chr1" and "1", and using the "wrong" one WRT any given dataset will throw an error. Ditto, I assume, querying for a genomic range that is outside of the dataset. This is not an error IMO, there's simply nothing "there" so an object noting an empty result should be returned.

/cc @mlin

@jmarshall
Copy link
Member

The basic problem is there is no way to discover what referenceNames are present in a dataset

As noted in #311 (comment) which motivated their addition, class=header requests provide this referenceName discovery mechanism.

@jrobinso
Copy link

@jmarshall The "class-header" option works great for reads, thanks for that, but how does it work for variants? AFAIK the only required line in a VCF header is the format directive, and I don't recall seeing a VCF with referenceNames listed in the header.

@jmarshall
Copy link
Member

jmarshall commented Jun 12, 2021

The relevant VCF header is ##contig (§1.4.7 in the v4.3 spec), which “is recommended for VCF, and required for BCF”. In my experience, it's pretty commonly supplied in the wild (though perhaps with no more additional subfields than just ##contig=<ID=chr1>).

I think there has been talk of making ##contig headers required in future VCF versions, but I can't find any reference to that in hts-specs issues or PRs.

You are correct that it would behoove htsget to specify that the VCF headers returned by a class=header request (or perhaps any variant request) SHOULD/MUST include ##contig headers, even if VCF itself does not require that in general.

IMHO that would be a more practical approach than inventing another endpoint and requiring servers to divine this information from the underlying data in some other way than looking at these headers. (Note also that something similar applies for reads: in SAM (only) it's valid for @SQ headers to be omitted, though it's very uncommon as e.g. samtools barely supports it. Htsget returns only in BAM or CRAM, so no issue due to lack of @SQ text headers arises.)

@jmarshall jmarshall reopened this Jun 12, 2021
@jmarshall jmarshall changed the title htsget should have an endpoint that enumerates refNames for clients? htsget should require that returned SAM/VCF headers include @SQ/##config Jun 12, 2021
@jmarshall jmarshall changed the title htsget should require that returned SAM/VCF headers include @SQ/##config htsget should require that returned SAM/VCF headers include @SQ/##contig Jun 12, 2021
@jrobinso
Copy link

@jmarshall Thanks, that would work for me (##contig). I just need the names, so even the short form would work. I will build in workarounds if they are missing, which obviously I have already done for alignments.

@jrobinso
Copy link

@jmarshall BTW servers must already be "divining" the reference names in a dataset since they are supposed to return an error if a request is made for a reference name that is not present. This was at the root of my requests from the beginning, if you are going to return an error shouldn't there be a way to determine what is legal in advance? In any event I have long implemented workarounds, even without the "header" option, so you don't need to keep this open for me. Thanks for your responses.

@jmarshall
Copy link
Member

BTW servers must already be "divining" the reference names in a dataset since they are supposed to return an error if a request is made for a reference name that is not present.

File-based servers will typically be using a bai/csi/tbi index to return results for requested referenceNames and coordinates; doing such an individual lookup by name is a typical operation on such an index. Enumerating all the available referenceNames is not a typical operation on these indexes (and is not possible for some indexes), so would require additional implementation. This is what I mean by additional divination being required.

Anyway, for the 1 vs chr1 problem, the real solution is for VCF to gain an equivalent of SAM's @SQ-AN field and for both to become widely deployed…

This was at the root of my requests from the beginning, if you are going to return an error shouldn't there be a way to determine what is legal in advance?

As you know, there has been a way to determine that for a little over two years.

@jrobinso
Copy link

jrobinso commented Jun 15, 2021

@jmarshall It's been more than 2 years since I last worked on this until recently, so I'm getting back up to speed. Yes there is a way for BAM files, but not for VCF, the optional ##contig header not withstanding. Currently I'm handling it this way, if there's a mismatch on a VCF file that lacks the ##contig header too bad, its just going to be an error.

@jmarshall
Copy link
Member

jmarshall commented Jun 15, 2021

There's a way for BAM files, but not for VCF

Please be explicit. What do you consider is lacking for VCF? (Other than the minor “my VCF file does not have ##contig headers” issue that this issue now represents. I believe such VCF files are unusual; the common case is for these headers to be present.)

@jrobinso
Copy link

@jmarshall Nothing, that's it, the optional ##contig header. I'm sure you're right, they are unusual, unusual files are way over represented in my world because they generally end up as IGV help tickets. In this case I'll wait for such a ticket and just assume ##contig will always be there.

@jrobinso
Copy link

Totally confused about what issue I'm responding to via email, sorry for confusion. All is good for now, if htsget is extended to other formats this might need revisited.

@jmarshall
Copy link
Member

Header requests work whether or not the files contain @SQ/##contig headers; they just may not supply the information the client is looking for, but that can be considered to be not htsget's fault. So IMHO htsget should not make hard requirements around the files it serves, but it would be useful to clarify all this by adding some (non-normative) text expanding on what header requests can be used for — e.g. something like

Requests for headers return a data stream containing read or variant headers in the format specified by format, if present, or in the respective default format. They can, for example, be used to discover the list of referenceName values applicable for this data stream, provided the data stream contains @SQ or ##contig headers.

@cmdcolin
Copy link
Contributor

cmdcolin commented Jul 7, 2021

Just as a random context, in non-htsget with tabix'ed VCF files, we used the names inside the VCF tabix file for discovery of reference sequence names. Indexes would not be available in htsget of course, but it would be nice to have some expectation (if not guarantee) that the refnames would be in a header request

@jmarshall
Copy link
Member

It would be possible to invent another form of request (e.g. class=referenceNames?) that would return the names as divined by some other method such as from tabix indexes. But htsget is parsimonious and tries to avoid invention, hence used the already-present data stream headers to enable referenceName discovery to be done, as that was considered to be the minimal design.

The reasons for adding text like that are

  1. Clarify to authors of clients that the response is in format not just textual SAM (but see also Perhaps allow format=SAM for requests on the reads endpoint #580)
  2. Jog readers' memory about ##contig headers providing this information in VCF
  3. Strongly hint to the operators of htsget servers that the variant data streams presented by their services ought to include ##contig headers.

Item 3 represents the expectation you mention; YMMV how best to express that in spec text.

@daisieh
Copy link

daisieh commented Sep 16, 2021

I'm imagining two forms of 400 error that this could be applied to:

  • InvalidHeaderFormat: header does not include information about contigs or references
  • InvalidReferenceName: requested reference name is not included in <pysam.VariantFile.VariantHeader.contigs or similar for alignments>

I can't find any indication that there is a easy way to access just the reference_names from an AlignmentHeader, but it would presumably be easy to implement in any htsget server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants