Is the variants endpoint of the htsget protocol supported?
#1187
Comments
There is no htsget support for variants; if it works, it's an accident. Are there any servers out there? htsget has been the definition of a slow burner. BTW the "ga4gh" API is not supported anymore; that's the old ga4gh schema, not htsget. |
Yes, there are: https://htsget.ga4gh.org/variants/service-info |
@brainstorm Sorry maybe I wasn't clear. Are there any public endpoints I can use? I don't see how the json you reference addresses the question but maybe I'm missing something. I'm not interested in running a server myself. |
Fair point, plus it seems like the documentation link on that JSON result is 404'ing (/cc @jb-adams), which makes understanding how those endpoints work unnecessarily cumbersome. A clear and stable example that can be accessed easily from the service-info endpoint would help casual developers on the public server a ton, @jb-adams. That being said, here you have an example of use while the service-info documentation for this public endpoint gets fixed: https://htsget.ga4gh.org/variants/1000genomes.phase1.chrY The way I got to that test URL is by looking at the htsget-refserver config over here: Notice that the regexp allows you to query other chromosomes. Hope it's clearer now? This should help test #1344 and fix our UMCCR backend in umccr/data-portal-client#64. |
@brainstorm That's helpful, but I don't know how you can derive that from the info JSON. As far as I know there are only 2 services described in the API, reads and variants (http://samtools.github.io/hts-specs/htsget.html). If the endpoint you list above follows the recommended pattern I suppose the "id" of the data is What I'm really looking for now is a "reads" endpoint that is stable and that I can test against. I think Google had one at one time but I can't find it anymore. I might guess that |
Yeah, the reads endpoint as you were getting at follows pretty much the same logic with regexps from the config file I referred to. I.e, check this one for reads: https://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1 I had exactly the same questions with the public htsget reference server, so it clearly needs a better/clearer way to onboard new htsget client developers, /cc @jb-adams |
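For anyone constructing test queries against the two public URLs above, here is a minimal sketch of how a ticket URL could be assembled from an endpoint, a dataset id and the query parameters defined in the htsget spec (`format`, `referenceName`, `start`, `end`, `class`). `buildHtsgetUrl` is a hypothetical helper, not an igv.js or htsget-refserver function, and the `referenceName: 'Y'` value is an assumption about how that dataset names its contigs.

```javascript
// Sketch only: buildHtsgetUrl is a hypothetical helper, not part of igv.js
// or htsget-refserver.
function buildHtsgetUrl(endpoint, id, params = {}) {
  // Make sure exactly one slash separates the endpoint from the dataset id.
  const base = endpoint.endsWith('/') ? endpoint : endpoint + '/';
  const url = new URL(base + encodeURIComponent(id));
  // Append only the parameters that were actually supplied.
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// Ticket URLs for the two datasets mentioned in this thread.
const variantsUrl = buildHtsgetUrl(
  'https://htsget.ga4gh.org/variants',
  '1000genomes.phase1.chrY',
  // referenceName 'Y' is an assumption; adjust it to the dataset's naming.
  { format: 'VCF', referenceName: 'Y', start: 0, end: 100000 }
);
const readsHeaderUrl = buildHtsgetUrl(
  'https://htsget.ga4gh.org/reads/',
  'giab.NA12878.NIST7086.1',
  { format: 'BAM', class: 'header' }
);
```

Requesting either URL should return a JSON ticket rather than the data itself; the data comes from following the URLs listed inside the ticket.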
OK, that worked; finally something to test against. It's been more than 3 years since that was implemented and there isn't much evidence of it being used, but it's nice to get a minimal test for reads working again. |
The documentation page is now being hosted on the same server as the htsget service itself. Please try this new documentation link: https://htsget.ga4gh.org/docs/index.html The page gives a breakdown of "Reads Datasets" and "Variants Datasets" and some of the ids the reference server supports. Please let me know if this documentation is sufficient to construct test queries, or if more info is needed. |
@jb-adams, the service info endpoint documentation URL is still 404'ing: Would you mind redeploying the official GA4GH public htsget server according to the new docs to avoid confusion? |
@brainstorm good catch :) I will try to redeploy the server today |
@brainstorm @jb-adams Thanks for the pointers to test data and variant spec. I'm looking at the variant service now. Right away I notice it has the same, in my opinion, fundamental problem as the reads service. I raised this years ago, before it was even part of GA4GH. It was going to be discussed but I never learned of the results if any. The basic problem is there is no way to discover what referenceNames are present in a dataset, and calling the service with a wrong reference name is a 400 error! This is a very nasty thing to do to clients. As everyone knows there are 2 common reference naming conventions, "chr1" and "1", and using the "wrong" one WRT any given dataset will throw an error. Ditto, I assume, querying for a genomic range that is outside of the dataset. This is not an error IMO, there's simply nothing "there" so an object noting an empty result should be returned. The example, BTW, uses "chr1" but that is going to give a 400 error with any of the example datasets. I realize this isn't the right forum to raise this. Where would the right forum be? |
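A client-side workaround for the "chr1" versus "1" problem described above can be sketched as follows: try the name as given, then the other convention, and treat a 400 (or 404) as "try the next alias". Both function names are hypothetical, the status codes handled are an assumption about how a given server reports an unknown reference, and this is not how igv.js is known to do it.

```javascript
// Sketch only: hypothetical helpers illustrating an alias fallback.
// Returns the name as given, followed by its counterpart in the other
// common naming convention ("chr1" <-> "1", "chrM" <-> "MT").
function referenceNameCandidates(name) {
  const candidates = [name];
  if (name.startsWith('chr')) {
    const bare = name.slice(3);
    candidates.push(bare === 'M' ? 'MT' : bare);
  } else {
    candidates.push(name === 'MT' ? 'chrM' : 'chr' + name);
  }
  return candidates;
}

// Requests a ticket, retrying with the alternate reference name when the
// server rejects the first one. fetchFn is injectable for testing.
async function fetchTicketWithAliases(baseUrl, referenceName, fetchFn = fetch) {
  for (const candidate of referenceNameCandidates(referenceName)) {
    const url = new URL(baseUrl);
    url.searchParams.set('referenceName', candidate);
    const response = await fetchFn(url.toString());
    if (response.ok) {
      return { referenceName: candidate, ticket: await response.json() };
    }
    // Assumption: an unknown reference name is reported as 400 or 404.
    if (response.status !== 400 && response.status !== 404) {
      throw new Error(`htsget request failed with status ${response.status}`);
    }
  }
  return null; // neither naming convention was accepted
}
```

This costs an extra round trip for every mismatch, which is part of why discoverable reference names would be preferable.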
@jb-adams, where is the official doc? The 1.4.1 page you reference above does not seem to describe a "ticket". Coming in through the GA4GH home page I get a pointer to https://samtools.github.io/hts-specs/htsget.html which does describe a "ticket". I'm finding strange behavior with the reference variant service and don't know if it's a misunderstanding on my part of the spec, a bug in the reference server, or both. Does the reference server have a git project where issues can be raised? A couple of confusing things: I'm getting different "tickets" depending on whether or not I include a reference range in the initial query. I would expect to get back, perhaps, more URLs if I do not than if I do, but they are actually different in form. The second problem I have is retrieving the VCF header; the parameter "class=header" does not seem to work. Anyway this isn't the place to discuss that, but where is? Thanks for all your hard work! |
@jb-adams final info for this evening, this is what I'm trying to get working
Just can't get there. Following the URLs from the ticket gets me a portion of the VCF file containing that region, with random bits before and after, but no VCF header. It's almost like it's just a raw tabix query. The "class=header" directive doesn't seem to do anything. |
@jrobinso, please see if this helps.
|
It depends, for the htsget spec itself, I've just opened samtools/hts-specs#578 for your remarks and good points about refName discoverability for clients. For things that pertain to the particular reference implementation of htsget, I'd head to: https://github.com/ga4gh/htsget-refserver I agree with you that htsget should allow some sort of basic discovery/enumeration of refNames, I'm just wondering to what extent we should include this as a basic (and incomplete) mechanism for refNames only or if there'll be feature creep with other facets of the formats and datasets that would be interesting to make them discoverable... and the multiple rabbit holes that'd entail. |
@jrobinso wrote:
As you've realised in this second comment, If Making a header query gave me approximately the sort of ticket I was expecting:
i.e., a single URL representing the VCF headers.
returns text VCF headers as expected. (It's slightly odd that the different URL returned within the ticket still has |
@jmarshall thanks for the response, and for the "class=header" implementation. Yes, I did request that, and it's an indirect way to get reference names for alignments, which works perfectly for me. However the reference names are not in a VCF header, at least they aren't required to be. |
@victorskl Thanks, yes that is helpful. I will give this another try next week. |
@brainstorm re "rabbit holes", a call with a referenceName that isn't in the dataset is an actual error; if it's an error, the legal names should be discoverable. I've long had a workaround in place obviously, and there is a solution now for BAM/CRAM with the header option, so I'll let this one die. |
For header only, I tried this way and it works for me, too.
Yep; I reckon this is due to the underlying dataset itself?
i.e. note in filename
Would you mind trying
And this dataset works better, I reckon? At least, we use this variant dataset for the Htsget + Passport experiment. HTH |
I have this working via node unit tests, however it's not working in the browser because there don't seem to be CORS headers on the responses, at least for this URL. @jb-adams I will raise an issue on the test server repo
|
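One way to confirm the missing CORS headers outside the browser is to look for `Access-Control-Allow-Origin` on a response, sketched below. The helper names are hypothetical, and it is an assumption that the runtime's `fetch` (Node 18+ here) forwards a caller-supplied `Origin` header; browsers set that header themselves and do not allow overriding it.

```javascript
// Sketch only: hypothetical helpers for checking a response for CORS headers.
// headers is a Fetch API Headers object, whose lookups are case-insensitive.
function allowsOrigin(headers, origin) {
  const allowed = headers.get('access-control-allow-origin');
  return allowed === '*' || allowed === origin;
}

// Issues a request carrying an Origin header and reports whether the
// response would be readable by a browser page served from that origin.
// Assumption: the runtime's fetch passes the Origin header through.
async function checkCors(url, origin, fetchFn = fetch) {
  const response = await fetchFn(url, { headers: { Origin: origin } });
  return allowsOrigin(response.headers, origin);
}
```

Note that both the ticket endpoint and every URL inside the ticket need these headers for a browser client to work.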
Variant support has been added. The configuration below should work once CORS is implemented on the reference server.
|
@jb-adams Time to review and merge ga4gh/htsget-refserver#24 ? :) |
@victorskl @brainstorm I'm not sure I'm happy with the form of the config above. I'm mulling over changing it to a complete URL string, rather than a URL + ID; this would be more consistent with other URL-based sources. I would of course try to figure out how to do it in a backward compatible way, but with almost no servers extant I think @victorskl might be the only actual user of htsget through igv.js at the moment. Any thoughts? So for example.
|
Just reminder, currently is I think, this is ok. I would say, it will become a complete |
I will support the following options going forward, not backward compatible but we're allowed to break it once.
{
  type: 'alignment',
  sourceType: 'htsget',
  format: 'bam',
  url: 'https://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1',
  name: 'NA12878'
}
{
  type: 'alignment',
  sourceType: 'htsget',
  format: 'bam',
  endpoint: 'https://htsget.ga4gh.org/reads/',
  id: 'giab.NA12878.NIST7086.1',
  name: 'NA12878'
}
|
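The two config forms above should describe the same resource. A minimal sketch of how a client might reduce either form to a single ticket URL is shown here; `resolveHtsgetUrl` is a hypothetical name, and this is an illustration of the idea rather than the actual igv.js implementation.

```javascript
// Sketch only: resolveHtsgetUrl is a hypothetical helper, not igv.js code.
// Accepts either { url } or { endpoint, id } and returns the ticket URL.
function resolveHtsgetUrl(config) {
  if (config.url) {
    return config.url;
  }
  if (config.endpoint && config.id) {
    // Tolerate endpoints written with or without a trailing slash.
    const endpoint = config.endpoint.endsWith('/')
      ? config.endpoint
      : config.endpoint + '/';
    return endpoint + config.id;
  }
  throw new Error('htsget config needs either "url" or both "endpoint" and "id"');
}

const fromUrl = resolveHtsgetUrl({
  url: 'https://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1'
});
const fromParts = resolveHtsgetUrl({
  endpoint: 'https://htsget.ga4gh.org/reads/',
  id: 'giab.NA12878.NIST7086.1'
});
console.log(fromUrl === fromParts);
```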
| @jb-adams Time to review and merge ga4gh/htsget-refserver#24 ? :) I merged this into e.g. a Is this something I need to configure on my end? I am simply running the new build with all defaults (no config) |
Yes, any issues with the reference server can be raised here -> https://github.com/ga4gh/htsget-refserver/issues. If it's an issue with the htsget spec itself (such as no way to discover referenceNames), then that issue is best raised on the hts-specs repo (where the htsget spec is housed) -> https://github.com/samtools/hts-specs/issues The reference server aims to stay closely aligned with the spec, and not implement any features that are not described in the spec
So long as you are able to concatenate each filepart returned by each ticket, and the final concatenated result is a valid VCF, this is expected behaviour. Non-reference range queries are simpler to process, and generally the server just refers the client to the data source (such as an S3 URL), because no VCF processing is required. For reference range requests, the server itself needs to process and stream back only the requested ranges, and the server does this by splitting tickets according to genomic reference.
This may be an error with the server, and it would be good to raise an issue on the repo |
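The "concatenate each filepart" step described above can be sketched as follows, assuming the ticket layout from the htsget spec: an `htsget.urls` array whose entries carry a `url` (either an http(s) URL or an inline `data:` URI) and optional `headers` to send with the request. The function names are hypothetical, and `fetch` and `atob` are assumed to be available as globals (browsers, Node 18+).

```javascript
// Sketch only: hypothetical helpers for assembling the parts of a ticket.
// Decodes an inline data: URI (base64 or percent-encoded) into bytes.
function decodeDataUri(uri) {
  const comma = uri.indexOf(',');
  if (!uri.startsWith('data:') || comma < 0) {
    throw new Error('Not a data: URI');
  }
  const meta = uri.slice(5, comma);
  const payload = uri.slice(comma + 1);
  if (meta.endsWith(';base64')) {
    return Uint8Array.from(atob(payload), (ch) => ch.charCodeAt(0));
  }
  return new TextEncoder().encode(decodeURIComponent(payload));
}

// Joins byte chunks into one contiguous array, preserving order.
function concatChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const result = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    result.set(chunk, offset);
    offset += chunk.length;
  }
  return result;
}

// Fetches every part listed in the ticket, in order, and concatenates them.
// Parts are requested sequentially to keep the sketch simple.
async function fetchTicketData(ticket, fetchFn = fetch) {
  const chunks = [];
  for (const part of ticket.htsget.urls) {
    if (part.url.startsWith('data:')) {
      chunks.push(decodeDataUri(part.url));
    } else {
      const response = await fetchFn(part.url, { headers: part.headers || {} });
      if (!response.ok) {
        throw new Error(`Part request failed with status ${response.status}`);
      }
      chunks.push(new Uint8Array(await response.arrayBuffer()));
    }
  }
  return concatChunks(chunks);
}
```

The result is whatever the server's parts add up to, so it may still be compressed (for example bgzipped) and need decompressing before parsing.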
@jmarshall @jrobinso I'm not quite seeing why the
If I construct a request in Postman using the provided URL and apply all 3 |
@jb-adams I think that was user error on my part; @jmarshall might have found a different issue. I had 2 expectations that were false: (1) I expected to receive the header, not a "ticket" for the header, (2) I expected to receive the header as text, not as bgz compressed data. It's now working fine with my newly calibrated expectations, well, with the exception of the CORS issue raised elsewhere. |
@jb-adams RE the different URLs for whole file vs range, a word of caution. Many organizations strip range headers from outgoing client requests; if the entire file is requested, a URL for the entire file (rather than multiple parts with range requests) will have a higher chance of success. We actually run a service for IGV desktop to work around this, it's so common, and the service still gets many hits. |
@jb-adams: This was user error on my part too, as described in an update to #1187 (comment). This was my fault, but for pedagogical purposes you might consider making the different URLs used by the example server a bit more visually obviously different! @jrobinso: We considered and rejected having this request return the header directly rather than via a ticket; see samtools/hts-specs#322 (comment) onwards. The htslib client implementation sniffs the returned data to determine whether it is a ticket, but other clients may not; so it was considered that clients making an htsget request — therefore expecting a ticket — would be reasonably surprised and disgruntled to receive something other than a ticket! My curl command did receive the header as text, and when I asked for BCF I got “file format: 'BCF' not supported”. Was what you got either binary BCF data after decompression or definitely bgzipped? Might it instead be that the text response had been plain gzipped by a proxy on the way back to you or something like that? (If so, that's still something unexpected that the spec might need to mention…) As for range headers: refget uses range headers quite heavily, and one anticipated htsget server implementation is to return a ticket with an array of urls pointing to the same large complete vcf.gz file but differing in the |
@jmarshall Sorry if I'm causing confusion; it's the BAM header that is bgzipped, not the variant file. I expected to get a plain text SAM header from the /reads/ endpoint with class=header, but OTOH the request is for "bam" format so my expectation is not reasonable. It is bgzipped, but that's what I asked for. I don't think stripping range and other headers is that unusual; crazy is a value judgement. The consortium of Boston hospitals here does that, for example: client requests are all run through a proxy, for example something called "Squid", and non-whitelisted headers are stripped. The range header is almost never whitelisted by default. We have managed to get our servers whitelisted from this stripping, but it's a hassle. One value add htsget potentially provides over just hosting indexed files is a means to do range queries on BAM and VCF files without the use of "range" headers; if the ticket returns URLs with these headers, that value add is negated. This is not a spec thing, it's an implementation consideration: if implementing a server that is going to provide public data I would avoid use of "range" headers, or any header not necessary, because the callback requests won't necessarily have them. |
Thank you for clarifying that that bit was about reads and BAM, not variants and VCF. It's slightly unfortunate that you can't currently ask for That was indeed a tongue-in-cheek value judgement, hence the smiley. If you would like the htsget spec maintainers to be aware of those header-related implementation considerations and for them to consider adding a note discouraging such implementations, please raise your second paragraph as an hts-specs issue. |
Thanks for merging, Jeremy @jb-adams Nope; you do not need to configure anything if you just run it locally with default settings. Default setting allows CORS from Please try as follows:
Verify CORS with curl as follows:
HTH |
Hi all, this thread has now diverged a bit, but variants are supported now in master so I'm closing this. See dev/htsget.html for a browser example, and test/testHtsgetReader.js for some (minimal) node unit tests. |
Thanks everyone for your work into this. I clearly missed out on a long and interesting conversation 😂 |
The documentation doesn't seem to mention support for the Variants endpoint of an htsget API server. I assume the variants support of htsget will just follow the same as https://github.com/igvteam/igv.js/wiki/Variant-Track?
btw, the sourceType is likely outdated