Announcing cdot: a data provider to load lots of transcripts fast #634
OMG, I am so thrilled that this exists! I had hoped that something like this would arise to solve clear issues with UTA currency, but also as a way to make it much easier to support custom transcripts. I haven't tried cdot yet, but the direction is spot-on. What does it take to load from RefSeq? I have ~6 years of NCBI GFF3 archives. That might be useful to generate historical transcript alignments. Kudos! And thanks for the contribution. Let's do talk about working together on this... are you on the biocommons slack workspace? Try this: https://join.slack.com/t/biocommons/shared_invite/zt-12xzu037o-abkp_8a8Ta1Sc8JaBkpyqA (expires 2022-03-02).
Hi, as an example here's the script to download and generate RefSeq for 37; there are scripts for 38 and Ensembl in the same directory. I will join up on Slack at work tomorrow. I'm also keen to define a JSON transcript format - with the super-long-term goal of a GA4GH standard and pushing Ensembl and RefSeq to provide an API for every transcript they've ever made - but yeah, happy to build whatever glue is needed in the meantime.
@davmlaw I successfully used cdot because it provides a flat file for transcript mapping, and thus in principle needs no internet connection to map coordinates from, say, coding to genomic. However, this fails without a connection:
You mentioned
so what am I doing wrong with the above snippet? To clarify, my aim is to map coding to genomic coordinates without access to an internet connection. Thanks for your help!
I haven't used cdot yet, so I can't provide complete guidance. The error you received is from seqfetcher, which fetches sequences during normalization. This is distinct from the transcript alignments that cdot provides. It appears that seqfetcher failed because the call was rejected by NCBI. That could be due to a sporadic outage at NCBI, failure of those credentials, too many recent attempts from your IP, local network issues, or something else. See https://hgvs.readthedocs.io/en/stable/installation.html for details. You should think of cdot as replacing UTA. Consider installing SeqRepo to obviate remote sequence lookups.
Thank you for your answer @reece !
How does that work? That is, how can I use SeqRepo to avoid looking up sequences remotely?
Yes - exactly as Reece said - hgvs requires a way to get sequences (SeqRepo) and transcripts (UTA/cdot). You can install them all locally; for SeqRepo, copy/paste the three command-line snippets from the instructions here, then set the environment variable that points hgvs at your local SeqRepo directory.
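As a concrete illustration of the local setup - a minimal sketch, assuming the `HGVS_SEQREPO_DIR` variable described in the hgvs installation docs; the directory path below is a placeholder for wherever your SeqRepo snapshot actually lives:

```python
import os

# Point hgvs at a local SeqRepo snapshot so sequence lookups never go
# over the network. The path is a placeholder for wherever
# `seqrepo pull` put your snapshot.
os.environ["HGVS_SEQREPO_DIR"] = "/usr/local/share/seqrepo/latest"

# Set the variable *before* importing hgvs, e.g. (not run here):
#   import hgvs.dataproviders.seqfetcher
#   sf = hgvs.dataproviders.seqfetcher.SeqFetcher()  # uses the local snapshot
print(os.environ["HGVS_SEQREPO_DIR"])
```

With the variable in place, normalization and `c_to_p` calls resolve sequences from disk rather than hitting NCBI.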
Hi @davmlaw, I share the same excitement as @reece! It simplifies the DevOps of provisioning and maintaining a relational database inside Kubernetes. I tried out the cdot data provider for hgvs (https://github.com/SACGF/cdot). The two main gaps I see are missing functionality for getting the predicted amino acid change HGVS nomenclature (critical for clinical reporting) and for getting all possible transcripts for a given variant position. Are you planning to add these to the cdot data provider? I am happy to open this up as an issue in your GitHub repo if that helps. Thanks!!
Hi, I've added the issues to the cdot repo (linked above); please add any further info if needed. I hope to find time within a few weeks and will ping you when done.
Hi @roysomak4 - I've released a new version of cdot (0.2.8) that handles protein identifiers (so c_to_p works) and also all of the possible transcripts for a given position (this only works on local JSON files, not yet on the REST server). If you use local JSON files, you'll need to download the latest files. Let me know if there are any issues.
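For the transcripts-for-a-position query, a toy sketch of the underlying idea - the transcript table, coordinates, and function name here are invented for illustration; cdot's real implementation indexes its JSON transcript data:

```python
# Hypothetical miniature transcript table: {id: (contig, start, end)}.
# Accessions are real-looking but the spans are illustrative only.
TX = {
    "NM_000546.6": ("chr17", 7668402, 7687550),   # TP53-like span
    "NM_005228.5": ("chr7", 55019017, 55211628),  # EGFR-like span
}

def overlapping_transcripts(contig, pos):
    """Return IDs of transcripts whose genomic span contains pos (1-based)."""
    return sorted(
        tx_id
        for tx_id, (c, start, end) in TX.items()
        if c == contig and start <= pos <= end
    )

print(overlapping_transcripts("chr17", 7674220))  # → ['NM_000546.6']
```

A linear scan like this is fine for a handful of records; an interval tree makes the same query fast over hundreds of thousands of transcripts.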
Thanks @davmlaw! I will run my tests on the local JSON file and share the results shortly.
Hi @davmlaw, the c_to_p function is working fine. I checked with one EGFR and one TP53 variant, and the correct p. nomenclature is generated for both of them.
I am using GRCh38. Thanks,
Hi, this was due to an off-by-one error (I assumed var_g.posedit.pos.start.base would be 1 past the end). I've fixed it and upgraded cdot; the data files haven't changed, just the library.
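The bug described above is the classic interval-convention mix-up. A generic sketch (not cdot's code) of why assuming an end coordinate is "one past the last base" shifts results by one:

```python
def contains_closed(start, end, pos):
    """Closed convention: end IS the last base of the interval."""
    return start <= pos <= end

def contains_half_open(start, end, pos):
    """Half-open convention: end is one past the last base."""
    return start <= pos < end

# A feature occupying bases 101..110 (closed) would be 101..111 half-open;
# reading closed coordinates with half-open logic silently drops the
# final base of every feature:
print(contains_closed(101, 110, 110))     # → True
print(contains_half_open(101, 110, 110))  # → False: the off-by-one
```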
Thanks for the quick fix! I can confirm that the transcript function is working now (see log below). In terms of speed, I observed that starting with v0.2.8, loading the JSON file takes much longer than it did in the prior versions I tested. I am loading
Hi, yeah, it takes a while to build the interval tree, but it only does so lazily when you need it. If you don't use that functionality, it should be the same. You can't store the interval tree as JSON, but you can pickle it, so I may look into that as an optional local cache.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been stalled for 7 days with no activity. |
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event. |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Hi, cdot has been used by a few other people now, and I think I've ironed out the bugs. I might make a pull request with some (commented-out) options in the documentation examples talking about data provider options.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been stalled for 7 days with no activity. |
A few things to finish off first, but I hope to make a pull request adding cdot to the docs in a week or two. Happy to let this close for now.
Hi, thanks for HGVS project, it’s great.
I’ve written a data provider that has 5.5 times as many transcripts as UTA, plus a REST service (>10x faster than public UTA, and it works behind firewalls).
It takes a different approach from UTA - converting existing RefSeq/Ensembl GTFs to JSON.
This allows you to use flat files instead of Postgres, or a trivial (300-line) REST service. It aims to be fast and simple, and doesn't implement all of the provider functionality yet (or maybe ever - i.e. just transcript/genome conversions).
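To make the flat-file idea concrete, here is a hypothetical record of the kind such a provider could serve - the field names, identifiers, and coordinates are invented for this sketch, not cdot's actual schema:

```python
import json

# A transcript alignment stored as plain JSON: no database required,
# just a file you can ship, diff, and load anywhere.
record = json.loads("""
{
  "id": "NM_EXAMPLE.1",
  "gene": "EXAMPLE",
  "contig": "NC_000001.11",
  "strand": "+",
  "exons": [[1000, 1200], [1500, 1800]]
}
""")

def genomic_span(tx):
    """Overall genomic start/end covered by the transcript's exons."""
    starts = [start for start, _ in tx["exons"]]
    ends = [end for _, end in tx["exons"]]
    return min(starts), max(ends)

print(genomic_span(record))  # → (1000, 1800)
```

Everything a transcript/genome mapper needs - exon boundaries, strand, contig - fits naturally in a record like this, which is why flat JSON files can stand in for a relational store for this use case.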
See project: https://github.com/SACGF/cdot
This is a glue project, and everything is MIT so I’d be happy to merge it into the core HGVS code or become a BioCommons project if you’ll have it - see https://github.com/SACGF/cdot/wiki/Project-goals---directions
Thanks!