extract transcript sequences from clinker output #15

Open
maximillo opened this issue Mar 12, 2019 · 4 comments

maximillo commented Mar 12, 2019

Hi Breon,

This might not be quite relevant to your development, but I'd really appreciate some input from you. Clinker weighs the different variants of a fusion by associating the number of split reads with the corresponding junction. Is there a way to extract the full transcript sequence of each "variant" of a fusion from the clinker output files? Can you please advise? Thanks a lot!

Max

@breons breons self-assigned this Mar 13, 2019
Contributor

breons commented Mar 13, 2019

Hi Max,

Good one!

I think this should be fairly straightforward. You will notice that in the results/alignment/sample_name folder there are two files, junctions.txt and fusion_locations.bed.

fusion_locations.bed gives you the coordinates at which the two superTranscripts have been concatenated, and junctions.txt gives you the read count for each junction found. With this information you should be able to list all fusion variants: entries that span the fusion boundary indicate a fusion breakpoint.

Once you have that list, you should be able to go to reference/fst_references.fasta, locate the genes involved in your fusion, and take the sequences that correspond to the coordinates you found earlier.

An example: suppose you have a GENE1:GENE2 fusion where GENE1 and GENE2 are each 1000 bases long, and junctions.txt indicates a junction between positions 200 and 1400. Position 1400 in the concatenated sequence corresponds to position 400 in GENE2, so you can go to fst_reference.fasta and take the first 200 bases of GENE1 and the last 600 bases of GENE2. Concatenate those and you should have your result.
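
The worked example above can be sketched in Python. This is only an illustration, not part of Clinker: the function name `fusion_variant_sequence` is hypothetical, and it assumes the two gene sequences are concatenated directly with no padding between them, that junction coordinates are positions in that concatenated sequence, and that slicing at those positions matches the coordinate convention of junctions.txt (which should be checked against real output).

```python
def fusion_variant_sequence(gene1_seq, gene2_seq, junction_start, junction_end):
    """Join GENE1 up to junction_start with GENE2 from junction_end onward.

    junction_start and junction_end are positions in the concatenated
    GENE1 + GENE2 sequence (assumed convention, see the note above).
    """
    len1 = len(gene1_seq)
    total = len1 + len(gene2_seq)
    # The junction must start in GENE1 and end in GENE2 to span the boundary.
    if not (0 <= junction_start <= len1 <= junction_end <= total):
        raise ValueError("junction does not span the GENE1/GENE2 boundary")
    left = gene1_seq[:junction_start]        # GENE1 up to the breakpoint
    right = gene2_seq[junction_end - len1:]  # GENE2 from the breakpoint on
    return left + right


# Toy sequences standing in for the two 1000-base genes in the example.
gene1 = "A" * 1000
gene2 = "C" * 1000
variant = fusion_variant_sequence(gene1, gene2, 200, 1400)
print(len(variant))  # 800: 200 bases of GENE1 plus 600 bases of GENE2
```

In practice the two gene sequences would be read from the FASTA file rather than built by hand, and each junction in junctions.txt that spans the boundary would yield one variant sequence.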

What do you think?
Breon.

Author

maximillo commented Mar 13, 2019

Hi Breon,

Thank you so much for your detailed instructions! This definitely explains a lot. But done this way, every transcript starts at position 0 of GENE1, "breaks" at the junction point in GENE1, continues from the junction point in GENE2, and stops at the end of GENE2, meaning all transcripts have the same start and end? Looking at the result for one of our samples (attached), it's obvious that some transcripts don't start at the very beginning of GENE1 or end at the very end of GENE2. This made me wonder whether the starts and ends of these transcripts (in the Transcripts track) were computed or simply extracted from existing reference databases. Can you please comment on this? I probably missed some critical link here. Sorry if this sounds completely dumb :)

On a side note, I tried using pizzly to call the fusions as well. In its results, pizzly gives all possible variants of a fusion in a format like this:
ENST00000332149_0:79_ENST00000398905_80:4806
ENST00000332149_0:79_ENST00000398907_80:4809
ENST00000398585_0:116_ENST00000288319_121:4919
The first two share the same junction and differ only in the end position of GENE2. The third differs in both the junction and the end of GENE2. I don't know exactly how this was done, but it should give a better idea of what I'm looking for.
TMPRSS2-ERG.pdf
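
To compare variants like the three identifiers above, they can be split into their parts and grouped by junction. The sketch below is an assumption-laden illustration: the field layout (transcript ID, `start:end`, transcript ID, `start:end`) is inferred only from the strings quoted in this comment, not from pizzly documentation, and the names `FusionVariant`, `parse_variant_id`, and `group_by_junction` are hypothetical.

```python
import re
from typing import NamedTuple


class FusionVariant(NamedTuple):
    transcript1: str
    start1: int
    end1: int
    transcript2: str
    start2: int
    end2: int


# Layout inferred from the quoted IDs: <tx1>_<start>:<end>_<tx2>_<start>:<end>
_VARIANT_RE = re.compile(r"^([^_]+)_(\d+):(\d+)_([^_]+)_(\d+):(\d+)$")


def parse_variant_id(variant_id):
    """Split one identifier into transcript names and integer coordinates."""
    match = _VARIANT_RE.match(variant_id.strip())
    if match is None:
        raise ValueError(f"unrecognised variant id: {variant_id!r}")
    t1, s1, e1, t2, s2, e2 = match.groups()
    return FusionVariant(t1, int(s1), int(e1), t2, int(s2), int(e2))


def group_by_junction(variant_ids):
    """Group variants that share the same first transcript and junction."""
    groups = {}
    for variant_id in variant_ids:
        v = parse_variant_id(variant_id)
        key = (v.transcript1, v.end1, v.start2)
        groups.setdefault(key, []).append(v)
    return groups


ids = [
    "ENST00000332149_0:79_ENST00000398905_80:4806",
    "ENST00000332149_0:79_ENST00000398907_80:4809",
    "ENST00000398585_0:116_ENST00000288319_121:4919",
]
for junction, members in group_by_junction(ids).items():
    print(junction, [m.end2 for m in members])
```

Grouping on the first transcript plus the two junction coordinates puts the first two identifiers together and leaves the third on its own, matching the description above.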

Max

Contributor

breons commented Mar 17, 2019

Hi Max,

Apologies! I misunderstood. Yes, the fusion transcripts could certainly start and end at different points as well. Also, the transcript track is a representation of an existing reference database, which has its own visual benefit too.

This is a bit out of scope for Clinker, but let me have a chat to one of my colleagues as I think there may be something that can help you do this.

Cheers,
Breon.

Author

Hi Breon,

Sounds great, I really appreciate your effort in helping me out!!!

Max

2 participants