extract transcript sequences from clinker output #15

maximillo · 2019-03-12T18:38:59Z

Hi Breon,

This might not be directly relevant to your development, but I'd really appreciate your input. Clinker weighs the different variants of a fusion by associating the number of split reads with the corresponding junction. Is there a way to extract the full transcript sequence of each "variant" of a fusion from the clinker output files? Can you please advise? Thanks a lot!

Max

breons · 2019-03-13T10:52:35Z

Hi Max,

Good one!

I think this should be fairly straightforward. You will notice that in the results/alignment/sample_name folder there are two files, junctions.txt and fusion_locations.bed.

fusion_locations.bed gives you the coordinates at which the two superTranscripts have been concatenated, and junctions.txt gives you the read count for each junction found. With this information you should be able to get a list of all fusion variants: any junction that spans the fusion boundary indicates a fusion breakpoint.
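The filtering step described above can be sketched in a few lines of Python. This is only an illustration: the thread does not describe the column layout of junctions.txt or fusion_locations.bed, so the sketch works on values already loaded into memory, and the `Junction` type, the `spanning_junctions` helper, and the example numbers are all hypothetical.

```python
from typing import List, NamedTuple


class Junction(NamedTuple):
    """Hypothetical in-memory record for one junction (not clinker's file format)."""
    start: int  # junction start, in concatenated superTranscript coordinates
    end: int    # junction end, in the same coordinate system
    reads: int  # number of split reads supporting the junction


def spanning_junctions(junctions: List[Junction], boundary: int) -> List[Junction]:
    """Keep junctions that start in the first gene and end in the second.

    `boundary` is the position where the two superTranscripts were
    concatenated (assumed here to come from fusion_locations.bed).
    """
    return [j for j in junctions if j.start < boundary <= j.end]


# Made-up example: two 1000-base genes concatenated at position 1000.
junctions = [
    Junction(200, 1400, 35),   # crosses the boundary: a fusion variant
    Junction(150, 480, 12),    # ordinary splice junction inside the first gene
    Junction(1100, 1700, 8),   # ordinary splice junction inside the second gene
]
variants = spanning_junctions(junctions, boundary=1000)

# List the fusion variants, best supported first.
for j in sorted(variants, key=lambda j: j.reads, reverse=True):
    print(j.start, j.end, j.reads)
```

Sorting by read count mirrors how clinker weighs variants by split-read support, but the exact ranking you want is up to you.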

Once you have that list, you can go to reference/fst_references.fasta, locate the genes involved in your fusion, and take the sequences that correspond to the coordinates you found earlier.

An example: if you have a GENE1:GENE2 fusion, where GENE1 and GENE2 are each 1000 bases long, and junctions.txt indicates a junction between 200 and 1400, then position 1400 in the concatenated sequence corresponds to base 400 of GENE2 (1400 - 1000). You can go to fst_reference.fasta and take the first 200 bases of GENE1 and the last 600 bases of GENE2. Concatenate those and you should have your result.

What do you think?
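As a rough sketch of that arithmetic in Python: the function name `fusion_variant_sequence` is made up for illustration, the sequences are placeholders rather than real superTranscripts, and the coordinates are assumed to be 0-based offsets into the concatenated GENE1+GENE2 sequence, which the thread does not state explicitly.

```python
def fusion_variant_sequence(gene1_seq: str, gene2_seq: str,
                            junction_start: int, junction_end: int) -> str:
    """Stitch one fusion variant from two superTranscript sequences.

    junction_start and junction_end are positions in the concatenated
    sequence (gene1 followed by gene2), assumed 0-based.
    """
    boundary = len(gene1_seq)  # where gene2 begins in the concatenation
    total = boundary + len(gene2_seq)
    if not (0 <= junction_start <= boundary <= junction_end <= total):
        raise ValueError("junction does not span the fusion boundary")
    gene2_offset = junction_end - boundary  # e.g. 1400 - 1000 = 400
    return gene1_seq[:junction_start] + gene2_seq[gene2_offset:]


# Placeholder sequences: two genes of 1000 bases each.
gene1 = "A" * 1000
gene2 = "C" * 1000

variant = fusion_variant_sequence(gene1, gene2, 200, 1400)
print(len(variant))  # 800: first 200 bases of GENE1 + last 600 bases of GENE2
```

Note that this always starts at the beginning of GENE1 and runs to the end of GENE2; it says nothing about where an individual transcript actually starts or ends.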
Breon.

maximillo · 2019-03-13T21:30:23Z

Hi Breon,

Thank you so much for your detailed instructions! This definitely explains a lot. But done this way, every transcript starts at position 0 of GENE1, "breaks" at the junction point in GENE1, continues at the junction point in GENE2, and stops at the end of GENE2, meaning all transcripts share the same start and end? Looking at the result for one of our samples (attached), it's obvious that some transcripts don't start at the very beginning of GENE1 or end at the very end of GENE2. This made me wonder whether the starts and ends of these transcripts (in the Transcripts track) were computed, or simply extracted from existing reference databases. Can you please comment on this? I have probably missed some critical link here. Sorry if this sounds completely dumb :)

On a side note, I tried using pizzly to call the fusions as well. In the results, pizzly gives all possible variants of fusions in the format of, for instance:
ENST00000332149_0:79_ENST00000398905_80:4806
ENST00000332149_0:79_ENST00000398907_80:4809
ENST00000398585_0:116_ENST00000288319_121:4919
The first two share the same junction and differ only in the end position of GENE2. The third differs in both the junction and the end of GENE2. I don't know exactly how this was done, but it should give a better idea of what I'm looking at now.
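For comparing such identifiers, a small Python sketch can split them into their parts and group them by junction. The field interpretation (transcript ID, then start:end within that transcript, once per partner) is an assumption based on the description above rather than on pizzly's documentation, and `Segment`, `parse_fusion_id`, and the grouping key are hypothetical names and choices.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    transcript: str  # Ensembl transcript ID
    start: int       # start within that transcript (assumed meaning)
    end: int         # end within that transcript (assumed meaning)


# Two "<transcript>_<start>:<end>" parts joined by an underscore.
_FUSION_ID = re.compile(r"^(ENST\d+)_(\d+):(\d+)_(ENST\d+)_(\d+):(\d+)$")


def parse_fusion_id(fusion_id: str) -> Tuple[Segment, Segment]:
    """Split an identifier into its 5' and 3' segments."""
    match = _FUSION_ID.match(fusion_id)
    if match is None:
        raise ValueError(f"unrecognised fusion identifier: {fusion_id!r}")
    t1, s1, e1, t2, s2, e2 = match.groups()
    return Segment(t1, int(s1), int(e1)), Segment(t2, int(s2), int(e2))


ids = [
    "ENST00000332149_0:79_ENST00000398905_80:4806",
    "ENST00000332149_0:79_ENST00000398907_80:4809",
    "ENST00000398585_0:116_ENST00000288319_121:4919",
]

# Group by (5' transcript, 5' end, 3' start) as a simple junction key.
by_junction = defaultdict(list)
for fusion_id in ids:
    five, three = parse_fusion_id(fusion_id)
    by_junction[(five.transcript, five.end, three.start)].append(fusion_id)

for key, members in by_junction.items():
    print(key, len(members))
```

With the three identifiers above, the first two fall into one group and the third into its own, matching the observation that the first two share a junction.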
TMPRSS2-ERG.pdf

Max

breons · 2019-03-17T23:28:07Z

Hi Max,

Apologies! I misunderstood. Yes, the fusion transcripts could certainly start and end at different points as well. Also, the transcript track is a representation of an existing reference database, which has its own visual benefit too.

This is a bit out of scope for Clinker, but let me have a chat to one of my colleagues as I think there may be something that can help you do this.

Cheers,
Breon.

maximillo · 2019-03-18T17:07:54Z

Hi Breon,

Sounds great, I really appreciate your effort in helping me out!

Max

breons added the question label Mar 13, 2019

breons self-assigned this Mar 13, 2019
