Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cobalt issue with read counting in PacBio HiFi reads #485

Closed
proteinosome opened this issue Oct 26, 2023 · 13 comments
Closed

Cobalt issue with read counting in PacBio HiFi reads #485

proteinosome opened this issue Oct 26, 2023 · 13 comments
Assignees

Comments

@proteinosome
Copy link

Dear developers, I've been testing out HMFTools suite on PacBio HiFi sequencing dataset and it looks like straight out of the box Purple is able to estimate purity and ploidy very well for in-silico titrated cell lines (HCC1395 and COLO829), which is great!

However, upon closer examination the CNV segmentation isn't working very well. When I looked at Cobalt output, it looks like the read counts were pretty off. I'm running this on a 60X tumor/normal dataset but the median read count in 1 kbp bins were only 2 and 4 respectively. May I check with you if you know why Cobalt is counting the reads incorrectly? FWIW The depth output from Amber actually looks good.

Thank you.

@p-priestley
Copy link
Contributor

We have never considered longer reads previously. COBALT use 1kb bins, based on the starting base of each read. For long reads, this will not work well as they will be sparse. Better of course would be for COBALT to count the average coverage in each 1kb window. I will look into this change and revert

Amber simply counts the depth at each base so there is no isssue.

@proteinosome
Copy link
Author

Thank you @p-priestley that sounds good to me! Yeah if that's how the counting is done I could see why now. I think with that change the Purple suite should work really well. It has worked surprisingly well even with the count being far off. In COLO829 the somatic CNV segmentation actually looked good once Purple does a joint segmentation of Cobalt + Amber output.

@hongwingl
Copy link
Contributor

Hi this count depth change has been committed to the code. We will test more and make releases soon.

@hongwingl hongwingl self-assigned this Nov 15, 2023
@proteinosome
Copy link
Author

Thanks @hongwingl. I did a couple of tests and indeed the depths are calculated correctly. However, there are now way too many segments likely due to the relatively high depth fluctuation from 1kbp windows. Setting gamma to 200 I'm getting 4k PCF segments in germline samples.

With CNVKit previously I've had to set the bin size to 10 kbp for long-reads, and I think we're probably running into similar issue here. Not sure if there's any straightforward way to try fixing this, but thanks for working on the depth count issue!

@p-priestley
Copy link
Contributor

4000 segments for a germline sample does not sound ridiculous to me. That is inline with roughly what we would expect from the number of deletions and duplicaitons in the germline in a typical sample

@proteinosome
Copy link
Author

Thanks @p-priestley ! I guess I had been used to seeing hundreds of segments at most in germline from other CNV caller where generally only very large events are called as segments. I'll do more investigation into the segments to see if I can make sense of them.

@proteinosome
Copy link
Author

Here's a Circos plot of HCC1395(BL) from Purple FYI to show the high RD/BAF fluctuation. Do you think there's probably a need to smooth out the points?

image
image

@p-priestley
Copy link
Contributor

The input looks very noisy compared to what I am used to, but I have never looked at PacBio data. I am also unfamiliar this cell line. Are you able to post the pictures for COLO829T also as I know that much better.

Is that with gamma set to 200 already? Perhaps you could set it higher (eg. 1000?)

Lastly, what is the depth here roughly? COBALT should consolidate to larger windows automatically if the average depth of coverage <8 ( https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md#depth-window-consoldiation)

@proteinosome
Copy link
Author

Long-reads have much lesser number of reads compared to short-reads at equal coverage, and I think that might be the reason why the read depths and BAF are relatively noisier with small bin windows.

Here's COLO829 with gamma 200 and gamma 1000:

COLO829 gamma200 circos

COLO829 gamma1000 circos

Both are 30X tumor, 30X normal.

I noticed as well that Purple is calling 100% purity and 1.5 ploidy by default but the solution is just barely better than 99% purity and 3.0 ploidy (which should be the correct ploidy). I wonder if that is due to the noisy nature or is that also seen in short-reads dataset?

@p-priestley
Copy link
Contributor

Here is our result from high quality short read sequencing
image

@p-priestley
Copy link
Contributor

Your 1000 result looks ok. Different noise characteristics may lead to different fits. There are various paratmers in PURPLE which can control this if you are really interested.

@p-priestley
Copy link
Contributor

The input.png is MUCH cleaner for short read data:

See here:
https://github.com/hartwigmedical/hmftools/tree/master/purple#circos

@proteinosome
Copy link
Author

Thanks @p-priestley for sharing. I think that's what I expected, too, in terms of the noise from short-reads based on some of my past observations. Perhaps a larger bin windows option for long-reads would be useful to smooth out the noise. That might have an effect in masking out focal CNV, but those should likely be captured by long-reads SV caller, anyway.

I'll stick with gamma=1000 for now.

FYI in case you guys are interested to look into HiFi dataset, both the COLO829 and HCC1395 datasets are available on our website: https://www.pacb.com/connect/datasets/

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants