Cobalt issue with read counting in PacBio HiFi reads #485

proteinosome · 2023-10-26T09:31:16Z

Dear developers, I've been testing out HMFTools suite on PacBio HiFi sequencing dataset and it looks like straight out of the box Purple is able to estimate purity and ploidy very well for in-silico titrated cell lines (HCC1395 and COLO829), which is great!

However, upon closer examination the CNV segmentation isn't working very well. When I looked at Cobalt output, it looks like the read counts were pretty off. I'm running this on a 60X tumor/normal dataset but the median read count in 1 kbp bins were only 2 and 4 respectively. May I check with you if you know why Cobalt is counting the reads incorrectly? FWIW The depth output from Amber actually looks good.

Thank you.

p-priestley · 2023-10-26T09:48:37Z

We have never considered longer reads previously. COBALT use 1kb bins, based on the starting base of each read. For long reads, this will not work well as they will be sparse. Better of course would be for COBALT to count the average coverage in each 1kb window. I will look into this change and revert

Amber simply counts the depth at each base so there is no isssue.

proteinosome · 2023-10-26T10:01:39Z

Thank you @p-priestley that sounds good to me! Yeah if that's how the counting is done I could see why now. I think with that change the Purple suite should work really well. It has worked surprisingly well even with the count being far off. In COLO829 the somatic CNV segmentation actually looked good once Purple does a joint segmentation of Cobalt + Amber output.

hongwingl · 2023-11-15T21:52:41Z

Hi this count depth change has been committed to the code. We will test more and make releases soon.

proteinosome · 2023-11-20T02:21:33Z

Thanks @hongwingl. I did a couple of tests and indeed the depths are calculated correctly. However, there are now way too many segments likely due to the relatively high depth fluctuation from 1kbp windows. Setting gamma to 200 I'm getting 4k PCF segments in germline samples.

With CNVKit previously I've had to set the bin size to 10 kbp for long-reads, and I think we're probably running into similar issue here. Not sure if there's any straightforward way to try fixing this, but thanks for working on the depth count issue!

p-priestley · 2023-11-20T03:00:26Z

4000 segments for a germline sample does not sound ridiculous to me. That is inline with roughly what we would expect from the number of deletions and duplicaitons in the germline in a typical sample

proteinosome · 2023-11-20T03:18:23Z

Thanks @p-priestley ! I guess I had been used to seeing hundreds of segments at most in germline from other CNV caller where generally only very large events are called as segments. I'll do more investigation into the segments to see if I can make sense of them.

proteinosome · 2023-11-20T06:13:14Z

Here's a Circos plot of HCC1395(BL) from Purple FYI to show the high RD/BAF fluctuation. Do you think there's probably a need to smooth out the points?

p-priestley · 2023-11-20T06:46:25Z

The input looks very noisy compared to what I am used to, but I have never looked at PacBio data. I am also unfamiliar this cell line. Are you able to post the pictures for COLO829T also as I know that much better.

Is that with gamma set to 200 already? Perhaps you could set it higher (eg. 1000?)

Lastly, what is the depth here roughly? COBALT should consolidate to larger windows automatically if the average depth of coverage <8 ( https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md#depth-window-consoldiation)

proteinosome · 2023-11-21T02:59:55Z

Long-reads have much lesser number of reads compared to short-reads at equal coverage, and I think that might be the reason why the read depths and BAF are relatively noisier with small bin windows.

Here's COLO829 with gamma 200 and gamma 1000:

Both are 30X tumor, 30X normal.

I noticed as well that Purple is calling 100% purity and 1.5 ploidy by default but the solution is just barely better than 99% purity and 3.0 ploidy (which should be the correct ploidy). I wonder if that is due to the noisy nature or is that also seen in short-reads dataset?

p-priestley · 2023-11-21T03:22:06Z

Here is our result from high quality short read sequencing

p-priestley · 2023-11-21T03:23:34Z

Your 1000 result looks ok. Different noise characteristics may lead to different fits. There are various paratmers in PURPLE which can control this if you are really interested.

p-priestley · 2023-11-21T03:24:09Z

The input.png is MUCH cleaner for short read data:

See here:
https://github.com/hartwigmedical/hmftools/tree/master/purple#circos

proteinosome · 2023-11-21T08:45:03Z

Thanks @p-priestley for sharing. I think that's what I expected, too, in terms of the noise from short-reads based on some of my past observations. Perhaps a larger bin windows option for long-reads would be useful to smooth out the noise. That might have an effect in masking out focal CNV, but those should likely be captured by long-reads SV caller, anyway.

I'll stick with gamma=1000 for now.

FYI in case you guys are interested to look into HiFi dataset, both the COLO829 and HCC1395 datasets are available on our website: https://www.pacb.com/connect/datasets/

Thank you.

hongwingl self-assigned this Nov 15, 2023

charlesshale closed this as completed Feb 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cobalt issue with read counting in PacBio HiFi reads #485

Cobalt issue with read counting in PacBio HiFi reads #485

proteinosome commented Oct 26, 2023

p-priestley commented Oct 26, 2023

proteinosome commented Oct 26, 2023

hongwingl commented Nov 15, 2023

proteinosome commented Nov 20, 2023

p-priestley commented Nov 20, 2023

proteinosome commented Nov 20, 2023

proteinosome commented Nov 20, 2023

p-priestley commented Nov 20, 2023

proteinosome commented Nov 21, 2023

p-priestley commented Nov 21, 2023

p-priestley commented Nov 21, 2023

p-priestley commented Nov 21, 2023

proteinosome commented Nov 21, 2023

Cobalt issue with read counting in PacBio HiFi reads #485

Cobalt issue with read counting in PacBio HiFi reads #485

Comments

proteinosome commented Oct 26, 2023

p-priestley commented Oct 26, 2023

proteinosome commented Oct 26, 2023

hongwingl commented Nov 15, 2023

proteinosome commented Nov 20, 2023

p-priestley commented Nov 20, 2023

proteinosome commented Nov 20, 2023

proteinosome commented Nov 20, 2023

p-priestley commented Nov 20, 2023

proteinosome commented Nov 21, 2023

p-priestley commented Nov 21, 2023

p-priestley commented Nov 21, 2023

p-priestley commented Nov 21, 2023

proteinosome commented Nov 21, 2023