Improving parallelism for medaka_consensus #350

Open

jjkoehorst opened this issue Jan 27, 2022 · 6 comments

@jjkoehorst

Is your feature request related to a problem? Please describe.
We are working on some large datasets (20,000 to over 100,000 contigs) and have realised that one of the bottlenecks in our workflow is medaka, as it runs mostly on a single thread.

Describe the solution you'd like
A simple approach using the medaka_consensus method in which the files are split as described in https://github.com/nanoporetech/medaka#improving-parallelism

Describe alternatives you've considered
I am not currently aware of an existing solution. As my bash skills are not that great, I am planning to create a medaka_consensus.py script that performs the steps described in https://github.com/nanoporetech/medaka#improving-parallelism with a more sensible default parallel approach. Is there interest from your end in eventually picking this up and incorporating the script into medaka, or does something like that already exist?

@cjw85
Member

cjw85 commented Jan 27, 2022

We do not have generic automated scripting that implements what is noted in the README. It wouldn't be too difficult to add to the existing medaka_consensus script in a way that solves the problem for 80% of use cases.

The required additions would be to:
a) create lists of contigs to be processed in batches by medaka consensus
b) use gnu parallel (or similar) to process these batches
c) pass all outputs to medaka stitch

This could be more refined (there are lots of questions around optimal batching), but it should be sufficient for a lot of cases; a rough sketch of the idea follows below.
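
A very rough sketch of those three steps in Python, using a thread pool in place of GNU parallel. This is not a supported script: the `medaka consensus` / `medaka stitch` command lines are copied from the README's "Improving parallelism" example and the output names are made up, so check `--help` for your installed version before relying on anything like it.

```python
#!/usr/bin/env python3
"""Sketch: batch contigs, polish batches in parallel, then stitch.

Assumes reads are already aligned to the draft (calls_to_draft.bam) and
that the draft has a .fai index. Flags follow the README example; they
may differ between medaka versions.
"""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_batch(bam, contigs, out_hdf, model, threads):
    # (b) one `medaka consensus` job per batch of contigs
    cmd = [
        "medaka", "consensus", bam, str(out_hdf),
        "--model", model, "--threads", str(threads),
        "--region", *contigs,
    ]
    subprocess.run(cmd, check=True)
    return out_hdf


def main(bam, fai, model="r941_min_high_g360", jobs=4, threads=2, batch_size=100):
    # (a) read contig names from the draft's .fai index and batch them naively
    with open(fai) as fh:
        contigs = [line.split("\t")[0] for line in fh]
    batches = [contigs[i:i + batch_size] for i in range(0, len(contigs), batch_size)]

    outdir = Path("medaka_batches")
    outdir.mkdir(exist_ok=True)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [
            pool.submit(run_batch, bam, batch, outdir / f"batch_{i}.hdf", model, threads)
            for i, batch in enumerate(batches)
        ]
        hdfs = [future.result() for future in futures]

    # (c) collate all per-batch outputs with `medaka stitch`
    subprocess.run(["medaka", "stitch", *map(str, hdfs), "polished.fasta"], check=True)


if __name__ == "__main__":
    main(*sys.argv[1:3])
```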

By the way, there is an outstanding issue that artificially makes medaka slow at processing lots of short contigs: the program reopens the input BAM file (and its index) for every region of data it tries to process. This is a fixed overhead and can be slow for large BAMs. I mention this because it may mean you observe less-than-linear time scaling with the number of compute processes. It's on my TODO list to fix this.

@jjkoehorst
Author

Thanks for the update! A solution covering 80% of cases would be great, and I look forward to the fix for the BAM-reopening issue. I will write a script for now, but it would be great if your solution becomes available in the future.

@jjkoehorst
Author

I have a first draft working, but I was curious whether there is a trade-off between the number of contigs and their total size (besides the terminal command-length limit). I was thinking of making chunks of 100 contigs each, or up to a maximum size; I am not sure about the numbers yet. All of this will be configurable via command-line arguments.

@cjw85
Member

cjw85 commented Jan 28, 2022

The core algorithm of medaka is applied to overlapping chunks of base-count matrices derived from read pileups (much like samtools tview would show). These chunks are stacked to form N x M x D matrices: N is the number of chunks (--batch_size, 100), M the width of a chunk (--chunk_len, 10000), and D the number of distinct base counts we make (a fixed constant, 10). We ordinarily stack chunks across contigs, so an N x M x D matrix may contain data from multiple contigs.
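
Purely as an illustration of the shapes described above (this is not medaka's internal API, just numpy):

```python
import numpy as np

# A batch of N pileup chunks, each M positions wide, with D base counts
# per position; the values mirror the defaults quoted above.
N, M, D = 100, 10000, 10   # --batch_size, --chunk_len, fixed constant
batch = np.zeros((N, M, D), dtype=np.float32)
print(batch.shape)  # (100, 10000, 10)
```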

However, and this is where my earlier point about short contigs comes in: if an individual counts matrix is narrower than the fixed M used by the program, it is put to one side for independent processing. We don't try to pad and stack these small matrices for parallel processing, as we found doing so affected accuracy.

Putting these things together: if your contigs are long (tens of times longer than M), the number of contigs given to each parallel invocation of medaka consensus will not matter (there is a sufficient amount of data for efficient compute), and you should think about batching by total length. However, if you have lots of small contigs there will be no internal parallelisation of compute within the program, and it may be better to create additional parallel processes based on the number of contigs.
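
As a hedged illustration of that advice, batching could look something like the helper below (the function name and default thresholds are made up for this sketch, not part of medaka):

```python
def batch_contigs(contig_lengths, max_bases=10_000_000, max_contigs=100):
    """Yield batches of contig names from (name, length) pairs, e.g. read
    from the draft's .fai index. Long contigs fill a batch by cumulative
    length; many short contigs are capped by count instead."""
    batch, bases = [], 0
    for name, length in contig_lengths:
        if batch and (bases + length > max_bases or len(batch) >= max_contigs):
            yield batch
            batch, bases = [], 0
        batch.append(name)
        bases += length
    if batch:
        yield batch
```

Each yielded batch could then be passed to one medaka consensus job, as in the earlier sketch.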

I'm going to look at making optimisations around the handling of short contig pieces (through reusing BAM file handles), to help mitigate the above.

I'm intrigued: what assemblies do you have with 100,000s of contigs? Good human assemblies do not typically surpass 1000 contigs, with many of those reflecting homologous repeat elements which are tricky to obtain a consensus for.

@jjkoehorst
Author

Due to the command-length limitations I can cap batches by total contig size (default 10 Mb?) or by number of contigs (100?, which works out to roughly 2,000 characters on the command line); I still have to check what the terminal limits are in general.
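
For reference, one way to check the kernel's argument-length limit on a given (POSIX) machine; the practical per-command budget is somewhat lower because the environment also counts against it:

```python
import os

# Typical Linux systems report e.g. 2097152 (2 MiB).
print(os.sysconf("SC_ARG_MAX"))
```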

I had an assembly with ±120,000 contigs, as it is a combination of host (non-human) and microbiome data, and in the future we will do further deep sequencing of mixed environments where we also expect many contigs due to the vast number of species present.

@jjkoehorst
Author

A first draft is available at https://git.wur.nl/unlock/scripts/-/blob/master/medaka/medaka.py. Not many options from the individual tools are exposed yet, but the key ones are.
