How to understand the optimisation goals of smoothxg #204

Open
YX-Xiang opened this issue Jan 19, 2024 · 1 comment

Comments

@YX-Xiang

I hope this message finds you well. I am currently exploring the smoothxg algorithm, and I came across a statement in the paper that raised some questions for me:

"A key issue is that pairwise alignments derived across our input are not mutually normalized, leading to different representations of small variants like indels in low-complexity sequences, which in turn generate complex looping motifs that are difficult to process."

I am particularly interested in understanding what is meant by "complex looping motifs" in this context. Could you provide a simple example or elaborate on the nature of these motifs? I am eager to gain a deeper insight into this aspect of the algorithm.

Thank you for your time and assistance. I appreciate the work you have put into smoothxg, and I look forward to hearing from you.

@ekg
Collaborator

ekg commented Jan 19, 2024

You can see this directly by saving the output of pggb after each step and comparing the results. Look at the seqwish graph vs the final graph. There will be motifs in short tandem repeats which become extremely dense and complex. For instance, a single C node might represent an entire 20bp homopolymer with many diverse alleles. This isn't necessarily wrong, but it's a representation that can be hard to reason about and work with. It's hard to visualize and doesn't match the MSA-style models that people tend to understand.
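To make the "single base represents a whole homopolymer" case concrete, here is a toy sketch in Python. It is not seqwish's implementation. It only assumes that the graph is induced by merging every pair of bases that some pairwise alignment matches (a transitive closure), done here with a small union-find. The three sequences and the hand-written alignments are made up for illustration: `a` vs `b` places the 1bp gap at the start of the A run, while `c` vs `b` places it at the end, so the alignments are not mutually normalized.

```python
# Toy example: three sequences that differ by one A in a homopolymer run.
seqs = {"a": "GAAAT", "b": "GAAT", "c": "GAAAT"}

# Hypothetical pairwise alignments as lists of matched (pos_in_first, pos_in_second).
alignments = [
    ("a", "b", [(0, 0), (2, 1), (3, 2), (4, 3)]),  # gap placed at a[1] (left side of the run)
    ("c", "b", [(0, 0), (1, 1), (2, 2), (4, 3)]),  # gap placed at c[3] (right side of the run)
    ("a", "c", [(i, i) for i in range(5)]),        # identical sequences, base-for-base
]

# Union-find over every (sequence, position) pair.
parent = {(name, i): (name, i) for name, s in seqs.items() for i in range(len(s))}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Merge every pair of matched bases; the closure is transitive by construction.
for first, second, pairs in alignments:
    for i, j in pairs:
        assert seqs[first][i] == seqs[second][j]  # only identical bases are matched
        union((first, i), (second, j))

# Each equivalence class becomes one node; each path step becomes one edge.
nodes = {find(x) for x in parent}
edges = set()
for name, s in seqs.items():
    for i in range(len(s) - 1):
        edges.add((find((name, i)), find((name, i + 1))))

self_loops = [e for e in edges if e[0] == e[1]]
print(len(nodes), "nodes,", len(edges), "edges,", len(self_loops), "self-loop(s)")
```

Because the two gapped alignments shift the run by one position relative to each other, every A in all three sequences ends up in the same class. The induced graph is G -> A -> T with a self-loop on A, and each sequence spells its run length by going around that loop a different number of times.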

Smoothxg's optimization goal is to ensure that the graph has a local partial order, where "local" is defined by the length parameter given to -G. This defaults to ~1kbp, but you can increase or decrease it, with the caveat that very large values are computationally prohibitive because the partial order alignment algorithms we use are quadratic in sequence and graph length.
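As a rough sketch of what "has a partial order" means for a small block: the block's nodes can be topologically sorted, i.e. the block is acyclic. The Python below checks this with Kahn's algorithm on two hand-made edge lists. The function names `has_partial_order` and `dp_cells`, and both toy graphs, are illustrative assumptions and not part of smoothxg. `dp_cells` is only a back-of-the-envelope count of alignment matrix cells, to show why doubling the block length roughly quadruples the work.

```python
from collections import defaultdict, deque

def has_partial_order(edges):
    """Return True if the directed graph is acyclic (Kahn's algorithm)."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    sorted_count = 0
    while queue:
        u = queue.popleft()
        sorted_count += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Any node left unsorted sits on (or behind) a cycle.
    return sorted_count == len(nodes)

# A homopolymer collapsed onto one node with a self-loop: no partial order.
looped = [("G", "A"), ("A", "A"), ("A", "T")]

# An MSA-like layout of the same run: one node per column plus a skip edge
# for the shorter allele. This one can be topologically sorted.
smoothed = [("G", "A1"), ("A1", "A2"), ("A2", "A3"), ("A3", "T"), ("G", "A2")]

def dp_cells(seq_len, graph_len):
    """Crude cost model: one dynamic-programming cell per (sequence base, graph base)."""
    return seq_len * graph_len

print(has_partial_order(looped), has_partial_order(smoothed))
print(dp_cells(2000, 2000) / dp_cells(1000, 1000))
```

Under this simplified model, a block of about 1kbp costs on the order of a million cells per aligned sequence, and a 10kbp block costs about a hundred times that, which is why very large block lengths become impractical.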
