-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FASTA handling to Augur Mask #493
Add FASTA handling to Augur Mask #493
Conversation
Also, if anyone’s interested in chatting about this in real-time, happy to hop on whatever medium y’all prefer |
I'm checking this out now and may need until tomorrow afternoon to have helpful feedback. Chatting about this in real-time could be helpful then, if you're around, @danielsoneg. Thank you for continuing to take on these tricky issues! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the first pass, this makes sense and looks good. The main issue right now is the way BED files are read in assuming a header line that should not exist. I'm also a little concerned that some of the original BED coordinate logic might be incorrectly including extra positions. I'll need to spend a little time playing with this code with real data to figure out whether that's really the case.
augur/mask.py
Outdated
sitesToMask = [] | ||
Second column is chromStart, 3rd is chromEnd. Generate a range from these two columns. | ||
""" | ||
sites_to_mask = [] | ||
bed = pd.read_csv(mask_file, sep='\t') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic BED format doesn't support an explicit header line, so this line should actually include a header=None
argument to avoid skipping the first line. This is a very old bug as you can see from the mask sites for TB that repeat the first site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. Yeah, makes sense. Per the comment above, looks like the same bug's in tree
, happy to fix it when we merge the two
tests/test_mask.py
Outdated
TEST_BED_SEQUENCE = [1,2,4,6,7,8,9,10] | ||
# IF YOU UPDATE ONE OF THESE, UPDATE THE OTHER. | ||
TEST_BED="""\ | ||
Chrom ChromStart ChromEnd locus tag Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comment above for the read_bed_file
function about not supporting an explicit header in BED files.
Hey @huddlej, thanks for the review! |
@danielsoneg I'm ok with fixing the BED bug now and doing all of the other changes in separate smaller PRs. Since this PR adds tests with an example BED file, those tests should drop the current header line and then we can add the I sent a Zoom meeting link to your Twitter DMs and will be available there until 3:30, if you want to chat. |
So the BED file in the TB build test has a header (which, in retrospect, was the one I built my example off) - so I've tweaked this a bit in that Pandas is now ignoring the header line, but also won't vomit if it gets a line that isn't all integers. Happy to just scrap that and update the test file if that's all a big mistake. |
Codecov Report
@@ Coverage Diff @@
## master #493 +/- ##
=========================================
Coverage ? 18.65%
=========================================
Files ? 31
Lines ? 5061
Branches ? 1282
=========================================
Hits ? 944
Misses ? 4090
Partials ? 27 Continue to review full report at Codecov.
|
Description of proposed changes
Add FASTA handling to
augur mask
and two additional functions toaugur.utils
to deduplicate some code.Note this is Just the FASTA handling for
augur mask
- I'll add the nonsense site handling and the rest of the capabilities from ncov'smask-alignment.py
next.Related issue(s)
Fixes #148
Testing
Ran the full test suite as well as wrote a lot of additional tests for augur.mask
However, as usual, please a) sanity check and b) actually test this with live files. I've got the files in the tests/build directory to work with, but that's about it.
Notes
This is a pretty big overhaul because
augur.mask
was written solely to handle .vcf files before, so I had to refactor most of the file before I could even start. There's no other new functionality except a debug flag - this is just adding masking to FASTA files.The VCF-specific parts have been moved to their own functions, and I've ported the FASTA masking from
augur.tree
in. I added tests for this file, as well as a couple utility functions that I've already had to rewrite half a dozen times in various files.After this is accepted, I'd like to add the nonsense site masking from #72 as well as the
--mask-from-beginning
and--mask-from-end
functionality from the ncovmask-alignment
script, but I wanted to get the basic rework done first.There's also a bunch of parts of this around error handling and validating inputs I'd like to change as well, but again, I wanted to get the basic refactor in first. This doesn't introduce any regressions and is already a large enough PR.
Thank you for contributing to Nextstrain!