-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix parsing of priorities tsv file to allow spaces in sequence IDs #668
fix parsing of priorities tsv file to allow spaces in sequence IDs #668
Conversation
When parsing priorities.tsv files, an error could be encountered if a sequence name contain spaces, as `.split()` breaks not only on the tabs delimiting the file, but any whitespace character. This repalces `.split()` with `.split('\t')` to address this and preserve sequence names containg spaces.
Thanks, @tomkinsc! I agree it's an oversight that I wonder if we can make this parsing a bit more backwards compat, while still making it more robust, by falling back to |
@tsibley historically, did any old versions of priorities.py write out space-delimited files (then yes, this would be a concern), or are you only referring to test input files (maybe backwards compatibility is less of an issue then?)? |
@dpark01 VCS history shows that |
…resent in the line in filter.py::read_priority_scores(), only split by tab if a tab is present in the line, to allow backward compatibility with space-delimited files
Great point about backward compatibility, @tsibley. Just added a conditional so it splits on tabs if they're present in a line, and otherwise does the generic whitespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, @tomkinsc! Would you mind adding a test to tests/test_filter.py
?
I'm not super pleased with the way test data get mocked up in that test module, but here is an example that works for me locally with your proposed bug fix:
@pytest.fixture
def mock_priorities_file_valid_with_spaces_and_tabs(mocker):
mocker.patch(
"builtins.open", mocker.mock_open(read_data="strain 1\t5\nstrain 2\t6\nstrain 3\t8\n")
)
class TestFilter:
def test_read_priority_scores_valid_with_spaces_and_tabs(self, mock_priorities_file_valid_with_spaces_and_tabs):
# builtins.open is stubbed, but we need a valid file to satisfy the existence check
priorities = augur.filter.read_priority_scores(
"tests/builds/tb/data/lee_2015.vcf"
)
assert priorities == {"strain 1": 5, "strain 2": 6, "strain 3": 8}
Thank you for writing up a fix for this bug!
add unit test for parsing tab-delimited priorities file, where column 1 has spaces in the values. Thanks to @huddlej for providing this.
@huddlej added that test, thanks! |
Description of proposed changes
When parsing priority tsv files, an error could be encountered if a sequence name contain spaces, as
.split()
breaks not only on the tabs delimiting the file, but any whitespace characters. This replaces.split()
with.split('\t')
to address this and preserve sequence names containing spaces.Related issue(s)
Flagged by @dpark01 as a problem.
A priorities tsv file of the following form:
Leads to the following failure:
Testing
What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
No changes made to the tests (so far).