Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of missing data in metadata #758

Merged
merged 2 commits into from
Aug 13, 2021
Merged

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Aug 13, 2021

Description of proposed changes

Fixes an issue with metadata parsing (#757) where:

  1. pandas treats missing data (e.g., empty strings) as NaN values
  2. missing date values (which appear in some GenBank SARS-CoV-2 records) get interpreted as float values
  3. and downstream functions that expect strings (like is_date_ambiguous) break.

This PR resolves the issue by restoring Augur's original behavior which was to fill all missing values in a data frame with empty strings. However, we use a slightly different approach here of asking pandas not to parse missing data at all through the na_filter=False argument. This argument has the same effect as the previous implementation, but it only needs to be written once to apply for all calls of read_metadata, whereas the fillna approach would need to be applied to data frames or data frame chunks in different contexts.

To recreate the original issue, this PR updates the functional tests for augur filter to include a metadata record with a missing date column and updates the other parts of the functional tests accordingly.

After addressing issue #757, the updated metadata in our functional tests revealed a previously hidden bug (from v12.0.0 and prior) where grouping by year in augur filter would include strains with missing dates as a separate additional group with a missing year value. The original code used a continue statement that was intended to continue to the next strain, but because this statement was inside another for loop, it only continued to the new group and didn't actually skip the problematic strain.

Therefore, this PR also fixes the previously hidden issue by assigning an explicit boolean variable that tracks whether a strain should be skipped or not. We assign this variable to True when we can't parse a year from the strain's date string, print a clearer warning message to stderr, and break from the loop (instead of continuing). This PR updates the functional tests to reflect this new output to stderr and the highest priority strains that should be included from each group.

Related issue(s)

Fixes #757

Testing

  • Adds functional tests to catch the original issues and CI confirms that these pass

Fixes an issue with metadata parsing where pandas treats missing
data (e.g., empty strings) as `NaN` values, missing date values (which
appear in some GenBank SARS-CoV-2 records) get interpretted as `float`
values, and downstream functions that expect strings (like
`is_date_ambiguous`) break. This commit resolves the issue by restoring
Augur's original behavior which was to fill all missing values in a data
frame with empty strings [1]. However, we use a slightly different
approach here of asking pandas not to parse missing data at all through
the `na_filter=False` argument. This argument has the same effect as the
previous implementation, but it only needs to be written once to apply
for all calls of `read_metadata`, whereas the `fillna` approach would
need to be applied to data frames or data frame chunks in different
contexts.

To recreate the original issue, this commit updates the functional tests
for augur filter to include a metadata record with a missing date
column and updates the other parts of the functional tests accordingly.

[1] https://github.com/nextstrain/augur/blob/15dfc8b/augur/util_support/metadata_file.py#L97
Fixes a previously hidden bug where grouping by `year` in augur filter
would include strains with missing dates as a separate additional group
with a missing year value. The original code used a `continue` statement
that was intended to continue to the next strain, but because this
statement was inside another for loop, it only continued to the new
group.

This commit fixes the issue by assigning an explicit boolean variable
that tracks whether a strain should be skipped or not. We assign this
variable to `True` when we can't parse a year from the strain's date
string, print a clearer warning message to stderr, and break from the
loop (instead of continuing).

This commit includes updates to the functional tests (that originally
caught this previously hidden bug!) to reflect this new output to stderr
and the highest priority strains that should be included from each
group.
@huddlej huddlej added this to the Patch release 12.1.1 milestone Aug 13, 2021
@codecov
Copy link

codecov bot commented Aug 13, 2021

Codecov Report

Merging #758 (beb93cd) into master (eb89c09) will decrease coverage by 0.00%.
The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
- Coverage   33.55%   33.55%   -0.01%     
==========================================
  Files          41       41              
  Lines        5868     5871       +3     
  Branches     1418     1419       +1     
==========================================
+ Hits         1969     1970       +1     
- Misses       3815     3816       +1     
- Partials       84       85       +1     
Impacted Files Coverage Δ
augur/io.py 100.00% <ø> (ø)
augur/filter.py 67.52% <33.33%> (-0.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb89c09...beb93cd. Read the comment docs.

@huddlej huddlej merged commit ae80716 into master Aug 13, 2021
@huddlej huddlej deleted the fix-filter-date-parsing branch August 13, 2021 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

filter: Exclude by ambiguous date fails on float input
1 participant