Fix handling of missing data in metadata #758

huddlej · 2021-08-13T20:46:02Z

Description of proposed changes

Fixes an issue with metadata parsing (#757) where:

pandas treats missing data (e.g., empty strings) as NaN values
missing date values (which appear in some GenBank SARS-CoV-2 records) get interpreted as float values
and downstream functions that expect strings (like is_date_ambiguous) break.

This PR resolves the issue by restoring Augur's original behavior which was to fill all missing values in a data frame with empty strings. However, we use a slightly different approach here of asking pandas not to parse missing data at all through the na_filter=False argument. This argument has the same effect as the previous implementation, but it only needs to be written once to apply for all calls of read_metadata, whereas the fillna approach would need to be applied to data frames or data frame chunks in different contexts.

To recreate the original issue, this PR updates the functional tests for augur filter to include a metadata record with a missing date column and updates the other parts of the functional tests accordingly.

After addressing issue #757, the updated metadata in our functional tests revealed a previously hidden bug (from v12.0.0 and prior) where grouping by year in augur filter would include strains with missing dates as a separate additional group with a missing year value. The original code used a continue statement that was intended to continue to the next strain, but because this statement was inside another for loop, it only continued to the new group and didn't actually skip the problematic strain.

Therefore, this PR also fixes the previously hidden issue by assigning an explicit boolean variable that tracks whether a strain should be skipped or not. We assign this variable to True when we can't parse a year from the strain's date string, print a clearer warning message to stderr, and break from the loop (instead of continuing). This PR updates the functional tests to reflect this new output to stderr and the highest priority strains that should be included from each group.

Related issue(s)

Fixes #757

Testing

Adds functional tests to catch the original issues and CI confirms that these pass

Fixes an issue with metadata parsing where pandas treats missing data (e.g., empty strings) as `NaN` values, missing date values (which appear in some GenBank SARS-CoV-2 records) get interpretted as `float` values, and downstream functions that expect strings (like `is_date_ambiguous`) break. This commit resolves the issue by restoring Augur's original behavior which was to fill all missing values in a data frame with empty strings [1]. However, we use a slightly different approach here of asking pandas not to parse missing data at all through the `na_filter=False` argument. This argument has the same effect as the previous implementation, but it only needs to be written once to apply for all calls of `read_metadata`, whereas the `fillna` approach would need to be applied to data frames or data frame chunks in different contexts. To recreate the original issue, this commit updates the functional tests for augur filter to include a metadata record with a missing date column and updates the other parts of the functional tests accordingly. [1] https://github.com/nextstrain/augur/blob/15dfc8b/augur/util_support/metadata_file.py#L97

Fixes a previously hidden bug where grouping by `year` in augur filter would include strains with missing dates as a separate additional group with a missing year value. The original code used a `continue` statement that was intended to continue to the next strain, but because this statement was inside another for loop, it only continued to the new group. This commit fixes the issue by assigning an explicit boolean variable that tracks whether a strain should be skipped or not. We assign this variable to `True` when we can't parse a year from the strain's date string, print a clearer warning message to stderr, and break from the loop (instead of continuing). This commit includes updates to the functional tests (that originally caught this previously hidden bug!) to reflect this new output to stderr and the highest priority strains that should be included from each group.

codecov · 2021-08-13T20:47:31Z

Codecov Report

Merging #758 (beb93cd) into master (eb89c09) will decrease coverage by 0.00%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
- Coverage   33.55%   33.55%   -0.01%     
==========================================
  Files          41       41              
  Lines        5868     5871       +3     
  Branches     1418     1419       +1     
==========================================
+ Hits         1969     1970       +1     
- Misses       3815     3816       +1     
- Partials       84       85       +1

Impacted Files	Coverage Δ
augur/io.py	`100.00% <ø> (ø)`
augur/filter.py	`67.52% <33.33%> (-0.21%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb89c09...beb93cd. Read the comment docs.

huddlej added 2 commits August 13, 2021 13:24

huddlej added this to the Patch release 12.1.1 milestone Aug 13, 2021

huddlej merged commit ae80716 into master Aug 13, 2021

huddlej deleted the fix-filter-date-parsing branch August 13, 2021 21:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix handling of missing data in metadata #758

Fix handling of missing data in metadata #758

huddlej commented Aug 13, 2021

codecov bot commented Aug 13, 2021 •

edited

Loading

Fix handling of missing data in metadata #758

Fix handling of missing data in metadata #758

Conversation

huddlej commented Aug 13, 2021

Description of proposed changes

Related issue(s)

Testing

codecov bot commented Aug 13, 2021 • edited Loading

Codecov Report

codecov bot commented Aug 13, 2021 •

edited

Loading