Fix handling of missing data in metadata #758
Merged
Description of proposed changes
Fixes an issue with metadata parsing (#757) where pandas parses missing data as `NaN` values, and these `float` values cause code that expects strings (e.g., `is_date_ambiguous`) to break.

This PR resolves the issue by restoring Augur's original behavior, which was to fill all missing values in a data frame with empty strings. However, we use a slightly different approach here: asking pandas not to parse missing data at all through the `na_filter=False` argument. This argument has the same effect as the previous implementation, but it only needs to be written once to apply to all calls of `read_metadata`, whereas the `fillna` approach would need to be applied to data frames or data frame chunks in different contexts.

To recreate the original issue, this PR updates the functional tests for augur filter to include a metadata record with a missing date column and updates the other parts of the functional tests accordingly.
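
As a rough illustration of the difference between the two approaches, the sketch below reads a small in-memory TSV three ways. The sample records and the `contains_uncertainty` helper are made up for this example and are not Augur's actual test data or its `is_date_ambiguous` implementation; only `pandas.read_csv`, its `na_filter` argument, and `DataFrame.fillna` are real pandas APIs.

```python
import io

import pandas as pd

# Hypothetical metadata with a missing date for SEQ_2.
TSV = "strain\tdate\tregion\nSEQ_1\t2020-03-01\tEurope\nSEQ_2\t\tAsia\n"


def contains_uncertainty(date):
    """Hypothetical stand-in for string-based date checks: expects a str."""
    return "X" in date


# Default parsing: the empty field becomes a missing value (NaN).
default = pd.read_csv(io.StringIO(TSV), sep="\t")

# Approach in this PR: ask pandas not to detect missing values at all.
unfiltered = pd.read_csv(io.StringIO(TSV), sep="\t", na_filter=False)

# Previous approach: parse missing values, then fill them afterwards.
filled = default.fillna("")

missing_default = default.loc[1, "date"]
missing_unfiltered = unfiltered.loc[1, "date"]
missing_filled = filled.loc[1, "date"]

# String operations fail on the NaN value but work on empty strings.
try:
    contains_uncertainty(missing_default)
    default_breaks = False
except TypeError:
    default_breaks = True

print(repr(missing_default), repr(missing_unfiltered), repr(missing_filled))
print("default parsing breaks string check:", default_breaks)
```

With `na_filter=False` the empty-string behavior is set once at read time, so every data frame or chunk produced by the reader inherits it without a separate `fillna` call.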
After addressing issue #757, the updated metadata in our functional tests revealed a previously hidden bug (from v12.0.0 and prior) where grouping by `year` in augur filter would include strains with missing dates as a separate additional group with a missing year value. The original code used a `continue` statement that was intended to continue to the next strain, but because this statement was inside another for loop, it only continued to the next group and didn't actually skip the problematic strain.

Therefore, this PR also fixes the previously hidden issue by assigning an explicit boolean variable that tracks whether a strain should be skipped or not. We set this variable to `True` when we can't parse a year from the strain's date string, print a clearer warning message to stderr, and break from the loop (instead of continuing). This PR updates the functional tests to reflect this new output to stderr and the highest priority strains that should be included from each group.

Related issue(s)
Fixes #757
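
The `continue` bug in the year grouping described above can be sketched in isolation. This is a simplified, hypothetical reconstruction rather than the actual augur filter code: the function names, the metadata layout, and the warning text are assumptions made for illustration.

```python
import sys
from collections import defaultdict


def group_strains_buggy(metadata, group_by):
    """Pre-fix pattern: `continue` only advances the inner loop."""
    groups = defaultdict(list)
    for strain, record in metadata.items():
        group = []
        for column in group_by:
            if column == "year":
                try:
                    group.append(int(record["date"].split("-")[0]))
                except ValueError:
                    # Intended to skip the strain, but this only skips the column.
                    continue
            else:
                group.append(record[column])
        # The strain is still added, under a group that lacks a year.
        groups[tuple(group)].append(strain)
    return dict(groups)


def group_strains_fixed(metadata, group_by):
    """Post-fix pattern: an explicit flag tracks whether to skip the strain."""
    groups = defaultdict(list)
    for strain, record in metadata.items():
        skip_strain = False
        group = []
        for column in group_by:
            if column == "year":
                try:
                    group.append(int(record["date"].split("-")[0]))
                except ValueError:
                    print(
                        f"WARNING: no valid year in date {record['date']!r}, "
                        f"skipping strain {strain!r}",
                        file=sys.stderr,
                    )
                    skip_strain = True
                    break
            else:
                group.append(record[column])
        if skip_strain:
            continue  # now in the outer loop, so the strain really is skipped
        groups[tuple(group)].append(strain)
    return dict(groups)


# Hypothetical records: SEQ_2 has a missing date (an empty string).
metadata = {
    "SEQ_1": {"country": "USA", "date": "2020-01-15"},
    "SEQ_2": {"country": "USA", "date": ""},
}
buggy = group_strains_buggy(metadata, ["country", "year"])
fixed = group_strains_fixed(metadata, ["country", "year"])
print(buggy)
print(fixed)
```

In the buggy version, the strain without a date ends up in its own group with no year component; in the fixed version, it is reported on stderr and left out of every group.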
Testing