filter: Support date offsets for min/max date thresholds #721

Closed
huddlej opened this issue May 10, 2021 · 2 comments · Fixed by #740

huddlej commented May 10, 2021

Context
One common pattern for Nextstrain analyses is the "latest" build for a given pathogen where strains in the tree were sampled from the last N months or years. These builds may explicitly encode the recent timespan they represent (as in the "resolutions" of seasonal flu builds like the 2y H3N2 build or the "4m" build for SARS-CoV-2 in Washington State).

Other analyses implicitly encode their dependence on date offsets in the subsampling rules used to select sequences. For example, the Nextstrain SARS-CoV-2 regional builds define _early and _late subsampling schemes whose names are parsed by custom Snakemake rules.

All of these builds set a maximum date of "today" and dynamically define a specific minimum date based on some desired date offset.

Description
To allow users/workflows to filter data based on relative dates as in the builds above, Augur's filter command should support date offsets for its min/max date thresholds.

Possible solution

Date offsets could be implemented by overloading the existing interface so that --min-date and --max-date accept a date offset syntax in addition to the float and ISO-8601 date values they currently accept. Alternatively, we could add new arguments specifically for the offsets (e.g., --min-date-offset and --max-date-offset).

We should consider implementing date offsets using the ISO-8601 syntax for durations. Internally, we could use pandas's time delta interface which supports parsing ISO-8601 durations and mathematical operations between datetime objects and delta objects. (Also, pandas is already widely used in Augur.)

We probably do not need to include the negative operator in the offset syntax, since these offsets will always be relative to the current date. However, one could imagine allowing a combination of --max-date and --max-date-offset, for example, to allow users to override that upper limit.

When the user provides date offsets, Augur should report the dates it calculated and used for the filtering as part of the filter report written to standard out.
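
As a rough illustration of the offset handling and reporting described above, here is a minimal sketch that uses only the Python standard library. The function name `resolve_date_offset`, the accepted pattern (an optional leading "P", an integer, and one of Y, M, W, or D), and the report wording are assumptions made for this example; they are not Augur's actual interface. Because months and years have no fixed length, the sketch steps back on the calendar and clamps the day to the end of the target month instead of subtracting a fixed-length time delta.

```python
import calendar
import re
from datetime import date, timedelta

# Hypothetical syntax: optional "P", an integer, then a unit (Y, M, W, or D).
OFFSET_PATTERN = re.compile(r"^P?(\d+)([YMWD])$")


def resolve_date_offset(offset: str, today: date) -> date:
    """Return the date that lies `offset` before `today`."""
    match = OFFSET_PATTERN.match(offset.strip().upper())
    if match is None:
        raise ValueError(f"Unsupported date offset: {offset!r}")
    amount, unit = int(match.group(1)), match.group(2)

    # Days and weeks have a fixed length, so plain subtraction works.
    if unit == "D":
        return today - timedelta(days=amount)
    if unit == "W":
        return today - timedelta(weeks=amount)

    # Months and years vary in length, so step back on the calendar.
    months = amount * 12 if unit == "Y" else amount
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that, e.g., March 31 minus one month is February 28.
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Usage with a fixed "today" so the result is reproducible.
today = date(2021, 5, 10)
min_date = resolve_date_offset("2M", today)
max_date = resolve_date_offset("1D", today)
print(f"Filtering with min date {min_date} and max date {max_date}")
# prints: Filtering with min date 2021-03-10 and max date 2021-05-09
```

Passing `today` explicitly keeps the calculation testable; a real command would presumably default it to the current date.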

Examples

# Select strains collected no earlier than 2 months ago and no later than 1 day ago (for SARS-CoV-2 analysis).
augur filter --min-date "2M" --max-date "1D"

# Select strains collected no earlier than 2 years ago (for influenza A/H3N2 analysis).
augur filter --min-date "2Y"

huddlej added the enhancement (New feature or request) label May 10, 2021

huddlej commented May 10, 2021

Another example use case based on actual SARS-CoV-2 analysis parameters includes setting hard lower/upper limits on min/max date to exclude records with impossible collection dates (dates prior to the first known case of SARS-CoV-2, for example, or dates in the future). For example, we currently hardcode the default min date to 2019.74 and the max date to today.

If we did not provide an explicit value for max date, records with collection dates in the future could pass the filter. To avoid this issue with the syntax proposed above, we could set --max-date "0D" to get zero days in the past from today, although this syntax is somewhat strange. Alternatively, we could support a small set of human-readable text values (in addition to the ISO 8601 offsets) to allow arguments like --max-date today or --max-date yesterday. A stricter option would be to always exclude records with dates in the future.
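
To make these options concrete, here is a minimal sketch that combines the human-readable keywords with the stricter rule of never allowing a max date in the future. The function name `parse_max_date`, the keyword table, and the choice to clamp silently are all assumptions for illustration; none of this is taken from Augur itself.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical keywords, mapped to the number of days before today.
KEYWORDS = {"today": 0, "yesterday": 1}


def parse_max_date(value: Optional[str], today: date) -> date:
    """Resolve a max date value to a date that is never later than `today`."""
    if value is None:
        # No explicit value: default to today so future dates are excluded.
        return today
    text = value.strip().lower()
    if text in KEYWORDS:
        resolved = today - timedelta(days=KEYWORDS[text])
    else:
        # Otherwise expect an ISO 8601 calendar date such as 2021-01-15.
        resolved = date.fromisoformat(text)
    # Stricter option: clamp any future date back to today.
    return min(resolved, today)
```

With `today` fixed at 2021-05-10, `parse_max_date("yesterday", today)` resolves to 2021-05-09 and `parse_max_date("2030-01-01", today)` is clamped to 2021-05-10.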

huddlej commented Jun 18, 2021

Based on work in nextstrain/ncov#659, we should consider adding these same offsets to augur frequencies.
