Context
One common pattern for Nextstrain analyses is the "latest" build for a given pathogen where strains in the tree were sampled from the last N months or years. These builds may explicitly encode the recent timespan they represent (as in the "resolutions" of seasonal flu builds like the 2y H3N2 build or the "4m" build for SARS-CoV-2 in Washington State).
Other analyses implicitly encode their dependence on date offsets in the subsampling rules used to select sequences. For example, the Nextstrain SARS-CoV-2 regional builds define _early and _late subsampling schemes whose names are parsed by custom Snakemake rules.
All of these builds set a maximum date of "today" and dynamically define a specific minimum date based on some desired date offset.
Description
To allow users/workflows to filter data based on relative dates as in the builds above, Augur's filter command should support date offsets for its min/max date thresholds.
Possible solution
Date offsets could be implemented by overloading the existing interface so that --min-date and --max-date accept a date offset syntax in addition to the float and ISO-8601 date values they currently accept. Alternatively, we could add new arguments specifically for the offsets (e.g., --min-date-offset and --max-date-offset).
We probably do not need to include the negative operator in the offset syntax, since these offsets will always be relative to the current date. However, one could imagine allowing a combination of --max-date and --max-date-offset, for example, to allow users to override that upper limit.
When the user provides date offsets, Augur should report the dates it calculated and used for filtering as part of the filter report written to standard output.
Examples
# Select strains collected no earlier than 2 months ago and no later than 1 day ago (for SARS-CoV-2 analysis).
augur filter --min-date "2M" --max-date "1D"
# Select strains collected no earlier than 2 years ago (for influenza A/H3N2 analysis).
augur filter --min-date "2Y"
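As a rough illustration of how offsets like the ones in these examples might be resolved to concrete dates, here is a minimal sketch using only the Python standard library. The function name `resolve_date_offset`, the accepted units (D, W, M, Y), the optional leading `P`, and the rule of clamping the day when stepping back by calendar months are all assumptions made for illustration; this is not Augur's implementation.

```python
import calendar
import re
from datetime import date, timedelta

# Hypothetical offset syntax: an optional "P", a count, and one unit letter.
_OFFSET = re.compile(r"^P?(\d+)([DWMY])$", re.IGNORECASE)


def resolve_date_offset(offset: str, today: date) -> date:
    """Return the date that lies `offset` before `today`."""
    match = _OFFSET.match(offset.strip())
    if match is None:
        raise ValueError(f"Unsupported date offset: {offset!r}")

    amount = int(match.group(1))
    unit = match.group(2).upper()

    if unit == "D":
        return today - timedelta(days=amount)
    if unit == "W":
        return today - timedelta(weeks=amount)

    # Months and years vary in length, so step back by calendar months
    # and clamp the day to the length of the target month.
    months = amount if unit == "M" else 12 * amount
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Report the calculated thresholds, as the proposal suggests.
today = date(2021, 4, 30)  # fixed date so the example is reproducible
min_date = resolve_date_offset("2M", today)
max_date = resolve_date_offset("1D", today)
print(f"min date: {min_date.isoformat()}, max date: {max_date.isoformat()}")
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the calculation testable and makes the reported dates reproducible.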
Another example use case based on actual SARS-CoV-2 analysis parameters includes setting hard lower/upper limits on min/max date to exclude records with impossible collection dates (dates prior to the first known case of SARS-CoV-2, for example, or dates in the future). For example, we currently hardcode the default min date to 2019.74 and the max date to today.
If we did not provide an explicit value for max date, the max date could be in the future. To avoid this issue with the syntax proposed above, we could set --max-date "0D" to mean zero days before today. This is kind of strange syntax, though. Alternatively, we could support a small set of human-readable values (in addition to the ISO 8601 offsets) to allow arguments like --max-date today or --max-date yesterday. A stricter option would be to always exclude records with dates in the future.
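The keyword idea and the stricter "never in the future" option could be combined along these lines. This is only a sketch: the function name `resolve_max_date`, the keyword table, and the decision to clamp explicit dates to today are assumptions for illustration, and none of them are existing Augur behavior.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical human-readable values mapped to days before today.
KEYWORDS = {"today": 0, "yesterday": 1}


def resolve_max_date(value: Optional[str], today: date) -> date:
    """Resolve a max date argument to a concrete date, never later than today."""
    if value is None:
        # Stricter option: with no explicit max date, exclude future dates.
        return today

    key = value.strip().lower()
    if key in KEYWORDS:
        return today - timedelta(days=KEYWORDS[key])

    # Otherwise expect an ISO 8601 calendar date such as "2021-03-10",
    # clamped so that an explicit value cannot admit future dates either.
    return min(date.fromisoformat(value.strip()), today)


today = date(2021, 3, 15)  # fixed date so the example is reproducible
for argument in (None, "today", "yesterday", "2021-03-10", "2030-01-01"):
    print(argument, "->", resolve_max_date(argument, today).isoformat())
```

Whether an explicit future date should be clamped silently, as here, or rejected with an error is a design choice the thread leaves open.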
We should consider implementing date offsets using the ISO-8601 syntax for durations. Internally, we could use pandas's time delta interface, which supports parsing ISO-8601 durations and mathematical operations between datetime objects and delta objects. (Also, pandas is already widely used in Augur.)