Feat/probabilistic sampling #629
Conversation
This PR relates to a couple of bigger issues for Augur and ncov builds, so I’ve tried to summarize the context of the PR as I understand it and also provide some alternate approaches for the ncov builds specifically. I also included some thoughts about the specific implementation.
Context and summary
The context for this feature was that we were trying to subsample early North American strains in ncov builds grouping by division, year, and month and only allowing 300 max sequences. There are more combinations of divisions, years, and months (>400) than the number of sequences allowed, so the current subsampling algorithm fails because it cannot pick fewer than 1 sequence per group.
This PR’s feature allows users to specify a maximum number of sequences that is less than the total number of groups by allowing a fractional number of sequences per group and sampling each group’s count from a Poisson distribution whose mean is that fraction. This approach allows random groups to have zero strains selected in the subsampling, with the average total number of strains sampled equal to the maximum number of sequences allowed.
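To make the mechanics concrete, here is a minimal sketch of that sampling scheme, not the PR's actual code; the group count and random seed below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

max_sequences = 300  # requested maximum number of sequences overall
n_groups = 415       # e.g., distinct (division, year, month) combinations

# With more groups than allowed sequences, the per-group target is fractional.
mean_per_group = max_sequences / n_groups  # ~0.72 sequences per group

# Draw an integer count for each group from a Poisson distribution with that
# fractional mean: many groups get 0 sequences, but the expected total across
# all groups equals max_sequences.
sequences_per_group = rng.poisson(lam=mean_per_group, size=n_groups)
print(sequences_per_group.sum())  # close to 300 on average
```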
Thoughts about subsampling
Our current subsampling approach treats each group (e.g., each combination of division, year, and month) as an equally important, discrete bin, but not all of these bins are equally represented and not all may be equally important.
For example, of the 415+ division/year/month groups, most bins do not have data. We can see this in a punch card plot of the number of samples per year/month and division (a subset of this bigger image is included below). When we group by division, year, and month, we try to sample evenly from all of these bins even though many are nearly empty (e.g., Sinaloa, Montana, Alabama). In this PR, the probabilistic subsampling is as likely to drop these nearly empty bins as it is to drop the larger bins (Washington, California, etc.).
Since geographic representation is already so biased, what if we preferred sampling evenly over time instead of over both time and space? This would mean setting the `group_by` field to just year and month. Then subsampling would randomly select at most 300 sequences across the year/month bins, with a higher probability of sampling from geographic regions that are already well represented.
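For a rough sense of how the two `group_by` choices differ, a quick check like the following counts the bins each grouping produces (assuming the same metadata columns as the plotting code at the end of this comment):

```python
import pandas as pd

# Same metadata preparation as the plotting snippet at the end of this comment.
df = pd.read_csv("data/metadata.tsv", sep="\t")
df = df[~df["date"].str.contains("X")].copy()  # drop ambiguous dates
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df = df[df["region"] == "North America"].copy()

# Division/year/month produces hundreds of mostly sparse bins, while
# year/month alone leaves far fewer, better-populated bins.
print(df.groupby(["division", "year", "month"]).ngroups)  # >400
print(df.groupby(["year", "month"]).ngroups)
```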
We used a similar approach for the SPHERES USA builds. To minimize biased geographic sampling, we also prioritized contextual samples based on their genetic similarity to the focal set. In this way, samples from poorly represented states that are similar to the focal set would get included.
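As a sketch of that prioritization idea (not the actual SPHERES implementation; the helper functions and the assumption of pre-aligned, equal-length sequences are mine):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatched, unambiguous positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b) if x != "N" and y != "N")

def priority(context_seq: str, focal_seqs: list) -> float:
    """Score a contextual sequence by its distance to the closest focal
    sequence; higher (less negative) scores get sampled first."""
    return -min(hamming_distance(context_seq, focal) for focal in focal_seqs)
```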
I’m proposing this change to the subsampling approach as an alternative to the feature in this PR, but that doesn’t mean we won’t eventually need something like this feature.
Thoughts about the implementation of probabilistic sampling
The code in this implementation works for me and, after staring at it for a little bit, it makes a lot of sense. This seems like a great solution to the problem. I did notice that the most important lines of `_calculate_sequences_per_group` now use one-line boolean tests for `allow_fractional`. This constant checking of the output type makes the intermediate states of the code harder to track in my head.
The `_calculate_sequences_per_group` function now also sometimes returns a float and sometimes an integer (even though the type hinting says it will always be a float). It might be easier to understand this code if we wrote the fractional calculation as a separate function and called it as needed from `filter.py`.
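As a sketch of what that split might look like; the fractional function's name matches the refactor described later in this thread, but the binary-search logic here is illustrative rather than the PR's exact algorithm:

```python
import math
from typing import Iterable

def _calculate_fractional_sequences_per_group(target_max_value: int,
                                              group_sizes: Iterable[int]) -> float:
    """Binary-search a fractional per-group allowance whose expected total,
    capped by each group's actual size, stays at or under the target."""
    sizes = list(group_sizes)
    lo, hi = 1e-5, float(target_max_value)
    for _ in range(100):  # plenty of iterations to converge
        mid = (lo + hi) / 2
        if sum(min(mid, size) for size in sizes) > target_max_value:
            hi = mid
        else:
            lo = mid
    return lo

def _calculate_sequences_per_group(target_max_value: int,
                                   group_sizes: Iterable[int]) -> int:
    """The deterministic variant always returns a whole number and, as noted
    above, can never allocate fewer than one sequence per group."""
    return max(1, math.floor(
        _calculate_fractional_sequences_per_group(target_max_value, group_sizes)))
```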
If we feel like this type of functionality is exceeding the domain of `augur filter`, we could consider refactoring this code into a new `augur subsample` command (in a separate PR). We could also consider using existing alternative software like QIIME's genome-sampler, which also tries to sample evenly across time and genetic diversity.
I would be interested in opinions from others on the team about all of the above.
Code used to generate data for the plot above
The data I plotted above are available as a gist. Here is the code I used to generate the CSV file:
```python
import datetime

import pandas as pd

# Load the full metadata and drop records with ambiguous dates (encoded with "X").
df = pd.read_csv("data/metadata.tsv", sep="\t")
df = df[~df["date"].str.contains("X")].copy()

# Parse dates and extract year and month columns for grouping.
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Count North American samples per division, year, and month.
df = df[df["region"] == "North America"].copy()
counts = df.groupby(["division", "year", "month"])["strain"].count().reset_index()

# Add a first-of-the-month date column for plotting, then save the counts.
counts["date"] = counts.apply(lambda row: datetime.date(row["year"], row["month"], 1), axis=1)
counts.to_csv("north_america_counts_by_division_year_and_month.csv", index=False)
```
Thanks for the comments and summary, @huddlej! A few quick remarks:
I'd like to bump this, as we ran into this again with the builds yesterday. I agree that we may want to revisit a more general discussion of how subsampling should happen, but until then, having some kind of fix so that we don't run into the issue again would be great. (But then I'm going to contradict myself and just give a little feedback on the genetic distance + time filtering:)
That's really interesting how the regional builds turn into mini-global builds with the genetic distance and time filtering approach. It makes sense that we want as much flexibility as possible to define our subsampling approaches for all of these different scenarios. I'm still catching up with Nextstrain PRs, but I can prototype the function refactoring I mentioned above this Friday to see whether it improves readability.
Splits the original `_calculate_sequences_per_group` function that supported deterministic and probabilistic sampling into two separate functions to clarify both the function interfaces and the readability of the code. Updates tests for the new `_calculate_fractional_sequences_per_group` function, moves the Snakemake functional test into a cram test, and adds additional functional tests for augur filter with subsampling.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   28.68%   28.76%   +0.07%
==========================================
  Files          39       39
  Lines        5389     5409      +20
  Branches     1326     1332       +6
==========================================
+ Hits         1546     1556      +10
- Misses       3785     3795      +10
  Partials       58       58
```
Continue to review full report at Codecov.
@rneher and @emmahodcroft, I just pushed commit e4d520c that refactors the logic for calculating sequences per group into separate functions. We could do more to make this interface more generic, but I'm happy with at least having these functions consistently return the expected types and having some minimal tests to confirm they work.
This is working well for me locally.
^ I think that more often than not, prioritizing genetically similar background strains is confusing for regional builds. I think it makes sense when trying to do targeted genomic epi for, say, a cluster in Washington State, but it doesn't make sense for the big-picture surveillance builds that we're doing at the regional level. I'd like to do a direct comparison of a flat background vs. a prioritized background to investigate this.
I think it would be good to do a direct comparison and see. The idea here is that connections between the focal region and another region are more likely to be maintained, which can be very important, whereas random sampling might not maintain these. But this is the intention, and I'd be happy for a check on whether it is functioning as well as intended! It might be good to pick out a few situations we can check specifically.
I don't think this is the right place for this discussion. This (closed) PR doesn't have anything to do with the priorities we use in …
Description of proposed changes
Allow probabilistic sampling of sequences when there are many more sampling categories than target sequences. This PR picks from these categories with some probability such that, on average, the requested number of sequences is selected.