fix: prevent multiple credible filters to override spark plan #766

Merged: 11 commits merged into dev from do_fix_credible_set_filter_issue on Sep 17, 2024

Conversation

@d0choa (Collaborator) commented Sep 16, 2024

Because filter_credible_set was applied in place on the same object, two subsequent calls on that object would filter the same dataframe cumulatively. That is not the expected behaviour: the same call on the same object could return different results depending on what was filtered before, as illustrated here:

In [53]: data_sl = StudyLocus(
    ...:         _df=spark.createDataFrame(observed, schema), _schema=StudyLocus.get_schema()
    ...:     )

In [54]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+


In [55]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+


In [56]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+
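
For context, this symptom is what you would see if filter_credible_set reassigned the object's own dataframe and returned self. Below is a simplified, hypothetical sketch of that in-place pattern (the MutableStudyLocus class and the string column argument are illustrative only, not the actual gentropy implementation):

```python
from dataclasses import dataclass

import pyspark.sql.functions as f
from pyspark.sql import DataFrame


@dataclass
class MutableStudyLocus:
    """Simplified stand-in for StudyLocus whose filter mutates the object."""

    df: DataFrame

    def filter_credible_set(self, credible_set_column: str) -> "MutableStudyLocus":
        # Keep only the locus entries flagged as part of the requested credible set.
        # Reassigning self.df means every later call starts from an already
        # filtered dataframe, so the result depends on previous calls.
        self.df = self.df.withColumn(
            "locus",
            f.filter("locus", lambda tag: tag[credible_set_column]),
        )
        return self
```

With this pattern, calling filter_credible_set("is95CredibleSet") and then filter_credible_set("is99CredibleSet") on the same object yields the intersection of both filters, which matches the missing tagVariantB in the last output above.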

This PR creates a new StudyLocus object every time filter_credible_set is called, instead of mutating the existing one. This is the behaviour after the change:

In [9]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f.
   ...: explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+

In [10]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+


In [11]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+

Tests were passing because each test created a fresh dataframe, so the in-place filtering never accumulated across calls.
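
A minimal sketch of the fixed pattern described above, again with a simplified hypothetical class rather than the actual gentropy code: the filtered dataframe is wrapped in a brand-new object, so the original object's dataframe (and its Spark plan) is left untouched.

```python
from dataclasses import dataclass

import pyspark.sql.functions as f
from pyspark.sql import DataFrame


@dataclass
class ImmutableStudyLocus:
    """Simplified stand-in for StudyLocus that returns a new object per filter."""

    df: DataFrame

    def filter_credible_set(self, credible_set_column: str) -> "ImmutableStudyLocus":
        # Build the filtered dataframe and wrap it in a fresh object; self.df is
        # never reassigned, so repeated calls on the same object are independent.
        filtered = self.df.withColumn(
            "locus",
            f.filter("locus", lambda tag: tag[credible_set_column]),
        )
        return ImmutableStudyLocus(df=filtered)
```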

github-actions bot added the bug (Something isn't working), size-S, and Dataset labels on Sep 16, 2024
@project-defiant (Contributor) left a comment

General question about this PR: since Spark dataframes are immutable, we might enforce directly on the Dataset that .df cannot be set outside of the constructor(s).

Currently the .df property is mutable, which could lead to cases similar to the one described here.

I think this topic is open for discussion, but we should definitely enforce one approach in our code, not both.
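
One way this suggestion could look (a hypothetical sketch only; attribute and property names are illustrative, not the current gentropy Dataset API): expose df as a read-only property backed by a private attribute set in the constructor, so the only way to get different data is to build a new Dataset.

```python
from dataclasses import dataclass

from pyspark.sql import DataFrame


@dataclass
class Dataset:
    """Dataset whose dataframe can only be supplied through the constructor."""

    _df: DataFrame

    @property
    def df(self) -> DataFrame:
        # Read-only view: there is intentionally no setter, so methods like
        # filter_credible_set must return a new Dataset instead of mutating self.
        return self._df
```

With this in place, an assignment such as data_sl.df = other_df would raise an AttributeError at runtime instead of silently rewiring the object's Spark plan.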

@d0choa (Collaborator, Author) commented Sep 17, 2024

As discussed, this was not causing any issues in the pipeline, but it could easily cause problems.
I'm fully on board with making the .df property immutable, but I would handle that in a separate PR.

@project-defiant (Contributor) left a comment

The wandb version was bumped. Was that on purpose? Judging by the commit names, I guess not :)

pyproject.toml (review comment, outdated, resolved)
@project-defiant (Contributor) commented Sep 17, 2024

There is still some change in poetry.lock, but I assume it's due to the poetry version :)

LGTM

@d0choa
Copy link
Collaborator Author

d0choa commented Sep 17, 2024

I aligned the poetry version. One line changed in the lock, and I'm not sure what caused it. It might even be linked to machine architecture. We can discuss it; I don't think it should be a blocker.

I'm merging now... this has been more painful than it should have been.

d0choa merged commit 6ede736 into dev on Sep 17, 2024 (4 checks passed).
d0choa deleted the do_fix_credible_set_filter_issue branch on September 17, 2024 at 16:45.
Labels: bug (Something isn't working), Dataset, size-S
3 participants