fix: prevent multiple credible filters to override spark plan #766

Merged: 11 commits merged into dev from do_fix_credible_set_filter_issue on Sep 17, 2024

Conversation

@d0choa (Collaborator) commented Sep 16, 2024

Because filter_credible_set was applied in place on the same object, two subsequent calls on that object would filter the same dataframe cumulatively. That is not the expected behaviour: the same call on the same object could return different results depending on what was filtered before, as illustrated here:

In [53]: data_sl = StudyLocus(
    ...:         _df=spark.createDataFrame(observed, schema), _schema=StudyLocus.get_schema()
    ...:     )

In [54]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+


In [55]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+


In [56]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+
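
For context, this symptom is what you would see if filter_credible_set reassigned the object's own dataframe and returned self. Below is a simplified, hypothetical sketch of that in-place pattern (the MutableStudyLocus class and the string column argument are illustrative only, not the actual gentropy implementation):

```python
from dataclasses import dataclass

import pyspark.sql.functions as f
from pyspark.sql import DataFrame


@dataclass
class MutableStudyLocus:
    """Simplified stand-in for StudyLocus whose filter mutates the object."""

    df: DataFrame

    def filter_credible_set(self, credible_set_column: str) -> "MutableStudyLocus":
        # Keep only the locus entries flagged as part of the requested credible set.
        # Reassigning self.df means every later call starts from an already
        # filtered dataframe, so the result depends on previous calls.
        self.df = self.df.withColumn(
            "locus",
            f.filter("locus", lambda tag: tag[credible_set_column]),
        )
        return self
```

With this pattern, calling filter_credible_set("is95CredibleSet") and then filter_credible_set("is99CredibleSet") on the same object yields the intersection of both filters, which matches the missing tagVariantB in the last output above.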

This PR creates a new StudyLocus object every time filter_credible_set is called, instead of mutating the existing one. This is the behaviour after the change:

In [9]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f.
   ...: explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+

In [10]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
+-----------+--------------------+---------------+---------------+


In [11]: data_sl.annotate_credible_sets().filter_credible_set(CredibleInterval.IS99).df.withColumn("locus", f
    ...: .explode("locus")).select("locus.*").show()
+-----------+--------------------+---------------+---------------+              
|  variantId|posteriorProbability|is95CredibleSet|is99CredibleSet|
+-----------+--------------------+---------------+---------------+
|tagVariantE|                 0.5|           true|           true|
|tagVariantA|                0.44|           true|           true|
|tagVariantC|                0.04|           true|           true|
|tagVariantB|               0.015|          false|           true|
+-----------+--------------------+---------------+---------------+

Tests were passing because each test created a fresh dataframe, so the in-place filtering never accumulated across calls.
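
A minimal sketch of the fixed pattern described above, again with a simplified hypothetical class rather than the actual gentropy code: the filtered dataframe is wrapped in a brand-new object, so the original object's dataframe (and its Spark plan) is left untouched.

```python
from dataclasses import dataclass

import pyspark.sql.functions as f
from pyspark.sql import DataFrame


@dataclass
class ImmutableStudyLocus:
    """Simplified stand-in for StudyLocus that returns a new object per filter."""

    df: DataFrame

    def filter_credible_set(self, credible_set_column: str) -> "ImmutableStudyLocus":
        # Build the filtered dataframe and wrap it in a fresh object; self.df is
        # never reassigned, so repeated calls on the same object are independent.
        filtered = self.df.withColumn(
            "locus",
            f.filter("locus", lambda tag: tag[credible_set_column]),
        )
        return ImmutableStudyLocus(df=filtered)
```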

github-actions bot added the bug (Something isn't working), size-S, and Dataset labels on Sep 16, 2024
@project-defiant (Contributor) left a comment

General question about this PR: since Spark dataframes are immutable, we might enforce directly on the Dataset that .df cannot be set outside of the constructor(s).

Currently the .df property is mutable, which could lead to cases similar to the one described here.

I think this topic is open for discussion, but we should definitely enforce one approach in our code, not both.
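
One way this suggestion could look (a hypothetical sketch only; attribute and property names are illustrative, not the current gentropy Dataset API): expose df as a read-only property backed by a private attribute set in the constructor, so the only way to get different data is to build a new Dataset.

```python
from dataclasses import dataclass

from pyspark.sql import DataFrame


@dataclass
class Dataset:
    """Dataset whose dataframe can only be supplied through the constructor."""

    _df: DataFrame

    @property
    def df(self) -> DataFrame:
        # Read-only view: there is intentionally no setter, so methods like
        # filter_credible_set must return a new Dataset instead of mutating self.
        return self._df
```

With this in place, an assignment such as data_sl.df = other_df would raise an AttributeError at runtime instead of silently rewiring the object's Spark plan.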

@d0choa (Collaborator, Author) commented Sep 17, 2024

As discussed, this was not causing any issues in the pipeline, but it could easily cause problems.
I'm fully on board with making the .df property immutable, but I would handle that in a separate PR.

@project-defiant (Contributor) left a comment

The wandb version was bumped. Was that on purpose? Judging by the commit names, I guess not :)

pyproject.toml (review comment, outdated, resolved)
@project-defiant (Contributor) commented Sep 17, 2024

There is still some change in poetry.lock, but I assume it's due to the poetry version :)

LGTM

@d0choa
Copy link
Collaborator Author

d0choa commented Sep 17, 2024

I aligned the poetry version. One line changed in the lock, and I'm not sure what caused it. It might even be linked to machine architecture. We can discuss it; I don't think it should be a blocker.

I'm merging now... this has been more painful than it should have been.

d0choa merged commit 6ede736 into dev on Sep 17, 2024 (4 checks passed).
d0choa deleted the do_fix_credible_set_filter_issue branch on September 17, 2024 at 16:45.
Labels: bug (Something isn't working), Dataset, size-S
3 participants