Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributions could be refactored into classes #41

Open
leo-desbureaux-tellae opened this issue Mar 5, 2024 · 0 comments
Open

Distributions could be refactored into classes #41

leo-desbureaux-tellae opened this issue Mar 5, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@leo-desbureaux-tellae
Copy link
Contributor

leo-desbureaux-tellae commented Mar 5, 2024

Actual state

In the last state of Bhepop2, EnrichmentSource subclasses (for instance MarginalDistributions') describe the nature of the distributions they are manipulating (qualitative, quantitative with deciles).

This leads to code duplication, for instance when classes like QuantitativeGlobalDistribution also use quantitative distributions, but this time a single one, that describes all the population. Some of the code is then duplicated, like validation of the source data, feature states evaluation, or drawing a value in this feature state.

Proposition

A good implementation would seem to be that EnrichmentSource subclasses would only describe the structure of the distributions (ex: one for each modality) and how they are linked to the population (ex: applies to the individuals with an equal modality), independently of the type of distributions they contain.

class MarginalDistributions(EnrichmentSource):
    attribute_selection: list
    modalities: dict
    distributions: {(attribute, modality): Distribution}
    get_modality_distribution(attribute, modality)

Other classes would then describe the various kind of distributions: qualitative, quantitative deciles, we could even imagine using discrete probability distributions like Poisson. Specific functions, like evaluating an interval probability from a decile distribution by interpolation (interpolate_feature_prob) could me moved to their dedicated class, for instance DecilesDistribution

In the end, instead of having something like that

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    MarginalDistributions <|-- QualitativeMarginalDistributions
    MarginalDistributions <|-- QuantitativeMarginalDistributions
    
    
    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }

namespace Enrichment sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data
        +list feature_values
        +int nb_feature_values
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        <<Abstract>>
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
        -_validate_data_type()
    }

    class QualitativeMarginalDistributions{
        
    }

    class QuantitativeMarginalDistributions{
    
}
}
Loading

We would have something like that

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    EnrichmentSource *-- Distribution
    Distribution <|-- DecilesDistribution
    Distribution <|-- QualitativeDistribution
    
   namespace Enrichment { 
    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }
   }

namespace Sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data_distributions
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
    }
}

namespace Distributions {
    class Distribution{
        <<Abstract>>
        +any some_attributes
        +get_prob_of_feature_state(feature_state)
        +evaluate_distribution_on_population(column)
    }

    class QualitativeDistribution{
        +list values
        +list probs
    }

    class DecilesDistribution{
        +list deciles
    }
}
Loading

Related issues

This should solve at least partially the following issues:

Questions

Some points are still not clear about such implementation:

  • Where would be stored the feature states (the states we choose among when enriching a population, like intervals or categories)
    • These states are dependent on the distributions, but they can also be a combination of the states of several distributions, as in QuantitativeMarginalDistributions
    • Where would be given the parameters used to evaluate these states (abs_minimum and relative_maximum are Distribution wise, while delta_min is more generic).
  • Getting a value for a feature state (drawing in an interval for instance) is really linked to the nature of the feature state, not really to the Distribution that led to fix its bounds. So do we need to create FeatureState classes too ?
  • Where do analysis classes stand in all of that ? It seems that Distributions should implement ways to evaluate same nature distributions from a population, and ways to compare it with the source distribution, for instance with plots. And EnrichmentSource classes would provide a way to link these analysis together, for instance by looping other the modalities and creating a plot for each one.

Contribution

This work will likely take some time. Since there are likely no other users of bhepop2 than us, no new methodologies and source types are expected to be developed, so this is a lot of refactoring for no real use.

However, if you are interested in adding code to Bhepop2 and that this work would help you or you would like to initiate it, feel free to contact us !

@leo-desbureaux-tellae leo-desbureaux-tellae added the enhancement New feature or request label Mar 5, 2024
This was referenced Mar 5, 2024
leo-desbureaux-tellae added a commit that referenced this issue Mar 7, 2024
- New enrichment classes (Close #17)
    - Refactor code into SyntheticPopulationEnrichment classes, dedicated to the population enrichment process
    - Refactor code into EnrichmentSource classes, dedicated to the source data used for enrichment
    - These classes are used in composition: SyntheticPopulationEnrichment instances contain an instance of an EnrichmentSource class
    - Cleaner methods and fewer calls
- Analysis instance creation is exposed in the new classes (Close #36)
- Use and store a random number generator in enrichment classes (Close #30)
- Improved documentation
- Added a new test suite for new modules, removed some tests

These changes introduce questions about a potential need for even more refactoring, see #41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant