Distributions could be refactored into classes #41

leo-desbureaux-tellae · 2024-03-05T16:28:43Z

Actual state

In the last state of Bhepop2, EnrichmentSource subclasses (for instance MarginalDistributions') describe the nature of the distributions they are manipulating (qualitative, quantitative with deciles).

This leads to code duplication, for instance when classes like QuantitativeGlobalDistribution also use quantitative distributions, but this time a single one, that describes all the population. Some of the code is then duplicated, like validation of the source data, feature states evaluation, or drawing a value in this feature state.

Proposition

A good implementation would seem to be that EnrichmentSource subclasses would only describe the structure of the distributions (ex: one for each modality) and how they are linked to the population (ex: applies to the individuals with an equal modality), independently of the type of distributions they contain.

class MarginalDistributions(EnrichmentSource):
    attribute_selection: list
    modalities: dict
    distributions: {(attribute, modality): Distribution}
    get_modality_distribution(attribute, modality)

Other classes would then describe the various kind of distributions: qualitative, quantitative deciles, we could even imagine using discrete probability distributions like Poisson. Specific functions, like evaluating an interval probability from a decile distribution by interpolation (interpolate_feature_prob) could me moved to their dedicated class, for instance DecilesDistribution

In the end, instead of having something like that

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    MarginalDistributions <|-- QualitativeMarginalDistributions
    MarginalDistributions <|-- QuantitativeMarginalDistributions
    
    
    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }

namespace Enrichment sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data
        +list feature_values
        +int nb_feature_values
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        <<Abstract>>
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
        -_validate_data_type()
    }

    class QualitativeMarginalDistributions{
        
    }

    class QuantitativeMarginalDistributions{
    
}
}

We would have something like that

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    EnrichmentSource *-- Distribution
    Distribution <|-- DecilesDistribution
    Distribution <|-- QualitativeDistribution
    
   namespace Enrichment { 
    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }
   }

namespace Sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data_distributions
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
    }
}

namespace Distributions {
    class Distribution{
        <<Abstract>>
        +any some_attributes
        +get_prob_of_feature_state(feature_state)
        +evaluate_distribution_on_population(column)
    }

    class QualitativeDistribution{
        +list values
        +list probs
    }

    class DecilesDistribution{
        +list deciles
    }
}

Related issues

This should solve at least partially the following issues:

Questions

Some points are still not clear about such implementation:

Where would be stored the feature states (the states we choose among when enriching a population, like intervals or categories)
- These states are dependent on the distributions, but they can also be a combination of the states of several distributions, as in QuantitativeMarginalDistributions
- Where would be given the parameters used to evaluate these states (abs_minimum and relative_maximum are Distribution wise, while delta_min is more generic).
Getting a value for a feature state (drawing in an interval for instance) is really linked to the nature of the feature state, not really to the Distribution that led to fix its bounds. So do we need to create FeatureState classes too ?
Where do analysis classes stand in all of that ? It seems that Distributions should implement ways to evaluate same nature distributions from a population, and ways to compare it with the source distribution, for instance with plots. And EnrichmentSource classes would provide a way to link these analysis together, for instance by looping other the modalities and creating a plot for each one.

Contribution

This work will likely take some time. Since there are likely no other users of bhepop2 than us, no new methodologies and source types are expected to be developed, so this is a lot of refactoring for no real use.

However, if you are interested in adding code to Bhepop2 and that this work would help you or you would like to initiate it, feel free to contact us !

- New enrichment classes (Close #17) - Refactor code into SyntheticPopulationEnrichment classes, dedicated to the population enrichment process - Refactor code into EnrichmentSource classes, dedicated to the source data used for enrichment - These classes are used in composition: SyntheticPopulationEnrichment instances contain an instance of an EnrichmentSource class - Cleaner methods and fewer calls - Analysis instance creation is exposed in the new classes (Close #36) - Use and store a random number generator in enrichment classes (Close #30) - Improved documentation - Added a new test suite for new modules, removed some tests These changes introduce questions about a potential need for even more refactoring, see #41

leo-desbureaux-tellae added the enhancement New feature or request label Mar 5, 2024

This was referenced Mar 5, 2024

refactor!: enrichment classes #42

Merged

Roadmap v1.0.0 #40

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distributions could be refactored into classes #41

Distributions could be refactored into classes #41

leo-desbureaux-tellae commented Mar 5, 2024 •

edited

Loading

Distributions could be refactored into classes #41

Distributions could be refactored into classes #41

Comments

leo-desbureaux-tellae commented Mar 5, 2024 • edited Loading

Actual state

Proposition

Related issues

Questions

Contribution

leo-desbureaux-tellae commented Mar 5, 2024 •

edited

Loading