You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In the last state of Bhepop2, EnrichmentSource subclasses (for instance MarginalDistributions') describe the nature of the distributions they are manipulating (qualitative, quantitative with deciles).
This leads to code duplication, for instance when classes like QuantitativeGlobalDistribution also use quantitative distributions, but this time a single one, that describes all the population. Some of the code is then duplicated, like validation of the source data, feature states evaluation, or drawing a value in this feature state.
Proposition
A good implementation would seem to be that EnrichmentSource subclasses would only describe the structure of the distributions (ex: one for each modality) and how they are linked to the population (ex: applies to the individuals with an equal modality), independently of the type of distributions they contain.
Other classes would then describe the various kind of distributions: qualitative, quantitative deciles, we could even imagine using discrete probability distributions like Poisson. Specific functions, like evaluating an interval probability from a decile distribution by interpolation (interpolate_feature_prob) could me moved to their dedicated class, for instance DecilesDistribution
In the end, instead of having something like that
---
title: Enrichment classes
---
classDiagram
SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
SyntheticPopulationEnrichment *-- EnrichmentSource
EnrichmentSource <|-- MarginalDistributions
MarginalDistributions <|-- QualitativeMarginalDistributions
MarginalDistributions <|-- QuantitativeMarginalDistributions
class SyntheticPopulationEnrichment{
<<Abstract>>
+DataFrame population
+EnrichmentSource source
+String feature_name
+int seed
+assign_features()
+compare_with_source()
}
class Bhepop2Enrichment{
+MarginalDistributions source
-_optimise()
-_get_feature_probs()
}
namespace Enrichment sources {
class EnrichmentSource{
<<Abstract>>
+any data
+list feature_values
+int nb_feature_values
+get_value_for_feature(feature)
+compare_with_populations(populations, feature_name)
}
class MarginalDistributions{
<<Abstract>>
+list attribute_selection
+dict modalities
+get_modality_distribution()
+compute_feature_prob(attribute, modality)
-_validate_data_type()
}
class QualitativeMarginalDistributions{
}
class QuantitativeMarginalDistributions{
}
}
Loading
We would have something like that
---
title: Enrichment classes
---
classDiagram
SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
SyntheticPopulationEnrichment *-- EnrichmentSource
EnrichmentSource <|-- MarginalDistributions
EnrichmentSource *-- Distribution
Distribution <|-- DecilesDistribution
Distribution <|-- QualitativeDistribution
namespace Enrichment {
class SyntheticPopulationEnrichment{
<<Abstract>>
+DataFrame population
+EnrichmentSource source
+String feature_name
+int seed
+assign_features()
+compare_with_source()
}
class Bhepop2Enrichment{
+MarginalDistributions source
-_optimise()
-_get_feature_probs()
}
}
namespace Sources {
class EnrichmentSource{
<<Abstract>>
+any data_distributions
+get_value_for_feature(feature)
+compare_with_populations(populations, feature_name)
}
class MarginalDistributions{
+list attribute_selection
+dict modalities
+get_modality_distribution()
+compute_feature_prob(attribute, modality)
}
}
namespace Distributions {
class Distribution{
<<Abstract>>
+any some_attributes
+get_prob_of_feature_state(feature_state)
+evaluate_distribution_on_population(column)
}
class QualitativeDistribution{
+list values
+list probs
}
class DecilesDistribution{
+list deciles
}
}
Loading
Related issues
This should solve at least partially the following issues:
Some points are still not clear about such implementation:
Where would be stored the feature states (the states we choose among when enriching a population, like intervals or categories)
These states are dependent on the distributions, but they can also be a combination of the states of several distributions, as in QuantitativeMarginalDistributions
Where would be given the parameters used to evaluate these states (abs_minimum and relative_maximum are Distribution wise, while delta_min is more generic).
Getting a value for a feature state (drawing in an interval for instance) is really linked to the nature of the feature state, not really to the Distribution that led to fix its bounds. So do we need to create FeatureState classes too ?
Where do analysis classes stand in all of that ? It seems that Distributions should implement ways to evaluate same nature distributions from a population, and ways to compare it with the source distribution, for instance with plots. And EnrichmentSource classes would provide a way to link these analysis together, for instance by looping other the modalities and creating a plot for each one.
Contribution
This work will likely take some time. Since there are likely no other users of bhepop2 than us, no new methodologies and source types are expected to be developed, so this is a lot of refactoring for no real use.
However, if you are interested in adding code to Bhepop2 and that this work would help you or you would like to initiate it, feel free to contact us !
The text was updated successfully, but these errors were encountered:
- New enrichment classes (Close#17)
- Refactor code into SyntheticPopulationEnrichment classes, dedicated to the population enrichment process
- Refactor code into EnrichmentSource classes, dedicated to the source data used for enrichment
- These classes are used in composition: SyntheticPopulationEnrichment instances contain an instance of an EnrichmentSource class
- Cleaner methods and fewer calls
- Analysis instance creation is exposed in the new classes (Close#36)
- Use and store a random number generator in enrichment classes (Close#30)
- Improved documentation
- Added a new test suite for new modules, removed some tests
These changes introduce questions about a potential need for even more refactoring, see #41
Actual state
In the last state of Bhepop2, EnrichmentSource subclasses (for instance MarginalDistributions') describe the nature of the distributions they are manipulating (qualitative, quantitative with deciles).
This leads to code duplication, for instance when classes like QuantitativeGlobalDistribution also use quantitative distributions, but this time a single one, that describes all the population. Some of the code is then duplicated, like validation of the source data, feature states evaluation, or drawing a value in this feature state.
Proposition
A good implementation would seem to be that EnrichmentSource subclasses would only describe the structure of the distributions (ex: one for each modality) and how they are linked to the population (ex: applies to the individuals with an equal modality), independently of the type of distributions they contain.
Other classes would then describe the various kind of distributions: qualitative, quantitative deciles, we could even imagine using discrete probability distributions like Poisson. Specific functions, like evaluating an interval probability from a decile distribution by interpolation (interpolate_feature_prob) could me moved to their dedicated class, for instance DecilesDistribution
In the end, instead of having something like that
We would have something like that
Related issues
This should solve at least partially the following issues:
Questions
Some points are still not clear about such implementation:
Contribution
This work will likely take some time. Since there are likely no other users of bhepop2 than us, no new methodologies and source types are expected to be developed, so this is a lot of refactoring for no real use.
However, if you are interested in adding code to Bhepop2 and that this work would help you or you would like to initiate it, feel free to contact us !
The text was updated successfully, but these errors were encountered: