Skip to content

Fix zero percentage replacement in get_binned_data Function and update parameter name#1278

Closed
boemer00 wants to merge 3 commits intoevidentlyai:mainfrom
boemer00:min_non_zero
Closed

Fix zero percentage replacement in get_binned_data Function and update parameter name#1278
boemer00 wants to merge 3 commits intoevidentlyai:mainfrom
boemer00:min_non_zero

Conversation

@boemer00
Copy link

@boemer00 boemer00 commented Sep 2, 2024

This pull request introduces changes that address the issue: The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

  • Dynamic Fill Value Calculation: The fill value used to replace zero percentages is now calculated dynamically based on the actual data, rather than using a fixed value.

  • Ensuring Correct Fill Value: The fill value is guaranteed to be smaller than the minimum non-zero percentage in both the reference and current datasets. This adjustment ensures that the data distribution remains accurate.

  • Maintaining Data Distribution: These changes help maintain the correct distribution of data, which is crucial for accurate statistical tests, including Kullback-Leibler divergence drift score calculations.

  • Parameter Name Update: The parameter name has been updated from "feel_zeroes" to "fill_zeroes" for clarity and consistency.

Hope it helps
boemer00

@boemer00
Copy link
Author

boemer00 commented Sep 5, 2024

Enhance get_binned_data Function with Flexible Fill Methods and Dynamic Scaling

This pull request introduces enhancements to my previous PR regarding the get_binned_data function. These changes address issues with handling very small percentages in data drift calculations while providing some flexibility and maintaining its original purpose.

Changes Made:

  1. New Parameters:

    • fill_method (str, default='auto'): Allows users to select the method for calculating the fill value for zero percentages.
    • dynamic_scale (bool, default=False): Enables dynamic scaling of the fill value based on the data distribution.
  2. Fill Methods: Three fill methods are now supported:

    • auto: Calculates a fill value that is the minimum of 1/10 and 1/2 of the smallest non-zero percentage.
    • min: Uses the minimum non-zero percentage as the fill value.
    • mean: Uses the average of the minimum non-zero percentages from reference and current data.
  3. Dynamic Scaling: When enabled, applies an additional scaling factor to the fill value:

    scale_factor = min(min_non_zero_ref, min_non_zero_cur) / max(min_non_zero_ref, min_non_zero_cur)

This is beneficial in cases where there is a significant difference between the minimum non-zero percentages in the reference and current data.

  1. Normalisation: After applying the fill value, the percentages are normalised to ensure they always sum to 1:

reference_percents = reference_percents / np.sum(reference_percents)
current_percents = current_percents / np.sum(current_percents)

  1. Error Handling: Added a ValueError for invalid fill_method values to prevent unexpected behaviour.

Lastly, I am happy to write a test for the new parameters but would appreciate direction on where this should be implemented.

Thanks

@emeli-dral
Copy link
Contributor

The linter check has been failing for a while, and now the branch also has merge conflicts with main. Given the current state of the repo, resolving both issues would likely be more work than starting fresh.

If the changes are still relevant, feel free to rebase onto the latest main and open a new PR. Happy to review it then!

@bharath03-a
Copy link
Contributor

bharath03-a commented Jul 14, 2025

Hey @boemer00, thank you for the original work here. Since this branch had diverged and had some linter and merge issues, I’ve opened a fresh PR based on your implementation and rebased it on the latest main.

New PR: #1660

Hello, @emeli-dral! Happy to continue iteration there!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants