Releases: pegasystems/pega-datascientist-tools
Pega Data Scientist Tools V4
This release of pdstools is a big cleanup from version 3. A lot of changes are breaking - but that's for the best: pdstools is now much easier to maintain, new functionality has a more logical place to go, and the API should be a lot more intuitive. The goal is for the initial V4 release to contain most of the breaking API changes we foresee in a long time. Then, we can of course still change the inner functionality and/or add new functions - but hopefully the most important function schemas/API don't need more changes anytime soon.
✨Highlights
- Farewell R - you've served us well, but pdstools is now Python only
- Introducing the Pega DX API Client
- Starting out with support for the 24.2 Prediction Studio and Knowledge Buddy APIs
- Major refactor of the entire codebase: consistent python naming, optional dependency groups, well-defined typehints
❌Deprecations/removals
- The R version of pdstools has been removed. In case you still want to use the R tools, you should manually clone the repo at the V3.x tag.
- The legacy IH utilities have been dropped. These were old parts of the codebase and untested/unused. New IH utilities are on their way!
- The Wiki documentation has been ported to the (tracked) Python documentation. We'll deprecate the wiki, but keep it live to give external links some time to link to the documentation instead.
🔨Changes
- Consistent pythonic casing, meaning PascalCase for classes & snake_case for methods, variables & arguments
- Much improved typehints, so it's much more obvious what the response of a given function will be
- Fewer 'base' dependencies; different functionality is split up into 'namespaces' that all have their own set of requirements
- The first time you invoke a method in a 'namespace', it verifies the dependencies and gives a clear warning if any are missing
- To expand on the previous point: functionality is split up much more logically. Taking the ADMDatamart class as an example:
  - Plotting functionality is part of ADMDatamart.plot.bubble_chart() (or any other plot of course)
  - The health check and other reports are part of ADMDatamart.generate.health_check() (for instance)
  - The intermediate aggregations needed are part of ADMDatamart.aggregations.pivot() (for instance)
- Using classmethods, we can initialize the ADMDatamart class in particular in a much more flexible way.
  - The main __init__ method of the ADMDatamart class is very simple: it expects two polars.LazyFrames; one for model_data and one for predictor_data. If you've already read in your data, simply use this
  - If, instead, you want to use the previous functionality which automatically found the most recent file in a folder, you should initialize the datamart class like ADMDatamart.from_ds_export()
  - Or, if instead, you are consuming the results of a data flow (including the OOTB Prediction Studio export), you can simply initialize the datamart class like ADMDatamart.from_dataflow_export(model_data="pattern_for_model_files*.json", predictor_data="pattern_for_predictor_files*.json"). We can also cache the files we've read in before by writing to a 'cache' file automatically - this makes things move quickly. This closes #205 as well.
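The classmethod-based initialization described in the list above can be illustrated with a small, self-contained sketch. It is not the pdstools implementation: the Datamart class, its from_latest_export constructor, the file names and the column names are all hypothetical stand-ins, and plain Python lists replace the polars.LazyFrames so the example runs with the standard library only.

```python
import json
import os
import tempfile
from pathlib import Path


class Datamart:
    """Toy stand-in for illustration only; not the real ADMDatamart."""

    def __init__(self, model_data, predictor_data):
        # The plain constructor just stores data that was already read in.
        self.model_data = model_data
        self.predictor_data = predictor_data

    @classmethod
    def from_latest_export(cls, folder, model_glob="model_*.json",
                           predictor_glob="predictor_*.json"):
        # Hypothetical alternative constructor: find the most recently
        # modified file per pattern, read it, then delegate to __init__.
        return cls(
            model_data=cls._read_latest(folder, model_glob),
            predictor_data=cls._read_latest(folder, predictor_glob),
        )

    @staticmethod
    def _read_latest(folder, pattern):
        candidates = list(Path(folder).glob(pattern))
        if not candidates:
            raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
        latest = max(candidates, key=lambda path: path.stat().st_mtime)
        return json.loads(latest.read_text())


# Usage: build a throwaway folder with an old and a new model export.
with tempfile.TemporaryDirectory() as folder:
    exports = [
        ("model_old.json", [{"ModelID": "a"}], 1_000),
        ("model_new.json", [{"ModelID": "b"}], 2_000),
        ("predictor_new.json", [{"PredictorName": "Age"}], 2_000),
    ]
    for name, rows, mtime in exports:
        path = Path(folder) / name
        path.write_text(json.dumps(rows))
        os.utime(path, (mtime, mtime))  # fix modification times explicitly
    datamart = Datamart.from_latest_export(folder)

print(datamart.model_data)
```

Keeping __init__ limited to already-loaded data means every file-discovery strategy can live in its own named constructor, which is the flexibility the notes refer to.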
Full Changelog: V3.5.2...V4.0.0
Pdstools V4 beta 1
V4 brings some pretty major (and necessary) changes. A lot of them are, unfortunately, breaking - but it's for the best. pdstools is now much easier to maintain and keep consistent, and new functionality now has a much more logical place to go.
The goal is for the initial V4 release to contain most of the breaking (API-centric) changes we foresee in a long time. Then, we can of course still change the inner functionality and/or add new functions - but hopefully the most important function schemas/API don't need more changes anytime soon.
✨Highlights
- Farewell R - you've served us well, but pdstools is now Python only
- Introducing the Pega DX API Client
- Starting out with support for the 24.2 Prediction Studio and Knowledge Buddy APIs
- Major refactor of the entire codebase: consistent python naming, optional dependency groups, well-defined typehints
❌Deprecations/removals
- The R version of pdstools has been removed. In case you still want to use the R tools, you should manually clone the repo at the V3.x tag.
- The legacy IH utilities have been dropped. These were old parts of the codebase and untested/unused. New IH utilities are on their way!
- The Wiki documentation has been ported to the (tracked) Python documentation. We'll deprecate the wiki, but keep it live to give external links some time to link to the documentation instead.
🔨Changes
- Consistent pythonic casing, meaning PascalCase for classes & snake_case for methods, variables & arguments
- Much improved typehints, so it's much more obvious what the response of a given function will be
- Fewer 'base' dependencies; different functionality is split up into 'namespaces' that all have their own set of requirements
- The first time you invoke a method in a 'namespace', it verifies the dependencies and gives a clear warning if any are missing
- To expand on the previous point: functionality is split up much more logically. Taking the ADMDatamart class as an example:
  - Plotting functionality is part of ADMDatamart.plot.bubble_chart() (or any other plot of course)
  - The health check and other reports are part of ADMDatamart.generate.health_check() (for instance)
  - The intermediate aggregations needed are part of ADMDatamart.aggregations.pivot() (for instance)
- Using classmethods, we can initialize the ADMDatamart class in particular in a much more flexible way.
  - The main __init__ method of the ADMDatamart class is very simple: it expects two polars.LazyFrames; one for model_data and one for predictor_data. If you've already read in your data, simply use this
  - If, instead, you want to use the previous functionality which automatically found the most recent file in a folder, you should initialize the datamart class like ADMDatamart.from_ds_export()
  - Or, if instead, you are consuming the results of a data flow (including the OOTB Prediction Studio export), you can simply initialize the datamart class like ADMDatamart.from_dataflow_export(model_data="pattern_for_model_files*.json", predictor_data="pattern_for_predictor_files*.json"). We can also cache the files we've read in before by writing to a 'cache' file automatically - this makes things move quickly. This closes #205 as well.
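The 'namespace' idea from the list above, where related methods are grouped and their optional dependencies are verified on first use, can be sketched roughly as follows. This is an assumption-laden illustration, not the way pdstools does it: the Namespace, Plots and Datamart classes and the package name not_a_real_plotting_package are invented for the example.

```python
import importlib.util
import warnings


class Namespace:
    """Groups related methods and checks their optional dependencies once."""

    dependencies = ()  # top-level module names this namespace needs

    def __init__(self, owner):
        self._owner = owner
        self._checked = False

    def _check_dependencies(self):
        # Only the first call does any work, so importing the library and
        # constructing the owner stay cheap.
        if self._checked:
            return
        self._checked = True
        missing = [
            name for name in self.dependencies
            if importlib.util.find_spec(name) is None
        ]
        if missing:
            warnings.warn(
                f"{type(self).__name__} is missing packages: {', '.join(missing)}"
            )


class Plots(Namespace):
    # One real stdlib module and one deliberately non-existent package.
    dependencies = ("json", "not_a_real_plotting_package")

    def bubble_chart(self):
        self._check_dependencies()
        return f"bubble chart for {len(self._owner.rows)} rows"


class Datamart:
    def __init__(self, rows):
        self.rows = rows
        self.plot = Plots(self)  # accessed as datamart.plot.bubble_chart()


datamart = Datamart(rows=[1, 2, 3])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = datamart.plot.bubble_chart()
    second = datamart.plot.bubble_chart()  # dependencies are not re-checked
messages = [str(item.message) for item in caught]
print(messages)
```

In this sketch a missing package only produces a warning; a real library might instead raise an error with installation instructions.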
Todo before release:
- Update Pega Academy article https://academy.pega.com/topic/data-scientist-tools-customer-decision-hub/v1
- Further improve test coverage
- Complete missing docstrings
- Perform further internal testing
- Ensure all linked issues are fixed
- Improve some of the optional imports that are imported on library import
Full Changelog: V4.0.0-alpha.1...V4.0.0-beta.1
Pdstools V4 alpha 1
V4 brings some pretty major (and necessary) changes. A lot of them are, unfortunately, breaking - but it's for the best. pdstools is now much easier to maintain and keep consistent, and new functionality now has a much more logical place to go.
The goal is for the initial V4 release to contain most of the breaking (API-centric) changes we foresee in a long time. Then, we can of course still change the inner functionality and/or add new functions - but hopefully the most important function schemas/API don't need more changes anytime soon.
✨Highlights
- Farewell R - you've served us well, but pdstools is now Python only
- Introducing the Pega DX API Client
- Starting out with support for the 24.2 Prediction Studio and Knowledge Buddy APIs
- Major refactor of the entire codebase: consistent python naming, optional dependency groups, well-defined typehints
❌Deprecations/removals
- The R version of pdstools has been removed. In case you still want to use the R tools, you should manually clone the repo at the V3.x tag.
- The legacy IH utilities have been dropped. These were old parts of the codebase and untested/unused. New IH utilities are on their way!
🔨Changes
- Consistent pythonic casing, meaning PascalCase for classes & snake_case for methods, variables & arguments
- Much improved typehints, so it's much more obvious what the response of a given function will be
- Fewer 'base' dependencies; different functionality is split up into 'namespaces' that all have their own set of requirements
- The first time you invoke a method in a 'namespace', it verifies the dependencies and gives a clear warning if any are missing
- To expand on the previous point: functionality is split up much more logically. Taking the ADMDatamart class as an example:
  - Plotting functionality is part of ADMDatamart.plot.bubble_chart() (or any other plot of course)
  - The health check and other reports are part of ADMDatamart.generate.health_check() (for instance)
  - The intermediate aggregations needed are part of ADMDatamart.aggregations.pivot() (for instance)
- Using classmethods, we can initialize the ADMDatamart class in particular in a much more flexible way.
  - The main __init__ method of the ADMDatamart class is very simple: it expects two polars.LazyFrames; one for model_data and one for predictor_data. If you've already read in your data, simply use this
  - If, instead, you want to use the previous functionality which automatically found the most recent file in a folder, you should initialize the datamart class like ADMDatamart.from_ds_export()
  - Or, if instead, you are consuming the results of a data flow (including the OOTB Prediction Studio export), you can simply initialize the datamart class like ADMDatamart.from_dataflow_export(model_data="pattern_for_model_files*.json", predictor_data="pattern_for_predictor_files*.json"). We can also cache the files we've read in before by writing to a 'cache' file automatically - this makes things move quickly. This closes #205 as well.
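The combination of file patterns and a 'cache' file mentioned for data flow exports could work along these lines. The sketch below is a guess at the general technique, not the pdstools code: read_with_cache is a hypothetical helper, and it assumes each export file holds a JSON array, which may not match the real export format.

```python
import glob
import json
import tempfile
from pathlib import Path


def read_with_cache(pattern, cache_file):
    """Combine all JSON files matching `pattern`, reusing `cache_file` if present.

    Returns (rows, came_from_cache).
    """
    cache_file = Path(cache_file)
    if cache_file.exists():
        # A previous call already combined the files: skip re-reading them.
        return json.loads(cache_file.read_text()), True
    rows = []
    for path in sorted(glob.glob(str(pattern))):
        rows.extend(json.loads(Path(path).read_text()))
    cache_file.write_text(json.dumps(rows))
    return rows, False


with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    (folder / "model_files_1.json").write_text(json.dumps([{"ModelID": "a"}]))
    (folder / "model_files_2.json").write_text(json.dumps([{"ModelID": "b"}]))
    cache = folder / "cache_models.json"  # name chosen not to match the pattern
    first_rows, first_cached = read_with_cache(folder / "model_files*.json", cache)
    second_rows, second_cached = read_with_cache(folder / "model_files*.json", cache)

print(first_cached, second_cached)
```

A cache like this goes stale when new export files arrive, so a real implementation would need some way to invalidate or refresh it.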
Todo before release:
- Update Pega Academy article https://academy.pega.com/topic/data-scientist-tools-customer-decision-hub/v1
- Further improve test coverage
- Complete missing docstrings
- Perform further internal testing
- Ensure all linked issues are fixed
- Improve some of the optional imports that are imported on library import
Pega Data Scientist Tools V3.5.0: Polars V1
While we've been hard at work creating version 4 of pdstools, I wanted to get one last release for V3 out of the way.
V4 brings some pretty sizable changes; we'll deprecate the R tools and fully rework all Python classes to make them more consistent & maintainable, including renaming pretty much all classes & methods. Since things will be breaking, it may be desirable to sometimes fall back to V3 while transitioning. However, our V3 branch was falling out of date, mainly because we were tied to Polars < 1. This minor update brings Polars V1 support. It does require Polars > 1.9, as we were facing an IPC serialization bug in earlier versions. We likely will not support the V3 branch out into the future, but if necessary we could accept small bug fixes down the line.
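If you want to verify the Polars requirement in your own environment before falling back to V3, a check along the following lines may help. It is only a sketch: is_newer and installed_polars_is_newer are hypothetical helpers, and the bound is treated as strict because the note says "> 1.9" without stating whether 1.9.0 itself is acceptable. Versions are compared as number tuples because a plain string comparison would rank "1.10.0" below "1.9".

```python
from importlib import metadata
from itertools import takewhile


def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_newer(installed, minimum):
    """True if `installed` is strictly newer than `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.9' and '1.9.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


def installed_polars_is_newer(minimum="1.9"):
    """Return True/False, or None when polars is not installed at all."""
    try:
        return is_newer(metadata.version("polars"), minimum)
    except metadata.PackageNotFoundError:
        return None


print(is_newer("1.10.0", "1.9"), "1.10.0" > "1.9")
```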
What's Changed
- Make Quarto Render OS-Agnostic and Add Version Info to Health Check Logs by @yusufuyanik1 in #257
- Fixing issue with inconsitent coloring and ordering by @operdeck in #261
- [WIP] Compatibility with Polars V1 by @StijnKas in #233
Full Changelog: V3.4.7...V3.5.0
Pega Data Scientist Tools V3.4.6
Enhancement and Bug Fixes
This patch release, V3.4.6, includes multiple enhancements and bug fixes aimed at improving functionality and user experience.
Key updates include improved logging for Health Check (#250), graceful handling of empty data frames, prediction analysis improvements and more.
What's Changed
- Make regex strings raw strings by @StijnKas in #232
- Gracefully handle empty data frame for extract_keys by @StijnKas in #234
- Prediction analysis by @operdeck in #237
- Fixed timezone issue with new polars by @operdeck in #238
- remove DA from EE article by @yusufuyanik1 in #240
- fix the sample data path by @yusufuyanik1 in #241
- Experimental extra plot to show class separation by @operdeck in #239
- Moved PDC specifics back to a separate class. Made ADM more robust ag… by @operdeck in #243
- Missing cast of by-period to date in summaries by @operdeck in #245
- Prediction fixes by @operdeck in #246
- Numbers formatted more human friendly by @operdeck in #247
- Improved signature of AUC from bin methods to support a direct ordering by @operdeck in #248
Full Changelog: V3.4.4...V3.4.6
Pega Data Scientist Tools V3.4: Binning Insights
Rolling up the ADM bins
This release adds additional insights from the ADM binning data, letting you find information on predictors across models and channels!
Check out the explainer article here.
What's Changed
- Issue 119 by @operdeck in #185
- Added a few more test cases by @operdeck in #187
- Improved coverage by @operdeck in #188
- Improved coverage by @operdeck in #189
- ISSUE_186 by @operdeck in #190
- Support some of the accounts with many more configurations by @operdeck in #193
- Added aggregated bin insights to examples by @operdeck in #197
- Reviewed text sections by @operdeck in #198
Full Changelog: V3.3...V3.4
Pega Data Scientist Tools V3.3
HealthCheck App Changes
In this release, we've primarily focused on improving the HealthCheck app, making it more powerful and user-friendly.
- Beyond the comprehensive global HealthCheck report, you now have the capability to generate individual model reports. These reports allow you to delve into the performance of a specific model, providing an in-depth view of predictors and their individual effects on propensity.
- You can now run HealthCheck in the cloud, directly from GitHub, without the need to install any tools. A GitHub codespace is a development environment that's hosted in the cloud. Each codespace you create is hosted by GitHub in a Docker container, running on a virtual machine. See the Wiki for all the possible ways to run the ADM HealthCheck.
- If our out-of-the-box report isn't quite what you're looking for, you can now export the latest Datamart Snapshot in Excel format, empowering you to perform your own custom analysis.
- You can save your filters and upload them later to avoid repetition.
Code Changes
- Added support for Python 3.12
- Aligned with performance improvements coming with the latest version of polars
What's Changed
- Health Check update by @yusufuyanik1 in #115
- HealthCheck fixes by @yusufuyanik1 in #117
- Article Fixes by @yusufuyanik1 in #121
- Revert changes in Data Anonymization article by @yusufuyanik1 in #122
- Made the off line reports shine even more by @operdeck in #123
- fix overtime plot metric by @yusufuyanik1 in #124
- polars version upgrade by @yusufuyanik1 in #125
- polars version update remaining by @yusufuyanik1 in #126
- Initial cut of python version of off-line model reports by @operdeck in #129
- Usability updates to the Value Finder code by @StijnKas in #131
- Update test to reflect the new treatment col by @StijnKas in #133
- Add devcontainer in support for codespaces by @StijnKas in #135
- Added mostly TODOs and fixed some text and simple layout things by @operdeck in #137
- Added trivial cmd line args for easier calling from the outside by @operdeck in #139
- Issue 128 by @operdeck in #142
- ALigned channel overview by @operdeck in #143
- Update-azureopenai-version by @StijnKas in #150
- Reports more robust for various cornercases by @operdeck in #151
- Included colored styling in tables to highlight issues. Made model re… by @operdeck in #152
- Hc improvements by @operdeck in #154
- Hc improvements by @operdeck in #161
- Fixed incorrect BinIndex type by @StijnKas in #162
- Bump version & bump polars min version by @StijnKas in #164
- Standalone Report improvements by @operdeck in #166
- Replaced gains charts in HC by @operdeck in #168
- Supporting utilities to run reports in batch and unattended by @operdeck in #170
- Bump azure openai version to latest, support openai v1 by @StijnKas in #169
- Streamlit App Changes by @yusufuyanik1 in #155
- Doc cleanup by @operdeck in #172
- Dropped old batch scripts in favor of recently introduced new ones by @operdeck in #173
- HealthCheck fix by @yusufuyanik1 in #171
- remove context_keys selection in app by @yusufuyanik1 in #174
- Fix errors and depreciations from polars version bump by @yusufuyanik1 in #176
- improve polars patch PR by @yusufuyanik1 in #177
- Python 3.12 support, bump version by @StijnKas in #144
Full Changelog: V3.2...V3.3
Pega Data Scientist Tools V3.2
What's Changed
Nothing major in this release, but a lot of bugfixes and versioning compatibilities. For an overview of pull requests:
- Use local path in docs workflow by @StijnKas in #89
- Fix: Cast cat column to str in ADMExplained by @yusufuyanik1 in #92
- Health Check Set Up article added by @yusufuyanik1 in #93
- Fix: error logging and version mismatches by @yusufuyanik1 in #94
- Adm explained revision by @yusufuyanik1 in #97
- Added formula for using Beta directly with positives by @operdeck in #98
- Health Check consistency fixes by @yusufuyanik1 in #99
- Support finding active range in classifiers through a convenience fun by @operdeck in #105
- Supporting active range AUCs in standard offline Model reports by @operdeck in #106
- Fixes score calculation details see https://github.com/pegasystems/pe… by @operdeck in #108
- Changed to a better example predictor and added more formulae by @operdeck in #109
- Changed to a better example predictor and added more formulae by @operdeck in #110
- series.cut() fix along with ADMExplained reproducibility improvement by @yusufuyanik1 in #111
- ADMExplained fix by @yusufuyanik1 in #112
- freeze polars version and article fixes by @yusufuyanik1 in #113
Full Changelog: V3.1...V3.2
Pega Data Scientist Tools V3.1
In case you haven’t seen it yet, V3.0 brought many important changes. V3.1 is a minor release, but brings some nice usability changes:
What’s new?
- Rebuilt the pdstools app to be much more user-friendly and easy to use
- Added a models-only Health Check
- Added a Tables class to generate tables and export them to Excel
- Allow for multiple pl.Exprs in ._apply_query
- Added Thompson Sampling & ADM Explained articles to the Python docs
- Added plotPredictorContribution to the main plots
- Basic S3 tools, for reading Pega Repository datasets, including get_ADMDatamart to get the datamart directly from S3
- pdstools.show_versions allows you to easily get the versions of installed packages
- Issue templates to more easily and clearly define gh issues
What’s changed?
- Moved Health Check generation responsibility to the ADMDatamart class
- Moved Health Check files to reports
- Updated Value Finder to a more streamlined implementation
- Separated IO into pega_io
- Support for more OOTB timestamp formats
- Fixed a bug in getMultiTrees that caused different models to not separate properly
- Fixed compatibility with Polars version 0.17
Technical improvements
- Automated documentation builds & deployment
- Automated pypi releases
- Automated tests for Health Check
- Fixed tests not being found by VS Code
Full Changelog: V3.0...V3.1
Pega Data Scientist Tools V3.0
A new major version with major changes!
Highlights:
- Pdstools now uses Polars as the backend, replacing Pandas. See this article for a summary of the changes
- The crowd favorite ADM Healthcheck has been fully ported over to Python, alongside a streamlit app. Simply call pdstools run in your terminal to get started!
- The matplotlib plots have been deprecated, and only plotly is supported. Plots that were matplotlib only have been removed.
- Added data anonymization tools, see this article for more information
Other changes
- Minimum Python version is now bumped to 3.8
- The new Polars backend touched almost all areas of the codebase. All plot functions, backend functions and aggregations have been ported over.
- cdh_utils & ADMDatamart imports return pl.LazyFrames by default
- Overwrite mapping functionality is removed. If you need the legacy functionality, you can manually read in the data as pl.LazyFrames, and then call .rename()
- ModelName is renamed to Name for consistency
- ADMDatamart keyword arguments have been added to the main class signature, making them easier to find & use
- query arguments should now use pl.Expr for querying, keeping the lazy execution path alive
- extract_treatment has been renamed to extract_keys, and is now just a boolean. If True, it will extract all extra keys in pyName
- Added last_ResponseCount and last_Positives columns, indicating the last timestamp at which either of these columns increased. This is useful for estimating whether an action has stopped getting responses and has therefore become inactive
- Added a save_data() method to the ADMDatamart class, that will save the modelData and predictorData to local files
- Updated docstrings & tests to be consistent and up-to-date
- Added a FeatureImportance function, closing #49
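To make the idea behind last_ResponseCount and last_Positives concrete, here is a plain-Python sketch of "the last timestamp at which a counter increased". It is not the Polars-based code in pdstools: last_increase is a hypothetical helper, the SnapshotTime key is an assumed column name, and the counters are assumed to be cumulative per snapshot.

```python
from datetime import date


def last_increase(snapshots, column):
    """Return the latest snapshot time at which `column` rose above its previous value.

    Returns None if the value never increased between snapshots.
    """
    last = None
    previous = None
    for row in sorted(snapshots, key=lambda r: r["SnapshotTime"]):
        if previous is not None and row[column] > previous:
            last = row["SnapshotTime"]
        previous = row[column]
    return last


# Made-up snapshots for one model: responses stop growing after 3 January,
# positives already stop growing after 2 January.
snapshots = [
    {"SnapshotTime": date(2023, 1, 1), "ResponseCount": 100, "Positives": 10},
    {"SnapshotTime": date(2023, 1, 2), "ResponseCount": 150, "Positives": 12},
    {"SnapshotTime": date(2023, 1, 3), "ResponseCount": 180, "Positives": 12},
    {"SnapshotTime": date(2023, 1, 4), "ResponseCount": 180, "Positives": 12},
]

last_response = last_increase(snapshots, "ResponseCount")
last_positive = last_increase(snapshots, "Positives")
print(last_response, last_positive)
```

An action whose last increase lies well before the most recent snapshot is a candidate for having become inactive.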
New Contributors
- @shaniyahassanali made their first contribution in #70
- @yusufuyanik1 made their first contribution in #73
Full Changelog: V2.2...V3.0