Releases: CrayLabs/SmartSim
v0.8.0
Released on 27 September, 2024
What's Changed
Features
- Implement support for SGE by @ashao in #610
- Add ability to specify hardware policies on dragon run requests by @ankona in #638
- Added type checking to params on model by @juliaputko in #676
Bug Fixes
- Fix build error caused by use of deprecated pkg_resources by @ankona in #598
- Mitigate dragon/numpy, mypy/typing_extension dependency issues by @ankona in #653
Build Improvements
- Building SmartSim without ML backends by @m-kurz in #601
- Remove builder from setup.py by @ashao in #654
- Refactor RedisAI Build by @ashao in #669
Miscellaneous Improvements
- Update tutorials and tutorial containers by @al-rigazzi in #589
- Fix util-tests outputs appearing in root directory by @ankona in #614
- Restrict to numpy 1.x by @ashao in #623
- Remove broken redis documentation links by @ankona in #627
- More easily discoverable dependencies by @ashao in #635
- Update codecov to 4.5.0 by @mellis13 in #657
- Pin watchdog version to prevent mypy errors by @ashao in #690
- Refine install documentation for Perlmutter and Frontier by @ashao in #717
- Change 'conda activate' to 'source activate' for Frontier by @ashao in #719
- Make a user-specific db cache by @ashao in #727
- Update release action to remove CI Build Wheel by @MattToast in #728
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Released on 14 May, 2024
What's Changed
Features
- Specify node feature for slurm job by @amandarichardsonn in #529
- Optionally skip building Torch with Intel MKL by @ashao in #538
- Store SmartSim entity logs under the .smartsim directory by @AlyssaCote in #532
- Change default path for entities by @amandarichardsonn in #533
- Preview by @juliaputko in #525
- Dragon launcher by @al-rigazzi in #580
Bug Fixes
- Update Redis dependency to 7.2.4 by @ankona in #507
- Formatting in Jupyter Notebooks by @amandarichardsonn in #516
- Correct ExecArgs Handling During RunSetting by @amandarichardsonn in #517
- Application executes before colocated Orchestrator is created by @amandarichardsonn in #522
- Fix telemetry monitor logging errors for task history by @ankona in #557
- Remove init_default function by @amandarichardsonn in #545
- Symlink batch ensembles and batch models by @AlyssaCote in #547
API Breaks
- Change Status Module by @amandarichardsonn in #509
- Remove Long Deprecated SmartSim Modules by @amandarichardsonn in #514
- Discontinue SmartSim support for python 3.8 by @AlyssaCote in #544
Miscellaneous Improvements
- Duplicate entity name prevention by @amandarichardsonn in #480
- Change generic t.any in Experiment API by @mellis13 in #501
- Smartsim Documentation Refactor by @amandarichardsonn in #463
- Enforce changelog for SmartSim PRs by @amandarichardsonn in #518
- ReadTheDocs Configuration File by @amandarichardsonn in #512
- Remove duplicate launched model names from full test suite by @MattToast in #520
- Mypy: Error on Common Truthy Mistakes by @MattToast in #524
- Add memory & conn collector, collector manager, tests by @ankona in #460
- Promote Build Device Option to Enum by @amandarichardsonn in #527
- Disallow Uninitialized Variable Use by @MattToast in #521
- Readthedocs import extension error by @amandarichardsonn in #537
- Enhanced Signal Management by @MattToast in #535
- Update watchdog dependency by @ankona in #540
- Upgrade ubuntu to 22.04 by @AlyssaCote in #558
- Bump manifest.json version to 0.0.4 by @AlyssaCote in #563
- Force typing_extensions==4.6.1 in doc build by @ashao in #564
- Adapt tests to reuse Orchestrator by @ashao in #567
- Dragon server enhancement by @al-rigazzi in #582
Full Changelog: v0.6.2...v0.7.0
v0.6.2
Released on 16 February, 2024
Description
- Patch SmartSim dependency version
Detailed Notes
- A critical performance concern was identified and addressed in SmartRedis. A patch fix was deployed, and SmartSim was updated to ensure users do not inadvertently pull the unpatched version of SmartRedis. (SmartSim-PR493)
v0.6.1
Released on 15 February, 2024
Description
- Duplicate for DBModel/Script prevented
- Update license to include 2024
- Telemetry monitor is now active by default
- Add support for Mac OSX on Apple Silicon
- Remove Torch warnings during testing
- Validate Slurm timing format
- Expose Python Typehints
- Fix test_logs to prevent generation of directory
- Fix Python Typehint for colocated database settings
- Python 3.11 Support
- Quality of life smart validate improvements
- Remove Cobalt support
- Enrich logging through context variables
- Upgrade Machine Learning dependencies
- Override sphinx-tabs background color
- Add concurrency group to test workflow
- Fix index when installing torch through smart build
Detailed Notes
- Modify the git clone for both Redis and RedisAI to set the line endings to unix-style line endings when using MacOS on ARM. (SmartSim-PR482)
- Separate install instructions are now provided for Mac OSX on x64 vs ARM64 (SmartSim-PR479)
- Prevent duplicate ML model and script names from being added to an Ensemble member if the names already exist. (SmartSim-PR475)
- Updates Copyright (c) 2021-2023 to Copyright (c) 2021-2024 in all of the necessary files. (SmartSim-PR485)
- Fix a bug that prevented the expected behavior when the SMARTSIM_LOG_LEVEL environment variable was set to developer. (SmartSim-PR473)
- Sets the default value of the "enable telemetry" flag to on. Bumps the output manifest.json version number to match that of smartdashboard and pins a watchdog version to avoid build errors. (SmartSim-PR477)
- Refactor logic of Manifest.has_db_objects to remove excess branching and improve readability/maintainability. (SmartSim-PR476)
- SmartSim can now be built and used on platforms using Apple Silicon (ARM64). Currently, only the PyTorch backend is supported. Note that libtorch will be downloaded from a CrayLabs github repo. (SmartSim-PR465)
- Tests that were saving Torch models were emitting warnings. These warnings were addressed by updating the model save test function. (SmartSim-PR472)
- Validate the timing format when requesting a slurm allocation. (SmartSim-PR471)
- Add and ship py.typed marker to expose inline type hints. Fix type errors related to SmartRedis. (SmartSim-PR468)
- Fix the test_logs.py::test_context_leak test that was erroneously creating a directory named some value in SmartSim's root directory. (SmartSim-PR467)
- Add Python type hinting to colocated settings. (SmartSim-PR462)
- Add github actions for running black and isort checks. (SmartSim-PR464)
- Relax the required version of typing_extensions. (SmartSim-PR459)
- Addition of Python 3.11 to SmartSim. (SmartSim-PR461)
- Quality of life smart validate improvements such as setting CUDA_VISIBLE_DEVICES environment variable within smart validate prior to importing any ML deps to prevent false negatives on multi-GPU systems. Additionally, move SmartRedis logs from standard out to dedicated log file in the validation temporary directory as well as suppress sklearn deprecation warning by pinning KMeans constructor argument. Lastly, move TF test to last as TF may reserve the GPUs it uses. (SmartSim-PR458)
- Some actions in the current GitHub CI/CD workflows were outdated. They were replaced with the latest versions. (SmartSim-PR446)
- As the Cobalt workload manager is not used on any system we are aware of, its support in SmartSim was terminated and classes such as CobaltLauncher have been removed. (SmartSim-PR448)
- Experiment logs are written to a file that can be read by the dashboard. (SmartSim-PR452)
- Updated SmartSim's machine learning backends to PyTorch 2.0.1, Tensorflow 2.13.1, ONNX 1.14.1, and ONNX Runtime 1.16.1. As a result of this change, there is now an available ONNX wheel for use with Python 3.10, and wheels for all of SmartSim's machine learning backends with Python 3.11. (SmartSim-PR451) (SmartSim-PR461)
- The sphinx-tabs documentation extension uses a white background for the tabs component. A custom CSS for those components to inherit the overall theme color has been added. (SmartSim-PR453)
- Add concurrency groups to GitHub's CI/CD workflows, preventing multiple workflows from the same PR from being launched concurrently. (SmartSim-PR439)
- Torch changed their preferred indexing when trying to install their provided wheels. Updated the pip install command within smart build to ensure that the appropriate packages can be found. (SmartSim-PR449)
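The smart validate note above (SmartSim-PR458) relies on a general ordering rule: CUDA_VISIBLE_DEVICES has to be set before any ML framework is imported, because frameworks typically read it once when they initialize CUDA. A minimal Python sketch of that ordering follows. The helper name pin_visible_devices and the device string "0" are illustrative assumptions, not SmartSim's actual implementation.

```python
import os


def pin_visible_devices(devices: str = "0") -> str:
    """Restrict the GPUs that later-imported ML libraries can see.

    Hypothetical helper for illustration; not part of SmartSim.
    """
    # Must run before importing torch/tensorflow: the variable is
    # typically read once, when the CUDA runtime is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = pin_visible_devices("0")
# Import ML dependencies only after this point, e.g. `import torch`.
print(f"CUDA_VISIBLE_DEVICES={visible}")
```

Setting the variable after the framework has already initialized its GPU context generally has no effect, which is why the ordering matters on multi-GPU systems.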
v0.6.0
Released on 18 December, 2023
Description
- Conflicting directives in the SmartSim packaging instructions were fixed
- sacct and sstat errors are now fatal for Slurm-based workflow executions
- Added documentation section about ML features and TorchScript
- Added TorchScript functions to Online Analysis tutorial
- Added multi-DB example to documentation
- Improved test stability on HPC systems
- Added support for producing & consuming telemetry outputs
- Split tests into groups for parallel execution in CI/CD pipeline
- Change signature of Experiment.summary()
- Expose first_device parameter for scripts, functions, models
- Added support for MINBATCHTIMEOUT in model execution
- Remove support for RedisAI 1.2.5, use RedisAI 1.2.7 commit
- Add support for multiple databases
Detailed Notes
- Several conflicting directives between the setup.py and the setup.cfg were fixed to mitigate warnings issued when building the pip wheel. (SmartSim-PR435)
- When the Slurm functions sacct and sstat returned an error, it would be ignored and SmartSim's state could become inconsistent. To prevent this, errors raised by sacct or sstat now result in an exception. (SmartSim-PR392)
- A section named ML Features was added to documentation. It contains multiple examples of how ML models and functions can be added to and executed on the DB. TorchScript-based post-processing was added to the Online Analysis tutorial. (SmartSim-PR411)
- An example of how to use multiple Orchestrators concurrently was added to the documentation. (SmartSim-PR409)
- The test infrastructure was improved. Tests on HPC systems are now stable, and issues such as non-stopped Orchestrators or experiments created in the wrong paths have been fixed. (SmartSim-PR381)
- A telemetry monitor was added to check updates and produce events for SmartDashboard. (SmartSim-PR426)
- Split tests into group_a, group_b, slow_tests for parallel execution in CI/CD pipeline. (SmartSim-PR417, SmartSim-PR424)
- Change format argument to style in Experiment.summary(); this is an API break. (SmartSim-PR391)
- Added support for first_device parameter for scripts, functions, and models. This causes them to be loaded to the first num_devices beginning with first_device. (SmartSim-PR394)
- Added support for MINBATCHTIMEOUT in model execution, which caps the delay waiting for a minimum number of model execution operations to accumulate before executing them as a batch. (SmartSim-PR387)
- RedisAI 1.2.5 is no longer supported. The only supported RedisAI version is now 1.2.7. Since the officially released RedisAI 1.2.7 has a bug which breaks the build process on Mac OSX, it was decided to use commit 634916c from RedisAI's GitHub repository, where the bug has been fixed. This applies to all operating systems. (SmartSim-PR383)
- Add support for creation of multiple databases with unique identifiers. (SmartSim-PR342)
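The first_device note above (SmartSim-PR394) can be read as selecting a contiguous range of device indices. The sketch below illustrates that reading only. The function name select_devices and the "GPU:<index>" label format are assumptions made for illustration and are not taken from SmartSim's source.

```python
from typing import List


def select_devices(device: str, num_devices: int, first_device: int = 0) -> List[str]:
    """Return labels for num_devices devices starting at first_device.

    Hypothetical helper; the label format is an assumption.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if first_device < 0:
        raise ValueError("first_device must be non-negative")
    # Contiguous range: first_device, first_device + 1, ...
    return [f"{device}:{idx}" for idx in range(first_device, first_device + num_devices)]


print(select_devices("GPU", num_devices=2, first_device=1))  # ['GPU:1', 'GPU:2']
```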
v0.5.1
What's Changed
- Refactor smart cli into subparsers by @ankona in #308
- Fix Frontier code block in doc by @ashao in #321
- Update, Apply, and Automate Python Linting by @ankona in #311
- Avoid using shell=True by @ankona in #327
- Fix for incorrect logging message format/args by @ankona in #330
- Alter launchers to pass env when starting a local step by @ankona in #329
- Raise error for inconsistent add_ml_model and add_script parameters by @juliaputko in #324
- Raising error for reserved keywords under function parameter options in get_allocation by @juliaputko in #325
- Fix bug in logging msg format string by @ankona in #332
- Log sacct failures by @ankona in #331
- Added PR324 and PR325 to changelog by @juliaputko in #333
- Add more tests for RAI_PATH and lib path interactions by @ankona in #328
- Enable mypy generic-related checks by @ankona in #338
- Fix colocated db preparation bug when using JsrunSettings by @ankona in #339
- Ensemble documentation update by @billschereriii in #322
- Mitigate suppressed protected-access errors from pylint by @ankona in #341
- Apply typehints to smartsim._core.launcher.step.* by @ankona in #334
- Add missing changelog entries by @ankona in #345
- Add support for Slurm heterogeneous jobs by @al-rigazzi in #346
- Remove ensemble generation from DB Object tests by @al-rigazzi in #349
- Integrate PalsMpiexecSettings into Experiment factory methods by @MattToast in #343
- Smart Info by @MattToast in #350
- smart validate should not hang when error in TF process by @MattToast in #351
- Print attached files by @al-rigazzi in #352
- Update documentation surrounding contributions by @ashao in #344
- Remove references in docs to nonexistent CLI flag by @MattToast in #358
- Pretty print error message when onnx wheel not available by @MattToast in #359
- Update cibuildwheel version by @MattToast in #360
- Update changelog for release by @MattToast in #361
- Version Bump by @MattToast in #362
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Released on 6 July 2023
Description
A full list of changes and detailed notes can be found below:
- Update SmartRedis dependency to v0.4.1
- Fix tests for db models and scripts
- Fix add_ml_model() and add_script() documentation, tests, and code
- Remove requirements.txt and other places where dependencies were defined
- Replace limit_app_cpus with limit_db_cpus for co-located orchestrators
- Remove wait time associated with Experiment launch summary
- Update and rename Redis conf file
- Migrate from redis-py-cluster to redis-py
- Update full test suite to not require a TF wheel at test time
- Update doc strings
- Remove deprecated code
- Relax the coloredlogs version
- Update Fortran tutorials for SmartRedis
- Add support for multiple network interface binding in Orchestrator and Colocated DBs
- Add typehints and static analysis
Detailed notes
- Updates SmartRedis to the most current release (PR316)
- Fixes and enhancements to documentation (PR317, PR314, PR287)
- Various fixes and enhancements to the test suite (PR315, PR312, PR310, PR302, PR283)
- Fix a defect in the tests related to database models and scripts that was causing key collisions when testing on workload managers (PR313)
- Remove requirements.txt and other places where dependencies were defined. (PR307)
- Fix defect where dictionaries used to create run settings can be changed unexpectedly due to copy-by-ref (PR305)
- The underlying code for Model.add_ml_model() and Model.add_script() was fixed to correctly handle multi-GPU configurations. Tests were updated to run on non-local launchers. Documentation was updated and fixed. Also, the default testing interface has been changed to lo instead of ipogif. (PR304)
- Typehints have been added. A makefile target make check-mypy executes static analysis with mypy. (PR295, PR301, PR303)
- Replace limit_app_cpus with limit_db_cpus for co-located orchestrators. This resolves some incorrect behavior/assumptions about how the application would be pinned. Instead, users should directly specify the binding options in their application using the options appropriate for their launcher (PR306)
- Simplify code in random_permutations parameter generation strategy (PR300)
- Remove wait time associated with Experiment launch summary (PR298)
- Update Redis conf file to conform with Redis v7.0.5 conf file (PR293)
- Migrate from redis-py-cluster to redis-py for cluster status checks (PR292)
- Update full test suite to no longer require a tensorflow wheel to be available at test time. (PR291)
- Correct spelling of colocated in doc strings (PR290)
- Deprecated launcher-specific orchestrators, constants, and ML utilities were removed. (PR289)
- Relax the coloredlogs version to be greater than 10.0 (PR288)
- Update the Github Actions runner image from macos-10.15 to macos-12. The former began deprecation in May 2022 and was finally removed in May 2023. (PR285)
- The Fortran tutorials had not been fully updated to show how to handle return/error codes. These have now all been updated. (PR284)
- Orchestrator and Colocated DB now accept a list of interfaces to bind to. The argument name is still interface for backward compatibility reasons. (PR281)
- Typehints have been added to public APIs. A makefile target to execute static analysis with mypy is available: make check-mypy (PR295)
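The copy-by-ref defect noted above (PR305) is a common Python pitfall: an object that stores a caller's dictionary without copying it shares every later mutation with the caller. The sketch below reproduces the pattern with two hypothetical classes, SharedSettings and CopiedSettings. It does not show SmartSim's actual run settings code.

```python
import copy
from typing import Dict


class SharedSettings:
    """Hypothetical settings class that keeps a reference to the caller's dict."""

    def __init__(self, run_args: Dict[str, str]) -> None:
        self.run_args = run_args  # aliasing: same object as the caller's


class CopiedSettings:
    """Hypothetical settings class that takes its own deep copy."""

    def __init__(self, run_args: Dict[str, str]) -> None:
        self.run_args = copy.deepcopy(run_args)


args = {"nodes": "1"}
shared = SharedSettings(args)
copied = CopiedSettings(args)

args["nodes"] = "4"  # the caller reuses and mutates the dict later

print(shared.run_args["nodes"])  # 4 (changed unexpectedly)
print(copied.run_args["nodes"])  # 1 (isolated from the caller)
```

A deep copy is used here rather than a shallow one so that nested values, such as lists of arguments, are isolated as well.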
v0.4.2
Released on April 12, 2023
Description
This release of SmartSim had a focus on polishing and extending existing
features already provided by SmartSim. Most notably, this release
provides support to allow users to colocate their models with an
orchestrator using Unix domain sockets and support for launching models
as batch jobs.
Additionally, SmartSim has updated its tool chains to provide a better
user experience. Notably, SmartSim can now be used with Python 3.10,
Redis 7.0.5, and RedisAI 1.2.7. Furthermore, SmartSim now utilizes
SmartRedis's aggregation lists to streamline the use and extension of
ML data loaders, making working with popular machine learning frameworks
in SmartSim a breeze.
A full list of changes and detailed notes can be found below:
- Add support for colocating an orchestrator over UDS
- Add support for Python 3.10, deprecate support for Python 3.7 and RedisAI 1.2.3
- Drop support for Ray
- Update ML data loaders to make use of SmartRedis's aggregation lists
- Allow for models to be launched independently as batch jobs
- Update Redis to version 7.0.5
- Add support for RedisAI 1.2.7, pyTorch 1.11.0, Tensorflow 2.8.0, ONNXRuntime 1.11.1
- Fix bug in colocated database entrypoint when loading PyTorch models
- Fix test suite behavior with environment variables
Detailed Notes
- Running some tests could result in some SmartSim-specific environment variables being set. Such environment variables are now reset after each test execution. Also, a warning for environment variable usage in Slurm was added, to make the user aware in case an environment variable will not be assigned the desired value with --export. (PR270)
- The PyTorch and TensorFlow data loaders were updated to make use of aggregation lists. This breaks their API, but makes them easier to use. (PR264)
- The support for Ray was dropped, as its most recent versions caused problems when deployed through SmartSim. We plan to release a separate add-on library to accomplish the same results. If you are interested in getting the Ray launch functionality back in your workflow, please get in touch with us! (PR263)
- Update from Redis version 6.0.8 to 7.0.5. (PR258)
- Adds support for Python 3.10 without the ONNX machine learning backend. Deprecates support for Python 3.7 as it will stop receiving security updates. Deprecates support for RedisAI 1.2.3. Update the build process to be able to correctly fetch supported dependencies. If a user attempts to build an unsupported dependency, an error message is shown highlighting the discrepancy. (PR256)
- Models were given a batch_settings attribute. When launching a model through Experiment.start the Experiment will first check for a non-nullish value at that attribute. If the check is satisfied, the Experiment will attempt to wrap the underlying run command in a batch job using the object referenced at Model.batch_settings as the batch settings for the job. If the check is not satisfied, the Model is launched in the traditional manner as a job step. (PR245)
- Fix bug in colocated database entrypoint stemming from uninitialized variables. This bug affects PyTorch models being loaded into the database. (PR237)
- The release of RedisAI 1.2.7 allows us to update support for recent versions of PyTorch, Tensorflow, and ONNX (PR234)
- Make installation of correct Torch backend more reliable according to instruction from PyTorch
- In addition to TCP, add UDS support for colocating an orchestrator with models. Methods Model.colocate_db_tcp and Model.colocate_db_uds were added to expose this functionality. The Model.colocate_db method remains and uses TCP for backward compatibility (PR246)
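The batch_settings behavior described above (PR245) amounts to a dispatch on whether the attribute is set. The following sketch models that decision with stand-in classes. Model, BatchSettings, and launch_mode here are simplified, hypothetical stand-ins and are not SmartSim's real classes or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchSettings:
    """Stand-in for a batch settings object (hypothetical)."""

    nodes: int = 1


@dataclass
class Model:
    """Stand-in for a model entity with an optional batch_settings attribute."""

    name: str
    batch_settings: Optional[BatchSettings] = None


def launch_mode(model: Model) -> str:
    """Decide how a model would be launched, mirroring the note above."""
    if model.batch_settings is not None:
        # The run command would be wrapped in a batch job that uses
        # model.batch_settings as the settings for that job.
        return "batch job"
    # No batch settings: launch in the traditional manner.
    return "job step"


print(launch_mode(Model("sim", BatchSettings(nodes=2))))  # batch job
print(launch_mode(Model("sim")))  # job step
```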
v0.4.1
Released on June 24, 2022
Description: This release of SmartSim introduces a new experimental feature to help make SmartSim workflows more portable: the ability to run simulation models in a container via Singularity. This feature has been tested on a small number of platforms and we encourage users to provide feedback on its use.
We have also made improvements in a variety of areas: new utilities to load scripts and machine learning models into the database directly from SmartSim driver scripts and install-time choice to use either KeyDB or Redis for the Orchestrator. The RunSettings API is now more consistent across subclasses. Another key focus of this release was to aid new SmartSim users by including more extensive tutorials and improving the documentation. The docker image containing the SmartSim tutorials now also includes a tutorial on online training.
Launcher improvements
- New methods for specifying RunSettings parameters (SmartSim-PR166) (SmartSim-PR170)
- Better support for mpirun, mpiexec, and orterun as launchers (SmartSim-PR186)
- Experimental: add support for running models via Singularity (SmartSim-PR204)
Documentation and tutorials
- Tutorial updates (SmartSim-PR155) (SmartSim-PR203) (SmartSim-PR208)
- Add SmartSim Zoo info to documentation (SmartSim-PR175)
- New tutorial for demonstrating online training (SmartSim-PR176) (SmartSim-PR188)
General improvements and bug fixes
- Set models and scripts at the driver level (SmartSim-PR185)
- Optionally use KeyDB for the orchestrator (SmartSim-PR180)
- Ability to specify system-level libraries (SmartSim-PR154) (SmartSim-PR182)
- Fix the handling of LSF gpus_per_shard (SmartSim-PR164)
- Fix error when re-running smart build (SmartSim-PR165)
- Fix generator hanging when tagged configuration variables are missing (SmartSim-PR177)
Dependency updates
- CMake version from 3.10 to 3.13 (SmartSim-PR152)
- Update click to 8.0.2 (SmartSim-PR200)
v0.4.0
Released on Feb 11, 2022
Description: In this release SmartSim continues to promote ease of use. To this end SmartSim has introduced new portability features that allow users to abstract away their targeted hardware, while providing even more compatibility with existing libraries.
A new feature, co-located orchestrator deployments, has been added. It provides scalable online inference capabilities that overcome previous performance limitations in separated orchestrator/application deployments. For more information on advantages of co-located deployments, see the Orchestrator section of the SmartSim documentation.
The SmartSim build was significantly improved to increase customization of build toolchain and the smart command line interface was expanded.
Additional tweaks and upgrades have also been made to ensure an optimal experience. Here is a comprehensive list of changes made in SmartSim 0.4.0.
Orchestrator Enhancements:
Emphasize Driver Script Portability:
- Add ability to create run settings through an experiment (PR110)
- Add ability to create batch settings through an experiment (PR112)
- Add automatic launcher detection to experiment portability functions (PR120)
Expand Machine Learning Library Support:
- Data loaders for online training in Keras/TF and Pytorch (PR115)(PR140)
- ML backend versions updated with expanded support for multiple versions (PR122)
- Launch Ray internally using RunSettings (PR118)
- Add Ray cluster setup and deployment to SmartSim (PR50)
Expand Launcher Setting Options:
- Add ability to use base RunSettings on Slurm, PBS, or Cobalt launchers (PR90)
- Add ability to use base RunSettings on LSF launcher (PR108)
Deprecations and Breaking Changes
- Orchestrator classes combined into single implementation for portability (PR139)
- smartsim.constants changed to smartsim.status (PR122)
- smartsim.tf migrated to smartsim.ml.tf (PR115)(PR140)
- TOML configuration option removed in favor of environment variable approach (PR122)
General Improvements and Bug Fixes:
- Improve and extend parameter handling (PR107)(PR119)
- Abstract away non-user facing implementation details (PR122)
- Add various dimensions to the CI build matrix for SmartSim testing (PR130)
- Add missing functions to LSFSettings API (PR113)
- Add RedisAI checker for installed backends (PR137)
- Remove heavy and unnecessary dependencies (PR116)(PR132)
- Fix LSFLauncher and LSFOrchestrator (PR86)
- Fix over greedy Workload Manager Parsers (PR95)
- Fix Slurm handling of comma-separated env vars (PR104)
- Fix internal method calls (PR138)
Documentation Updates: