Releases: CrayLabs/SmartSim

v0.8.0

27 Sep 21:42
528c1ae

Released on 27 September, 2024

What's Changed

Features

  • Implement support for SGE by @ashao in #610
  • Add ability to specify hardware policies on dragon run requests by @ankona in #638
  • Added type checking to params on model by @juliaputko in #676

Bug Fixes

  • Fix build error caused by use of deprecated pkg_resources by @ankona in #598
  • Mitigate dragon/numpy, mypy/typing_extension dependency issues by @ankona in #653

Build Improvements

Miscellaneous Improvements

  • Update tutorials and tutorial containers by @al-rigazzi in #589
  • Fix util-tests outputs appearing in root directory by @ankona in #614
  • Restrict to numpy 1.x by @ashao in #623
  • Remove broken redis documentation links by @ankona in #627
  • More easily discoverable dependencies by @ashao in #635
  • Update codecov to 4.5.0 by @mellis13 in #657
  • Pin watchdog version to prevent mypy errors by @ashao in #690
  • Refine install documentation for Perlmutter and Frontier by @ashao in #717
  • Change 'conda activate' to 'source activate' for Frontier by @ashao in #719
  • Make a user-specific db cache by @ashao in #727
  • Update release action to remove CI Build Wheel by @MattToast in #728

New Contributors

Full Changelog: v0.7.0...v0.8.0

v0.7.0

15 May 00:40
5039699

Released on 14 May, 2024

What's Changed

Features

Bug Fixes

API Breaks

Miscellaneous Improvements

Full Changelog: v0.6.2...v0.7.0

v0.6.2

16 Feb 07:54
5444927

Released on 16 February, 2024

Description

  • Patch SmartSim dependency version

Detailed Notes

  • A critical performance concern was identified and addressed in SmartRedis. A patch fix was deployed, and SmartSim was updated to ensure users do not inadvertently pull the unpatched version of SmartRedis. (SmartSim-PR493)

v0.6.1

15 Feb 23:18
8b742ec

Released on 15 February, 2024

Description

  • Duplicate for DBModel/Script prevented
  • Update license to include 2024
  • Telemetry monitor is now active by default
  • Add support for Mac OSX on Apple Silicon
  • Remove Torch warnings during testing
  • Validate Slurm timing format
  • Expose Python Typehints
  • Fix test_logs to prevent generation of directory
  • Fix Python Typehint for colocated database settings
  • Python 3.11 Support
  • Quality of life smart validate improvements
  • Remove Cobalt support
  • Enrich logging through context variables
  • Upgrade Machine Learning dependencies
  • Override sphinx-tabs background color
  • Add concurrency group to test workflow
  • Fix index when installing torch through smart build
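
As an illustration of the Slurm timing validation listed above, here is a minimal sketch. It assumes the walltime is passed as an HH:MM:SS string; the function name validate_slurm_time is hypothetical and this is not SmartSim's actual implementation.

```python
import re

# Hypothetical helper for illustration only (not SmartSim's code).
# Assumption: walltime is given as HH:MM:SS; hours may use more than two digits.
_TIME_RE = re.compile(r"(\d+):([0-5]\d):([0-5]\d)")


def validate_slurm_time(time_str: str) -> str:
    """Return time_str unchanged if it matches HH:MM:SS, else raise ValueError."""
    if _TIME_RE.fullmatch(time_str) is None:
        raise ValueError(
            f"Invalid Slurm time format: {time_str!r} (expected HH:MM:SS)"
        )
    return time_str
```

Rejecting a malformed value before the allocation request is sent gives the user an immediate, readable error instead of a failure reported later by the scheduler.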

Detailed Notes

  • Modify the git clone for both Redis and RedisAI to set the line endings to unix-style line endings when using MacOS on ARM. (SmartSim-PR482)
  • Separate install instructions are now provided for Mac OSX on x64 vs ARM64 (SmartSim-PR479)
  • Prevent duplicate ML model and script names from being added to an Ensemble member if the names already exist. (SmartSim-PR475)
  • Updates Copyright (c) 2021-2023 to Copyright (c) 2021-2024 in all of the necessary files. (SmartSim-PR485)
  • Fix a bug that prevented the expected behavior when the SMARTSIM_LOG_LEVEL environment variable was set to developer. (SmartSim-PR473)
  • Sets the default value of the "enable telemetry" flag to on. Bumps the output manifest.json version number to match that of smartdashboard and pins a watchdog version to avoid build errors. (SmartSim-PR477)
  • Refactor logic of Manifest.has_db_objects to remove excess branching and improve readability/maintainability. (SmartSim-PR476)
  • SmartSim can now be built and used on platforms using Apple Silicon (ARM64). Currently, only the PyTorch backend is supported. Note that libtorch will be downloaded from a CrayLabs GitHub repo. (SmartSim-PR465)
  • Tests that were saving Torch models were emitting warnings. These warnings were addressed by updating the model save test function. (SmartSim-PR472)
  • Validate the timing format when requesting a slurm allocation. (SmartSim-PR471)
  • Add and ship py.typed marker to expose inline type hints. Fix type errors related to SmartRedis. (SmartSim-PR468)
  • Fix the test_logs.py::test_context_leak test that was erroneously creating a directory named some value in SmartSim's root directory. (SmartSim-PR467)
  • Add Python type hinting to colocated settings. (SmartSim-PR462)
  • Add GitHub Actions for running black and isort checks. (SmartSim-PR464)
  • Relax the required version of typing_extensions. (SmartSim-PR459)
  • Addition of Python 3.11 to SmartSim. (SmartSim-PR461)
  • Quality-of-life improvements to smart validate: set the CUDA_VISIBLE_DEVICES environment variable within smart validate before importing any ML dependencies, to prevent false negatives on multi-GPU systems; move SmartRedis logs from standard output to a dedicated log file in the validation temporary directory; suppress an sklearn deprecation warning by pinning a KMeans constructor argument; and run the TF test last, as TF may reserve the GPUs it uses. (SmartSim-PR458)
  • Some actions in the current GitHub CI/CD workflows were outdated. They were replaced with the latest versions. (SmartSim-PR446)
  • As the Cobalt workload manager is not used on any system we are aware of, its support in SmartSim was terminated and classes such as CobaltLauncher have been removed. (SmartSim-PR448)
  • Experiment logs are written to a file that can be read by the dashboard. (SmartSim-PR452)
  • Updated SmartSim's machine learning backends to PyTorch 2.0.1, Tensorflow 2.13.1, ONNX 1.14.1, and ONNX Runtime 1.16.1. As a result of this change, there is now an available ONNX wheel for use with Python 3.10, and wheels for all of SmartSim's machine learning backends with Python 3.11. (SmartSim-PR451) (SmartSim-PR461)
  • The sphinx-tabs documentation extension uses a white background for the tabs component. A custom CSS for those components to inherit the overall theme color has been added. (SmartSim-PR453)
  • Add concurrency groups to GitHub's CI/CD workflows, preventing multiple workflows from the same PR from being launched concurrently. (SmartSim-PR439)
  • Torch changed their preferred indexing when trying to install their provided wheels. Updated the pip install command within smart build to ensure that the appropriate packages can be found. (SmartSim-PR449)
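
The smart validate note above (SmartSim-PR458) depends on ordering: CUDA_VISIBLE_DEVICES has to be set before any ML framework is imported. A minimal sketch of that pattern follows. The helper name pin_visible_gpu and the default device index 0 are assumptions for illustration, not SmartSim's code.

```python
import os
from typing import MutableMapping, Optional


def pin_visible_gpu(
    device_index: int = 0,
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Restrict CUDA to one device by setting CUDA_VISIBLE_DEVICES.

    Hypothetical helper; `env` defaults to os.environ and can be replaced
    by a plain dict when experimenting.
    """
    target = os.environ if env is None else env
    target["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return target["CUDA_VISIBLE_DEVICES"]


# Call pin_visible_gpu() first and only then import torch or tensorflow:
# a framework that has already initialized CUDA will not see the change.
```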

v0.6.0

18 Dec 21:07
9d97397

Released on 18 December, 2023

Description

  • Conflicting directives in the SmartSim packaging instructions were
    fixed
  • sacct and sstat errors are now fatal for Slurm-based workflow
    executions
  • Added documentation section about ML features and TorchScript
  • Added TorchScript functions to Online Analysis tutorial
  • Added multi-DB example to documentation
  • Improved test stability on HPC systems
  • Added support for producing & consuming telemetry outputs
  • Split tests into groups for parallel execution in CI/CD pipeline
  • Change signature of Experiment.summary()
  • Expose first_device parameter for scripts, functions, models
  • Added support for MINBATCHTIMEOUT in model execution
  • Remove support for RedisAI 1.2.5, use RedisAI 1.2.7 commit
  • Add support for multiple databases

Detailed Notes

  • Several conflicting directives between the setup.py and the
    setup.cfg were fixed to mitigate warnings issued when building the
    pip wheel. (SmartSim-PR435)
  • When the Slurm functions sacct and sstat returned an error, it
    would be ignored and SmartSim's state could become inconsistent. To
    prevent this, errors raised by sacct or sstat now result in an
    exception. (SmartSim-PR392)
  • A section named ML Features was added to documentation. It
    contains multiple examples of how ML models and functions can be
    added to and executed on the DB. TorchScript-based post-processing
    was added to the Online Analysis tutorial
    (SmartSim-PR411)
  • An example of how to use multiple Orchestrators concurrently was
    added to the documentation
    (SmartSim-PR409)
  • The test infrastructure was improved. Tests on HPC systems are now
    stable, and issues such as non-stopped Orchestrators or experiments
    created in the wrong paths have been fixed (SmartSim-PR381)
  • A telemetry monitor was added to check updates and produce events
    for SmartDashboard
    (SmartSim-PR426)
  • Split tests into group_a, group_b, and slow_tests for parallel
    execution in CI/CD pipeline (SmartSim-PR417, SmartSim-PR424)
  • Change the format argument to style in Experiment.summary(); this
    is an API break (SmartSim-PR391)
  • Added support for first_device parameter for scripts, functions, and
    models. This causes them to be loaded to the first num_devices
    beginning with first_device
    (SmartSim-PR394)
  • Added support for MINBATCHTIMEOUT in model execution, which caps the
    delay waiting for a minimum number of model execution operations to
    accumulate before executing them as a batch
    (SmartSim-PR387)
  • RedisAI 1.2.5 is no longer supported. The only supported RedisAI
    version is now 1.2.7. Since the officially released RedisAI 1.2.7
    has a bug which breaks the build process on Mac OSX, it was decided
    to use commit 634916c from RedisAI's GitHub repository, where this
    bug has been fixed. This applies to all operating systems.
    (SmartSim-PR383)
  • Add support for creation of multiple databases with unique
    identifiers.
    (SmartSim-PR342)
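
To make the first_device note above (SmartSim-PR394) concrete, the sketch below expands a device type, a starting index, and a count into the list of devices that would be targeted. The helper name target_devices and the "GPU:N" naming scheme are assumptions for illustration, not SmartSim's internal code.

```python
from typing import List


def target_devices(
    device: str, first_device: int = 0, num_devices: int = 1
) -> List[str]:
    """List the num_devices devices beginning with first_device.

    Hypothetical helper; device names follow an assumed "GPU:N" scheme.
    """
    if first_device < 0 or num_devices < 1:
        raise ValueError("first_device must be >= 0 and num_devices >= 1")
    kind = device.upper()
    if kind == "CPU":
        return ["CPU"]  # assumption: CPU placement is not indexed
    return [f"{kind}:{i}" for i in range(first_device, first_device + num_devices)]
```

For example, target_devices("gpu", first_device=2, num_devices=3) yields GPU:2, GPU:3, and GPU:4, leaving GPU:0 and GPU:1 free for other work.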

v0.5.1

15 Sep 15:43
e1a5783

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0

07 Jul 00:26
a9018e0

Released on 6 July 2023

Description

A full list of changes and detailed notes can be found below:

  • Update SmartRedis dependency to v0.4.1
  • Fix tests for db models and scripts
  • Fix add_ml_model() and add_script() documentation, tests, and code
  • Remove requirements.txt and other places where
    dependencies were defined
  • Replace limit_app_cpus with
    limit_db_cpus for co-located orchestrators
  • Remove wait time associated with Experiment launch summary
  • Update and rename Redis conf file
  • Migrate from redis-py-cluster to redis-py
  • Update full test suite to not require a TF wheel at test time
  • Update doc strings
  • Remove deprecated code
  • Relax the coloredlogs version
  • Update Fortran tutorials for SmartRedis
  • Add support for multiple network interface binding in Orchestrator
    and Colocated DBs
  • Add typehints and static analysis

Detailed notes

  • Updates SmartRedis to the most current release (PR316)
  • Fixes and enhancements to documentation (PR317, PR314, PR287)
  • Various fixes and enhancements to the test suite
    (PR315, PR312, PR310, PR302, PR283)
  • Fix a defect in the tests related to database models and scripts
    that was causing key collisions when testing on workload managers (PR313)
  • Remove requirements.txt and other places where
    dependencies were defined. (PR307)
  • Fix defect where dictionaries used to create run settings can be
    changed unexpectedly due to copy-by-ref (PR305)
  • The underlying code for Model.add_ml_model() and Model.add_script()
    was fixed to correctly handle multi-GPU configurations. Tests were
    updated to run on non-local launchers. Documentation was updated and
    fixed. Also, the default testing interface has been changed to lo
    instead of ipogif. (PR304)
  • Typehints have been added. A makefile target make check-mypy
    executes static analysis with mypy. (PR295, PR301, PR303)
  • Replace limit_app_cpus with
    limit_db_cpus for co-located orchestrators. This
    resolves some incorrect behavior/assumptions about how the
    application would be pinned. Instead, users should directly specify
    the binding options in their application using the options
    appropriate for their launcher (PR306)
  • Simplify code in random_permutations parameter
    generation strategy (PR300)
  • Remove wait time associated with Experiment launch summary (PR298)
  • Update Redis conf file to conform with Redis v7.0.5 conf file (PR293)
  • Migrate from redis-py-cluster to redis-py for cluster status checks (PR292)
  • Update full test suite to no longer require a tensorflow wheel to be available at test time. (PR291)
  • Correct spelling of colocated in doc strings (PR290)
  • Deprecated launcher-specific orchestrators, constants, and ML
    utilities were removed. (PR289)
  • Relax the coloredlogs version to be greater than 10.0 (PR288)
  • Update the GitHub Actions runner image from
    macos-10.15 to macos-12. The former
    began deprecation in May 2022 and was finally removed in May 2023. (PR285)
  • The Fortran tutorials had not been fully updated to show how to
    handle return/error codes. These have now all been updated. (PR284)
  • Orchestrator and Colocated DB now accept a list of interfaces to
    bind to. The argument name is still interface for
    backward compatibility reasons. (PR281)
  • Typehints have been added to public APIs. A makefile target, make
    check-mypy, is available to execute static analysis with mypy (PR295)
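
The copy-by-ref defect noted above (PR305) is a common Python pitfall: storing a caller's dictionary directly means later edits to one object show up in the other. A minimal sketch of the defensive-copy pattern follows. The class SettingsSketch is hypothetical and stands in for a run settings object; it is not SmartSim's implementation.

```python
import copy
from typing import Any, Dict, Optional


class SettingsSketch:
    """Hypothetical stand-in for a run settings class."""

    def __init__(self, run_args: Optional[Dict[str, Any]] = None) -> None:
        # Deep-copy so edits to these settings never leak back into the
        # dictionary the caller passed in (or into other settings objects
        # created from the same dictionary).
        self.run_args: Dict[str, Any] = copy.deepcopy(run_args) if run_args else {}
```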

v0.4.2

12 Apr 20:48
7e04b09

Released on April 12, 2023

Description

This release of SmartSim had a focus on polishing and extending existing
features already provided by SmartSim. Most notably, this release
provides support to allow users to colocate their models with an
orchestrator using Unix domain sockets and support for launching models
as batch jobs.

Additionally, SmartSim has updated its tool chains to provide a better
user experience. Notably, SmartSim can now be used with Python 3.10,
Redis 7.0.5, and RedisAI 1.2.7. Furthermore, SmartSim now utilizes
SmartRedis's aggregation lists to streamline the use and extension of
ML data loaders, making working with popular machine learning frameworks
in SmartSim a breeze.

A full list of changes and detailed notes can be found below:

  • Add support for colocating an orchestrator over UDS
  • Add support for Python 3.10, deprecate support for Python 3.7 and
    RedisAI 1.2.3
  • Drop support for Ray
  • Update ML data loaders to make use of SmartRedis's aggregation
    lists
  • Allow for models to be launched independently as batch jobs
  • Update Redis to version 7.0.5
  • Add support for RedisAI 1.2.7, PyTorch 1.11.0, Tensorflow 2.8.0,
    ONNXRuntime 1.11.1
  • Fix bug in colocated database entrypoint when loading PyTorch models
  • Fix test suite behavior with environment variables

Detailed Notes

  • Running some tests could result in some SmartSim-specific
    environment variables being set. Such environment variables are now
    reset after each test execution. Also, a warning about environment
    variable usage in Slurm was added, to make the user aware when an
    environment variable will not be assigned the desired value with
    --export. (PR270)
  • The PyTorch and TensorFlow data loaders were updated to make use of
    aggregation lists. This breaks their API, but makes them easier to
    use. (PR264)
  • The support for Ray was dropped, as its most recent versions caused
    problems when deployed through SmartSim. We plan to release a
    separate add-on library to accomplish the same results. If you are
    interested in getting the Ray launch functionality back in your
    workflow, please get in touch with us!
    (PR263)
  • Update from Redis version 6.0.8 to 7.0.5.
    (PR258)
  • Adds support for Python 3.10 without the ONNX machine learning
    backend. Deprecates support for Python 3.7 as it will stop receiving
    security updates. Deprecates support for RedisAI 1.2.3. Update the
    build process to be able to correctly fetch supported dependencies.
    If a user attempts to build an unsupported dependency, an error
    message is shown highlighting the discrepancy.
    (PR256)
  • Models were given a batch_settings attribute. When launching a model
    through Experiment.start, the Experiment will first check for a
    non-null value at that attribute. If the check is satisfied, the
    Experiment will attempt to wrap the underlying run command in a
    batch job using the object referenced at Model.batch_settings as the
    batch settings for the job. If the check is not satisfied, the Model
    is launched in the traditional manner as a job step. (PR245)
  • Fix bug in colocated database entrypoint stemming from uninitialized
    variables. This bug affects PyTorch models being loaded into the
    database. (PR237)
  • The release of RedisAI 1.2.7 allows us to update support for recent
    versions of PyTorch, Tensorflow, and ONNX
    (PR234)
  • Make installation of the correct Torch backend more reliable,
    following instructions from PyTorch
  • In addition to TCP, add UDS support for colocating an orchestrator
    with models. Methods Model.colocate_db_tcp and Model.colocate_db_uds
    were added to expose this functionality. The Model.colocate_db
    method remains and uses TCP for backward compatibility (PR246)
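
The batch_settings behavior described above (PR245) amounts to a simple dispatch: a model with batch settings is wrapped in a batch job, and any other model is launched as a job step. The sketch below shows only that decision. The classes ModelSketch and BatchSettingsSketch and the function launch_mode are hypothetical stand-ins, not SmartSim classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchSettingsSketch:
    """Hypothetical stand-in for a batch settings object."""

    nodes: int = 1


@dataclass
class ModelSketch:
    """Hypothetical stand-in for a model with optional batch settings."""

    name: str
    batch_settings: Optional[BatchSettingsSketch] = None


def launch_mode(model: ModelSketch) -> str:
    """Return how the model would be launched under the rule described above."""
    if model.batch_settings is not None:
        return "batch_job"  # wrap the run command in a batch job
    return "job_step"  # traditional launch as a job step
```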

v0.4.1

25 Jun 00:32
e804ad0

Released on June 24, 2022

Description: This release of SmartSim introduces a new experimental feature to help make SmartSim workflows more portable: the ability to run simulation models in a container via Singularity. This feature has been tested on a small number of platforms and we encourage users to provide feedback on its use.

We have also made improvements in a variety of areas: new utilities to load scripts and machine learning models into the database directly from SmartSim driver scripts and install-time choice to use either KeyDB or Redis for the Orchestrator. The RunSettings API is now more consistent across subclasses. Another key focus of this release was to aid new SmartSim users by including more extensive tutorials and improving the documentation. The docker image containing the SmartSim tutorials now also includes a tutorial on online training.

Launcher improvements

Documentation and tutorials

General improvements and bug fixes

Dependency updates

v0.4.0

12 Feb 23:17

Released on Feb 11, 2022

Description: In this release SmartSim continues to promote ease of use. To this end SmartSim has introduced new portability features that allow users to abstract away their targeted hardware, while providing even more compatibility with existing libraries.

A new feature, co-located orchestrator deployments, has been added. It provides scalable online inference capabilities that overcome previous performance limitations in separated orchestrator/application deployments. For more information on the advantages of co-located deployments, see the Orchestrator section of the SmartSim documentation.

The SmartSim build was significantly improved to increase customization of the build toolchain, and the smart command line interface was expanded.

Additional tweaks and upgrades have also been made to ensure an optimal experience. Here is a comprehensive list of changes made in SmartSim 0.4.0.

Orchestrator Enhancements:

  • Add Orchestrator Co-location (PR139)
  • Add Orchestrator configuration file edit methods (PR109)

Emphasize Driver Script Portability:

  • Add ability to create run settings through an experiment (PR110)
  • Add ability to create batch settings through an experiment (PR112)
  • Add automatic launcher detection to experiment portability functions (PR120)
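
As a rough picture of what automatic launcher detection can look like, the sketch below probes for well-known scheduler commands on the PATH. The function name, the command-to-launcher mapping, and the "local" fallback are all assumptions for illustration; they are not taken from SmartSim's implementation.

```python
import shutil
from typing import Callable, Optional

# Assumed mapping from launcher name to a command that scheduler provides.
_CANDIDATES = (("slurm", "sbatch"), ("pbs", "qsub"), ("lsf", "bsub"))


def detect_launcher_sketch(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the first launcher whose command is found, else "local"."""
    for launcher, command in _CANDIDATES:
        if which(command):
            return launcher
    return "local"
```

Passing the lookup function as a parameter keeps the sketch easy to exercise on a machine with no scheduler installed.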

Expand Machine Learning Library Support:

  • Data loaders for online training in Keras/TF and PyTorch (PR115)(PR140)
  • ML backend versions updated with expanded support for multiple versions (PR122)
  • Launch Ray internally using RunSettings (PR118)
  • Add Ray cluster setup and deployment to SmartSim (PR50)

Expand Launcher Setting Options:

  • Add ability to use base RunSettings on Slurm, PBS, or Cobalt launchers (PR90)
  • Add ability to use base RunSettings on LSF launcher (PR108)

Deprecations and Breaking Changes

  • Orchestrator classes combined into single implementation for portability (PR139)
  • smartsim.constants changed to smartsim.status (PR122)
  • smartsim.tf migrated to smartsim.ml.tf (PR115)(PR140)
  • TOML configuration option removed in favor of environment variable approach (PR122)

General Improvements and Bug Fixes:

  • Improve and extend parameter handling (PR107)(PR119)
  • Abstract away non-user facing implementation details (PR122)
  • Add various dimensions to the CI build matrix for SmartSim testing (PR130)
  • Add missing functions to LSFSettings API (PR113)
  • Add RedisAI checker for installed backends (PR137)
  • Remove heavy and unnecessary dependencies (PR116)(PR132)
  • Fix LSFLauncher and LSFOrchestrator (PR86)
  • Fix overly greedy Workload Manager Parsers (PR95)
  • Fix Slurm handling of comma-separated env vars (PR104)
  • Fix internal method calls (PR138)

Documentation Updates: