[datasets] Switch CompilerEnv to the new dataset API. #222

Merged

@ChrisCummins ChrisCummins commented Apr 27, 2021

This switches over the CompilerEnv environment to use the new dataset API, dropping the LegacyDataset class.

Disclaimer: Sorry for the large diff :-( Two reasons: (1) renaming cBench to cbench causes a lot of churn, (2) the datasets API touches nearly everything, and the switch from old to new API should be atomic.

Background

Since the very first prototype of CompilerGym, a Benchmark protocol buffer has been used to provide a serializable representation of benchmarks that can be passed back and forth between the service and the frontend:

message Benchmark {
  // The name of the benchmark to add. In case of conflict with an existing
  // benchmark, this new benchmark replaces the existing one.
  string uri = 1;
  // The description of the program that is being compiled. It is up to the
  // service to determine how to interpret this file, and it is the
  // responsibility of the client to ensure that it provides the correct format.
  // For example, the service could expect that this file contains serialized
  // IR data, or an input source file.
  File program = 2;
}

Initially, it was up to the compiler service to maintain the set of available benchmarks, exposing the available benchmarks with a GetBenchmarks() RPC method, and allowing new benchmarks to be added using an AddBenchmarks() method.

This was fine for the initial use case of shipping a handful of benchmarks and allowing ad-hoc new benchmarks to be added, but for managing larger sets of benchmarks, a datasets abstraction was added.

Initial Datasets abstraction

To add support for managing large sets of programs, a Dataset tuple was added that describes a set of programs and links to a tarball containing those programs. The tarball is required to have a JSON file containing metadata, and a directory containing the benchmarks, one file per benchmark. A set of operations was added to the frontend command line to make downloading and unpacking these tarballs easier:

"""Manage datasets of benchmarks.

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=<env> [command...]

Where :code:`command` is one of :code:`--download=<dataset...>`,
:code:`--activate=<dataset...>`, :code:`--deactivate=<dataset...>`,
and :code:`--delete=<dataset...>`.

Listing installed datasets
--------------------------

If run with no arguments, this command shows an overview of the datasets that
are active, inactive, and available to download. For example:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0
    llvm-v0 benchmarks site dir: /home/user/.local/share/compiler_gym/llvm/10.0.0/bitcode_benchmarks

    +-------------------+--------------+-----------------+----------------+
    | Active Datasets   | License      |   #. Benchmarks | Size on disk   |
    +===================+==============+=================+================+
    | cBench-v1         | BSD 3-Clause |              23 | 10.1 MB        |
    +-------------------+--------------+-----------------+----------------+
    | Total             |              |              23 | 10.1 MB        |
    +-------------------+--------------+-----------------+----------------+
    These benchmarks are ready for use. Deactivate them using `--deactivate=<name>`.

    +---------------------+-----------+-----------------+----------------+
    | Inactive Datasets   | License   |   #. Benchmarks | Size on disk   |
    +=====================+===========+=================+================+
    | Total               |           |               0 | 0 Bytes        |
    +---------------------+-----------+-----------------+----------------+
    These benchmarks may be activated using `--activate=<name>`.

    +------------------------+---------------------------------+-----------------+----------------+
    | Downloadable Dataset   | License                         |   #. Benchmarks | Size on disk   |
    +========================+=================================+=================+================+
    | blas-v0                | BSD 3-Clause                    |             300 | 4.0 MB         |
    +------------------------+---------------------------------+-----------------+----------------+
    | polybench-v0           | BSD 3-Clause                    |              27 | 162.6 kB       |
    +------------------------+---------------------------------+-----------------+----------------+
    These benchmarks may be installed using `--download=<name> --activate=<name>`.

Downloading datasets
--------------------

Use :code:`--download` to download a dataset from the list of available datasets:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --download=npb-v0

After downloading, the dataset will be activated and the benchmarks will be
available to use by the environment.

>>> import compiler_gym
>>> import gym
>>> env = gym.make("llvm-v0")
>>> env.benchmark = "npb-v0"

The flag :code:`--download_all` can be used to download every available dataset:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --download_all

:code:`--download` accepts the URL of any :code:`.tar.bz2` file to support custom datasets:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --download=https://example.com/dataset.tar.bz2

Or use the :code:`file:///` URI to install a local archive file:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --download=file:////tmp/dataset.tar.bz2

The list of datasets that are available to download may be extended by calling
:meth:`CompilerEnv.register_dataset() <compiler_gym.envs.CompilerEnv.register_dataset>`
on a :code:`CompilerEnv` instance.

To programmatically download datasets, see
:meth:`CompilerEnv.require_dataset() <compiler_gym.envs.CompilerEnv.require_dataset>`.

Activating and deactivating datasets
------------------------------------

Datasets have two states: active and inactive. An inactive dataset still exists
locally on the filesystem, but is excluded from use by CompilerGym environments.
This can be useful if you have many datasets downloaded and you would like to
limit the benchmarks that can be selected randomly by an environment.

Activate or deactivate datasets using the :code:`--activate` and :code:`--deactivate`
flags, respectively:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --activate=npb-v0,github-v0 --deactivate=cBench-v1

The :code:`--activate_all` and :code:`--deactivate_all` flags can be used as a
shortcut to activate or deactivate every downloaded dataset:

.. code-block::

    # Activate all inactive datasets:
    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --activate_all
    # Make all active datasets inactive:
    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --deactivate_all

Deleting datasets
-----------------

To remove a dataset from the filesystem, use :code:`--delete`:

.. code-block::

    $ python -m compiler_gym.bin.datasets --env=llvm-v0 --delete=npb-v0

Once deleted, a dataset must be downloaded before it can be used again.

A :code:`--delete_all` flag can be used to delete all of the locally installed
datasets.
"""

Problems with this approach

  • Leaky abstraction: Both the environment and the backend service have to know about datasets. This means duplicated logic, and adds the maintenance burden of keeping the C++ and Python logic in sync.
  • Inflexible: Only supports environments in which a single file represents a benchmark. No support for multi-file benchmarks, benchmarks that are compiled on-demand, etc.
  • O(n) space and time overhead on each service instance, where n is the total number of benchmarks. At init time, each service needs to recursively scan a directory tree to build a list of available benchmarks. This list must be kept in memory. This adds startup time, and also causes cache invalidation issues when multiple environment instances are modifying the underlying filesystem.

New Dataset API

This pull request changes the ownership model so that the Environment owns the benchmarks and datasets, not the service. This uses the new Dataset class hierarchy that has been added in previous pull requests: #190, #191, #192, #200, #201.

Now, the backend has no knowledge of "datasets". Instead, the service simply keeps a small cache of benchmarks that it has seen. If a session request has a benchmark URI that is not in this cache, the service returns a "resource not found" error, and the frontend logic can then respond by sending it a copy of the benchmark as a Benchmark proto. The service is free to cache this for future use, and can empty the cache whenever it wants.
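
To make the cache-miss handshake concrete, here is a minimal, self-contained sketch. It is not the CompilerGym implementation: `ToyService`, `BenchmarkNotFound`, the LRU eviction policy, and the free function `start_session()` are all hypothetical stand-ins, and `Benchmark` is a plain dataclass mirroring the two proto fields rather than the generated proto class.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """Stand-in for the Benchmark proto: a URI plus the program payload."""

    uri: str
    program: bytes


class BenchmarkNotFound(Exception):
    """Stand-in for the service's "resource not found" error."""


class ToyService:
    """Hypothetical backend that only keeps a small cache of benchmarks."""

    def __init__(self, max_cached: int = 2):
        self._cache = OrderedDict()
        self._max_cached = max_cached

    def add_benchmark(self, benchmark: Benchmark) -> None:
        self._cache[benchmark.uri] = benchmark
        self._cache.move_to_end(benchmark.uri)
        # The service may drop entries whenever it wants; here we evict
        # the least recently used one (an assumed policy).
        while len(self._cache) > self._max_cached:
            self._cache.popitem(last=False)

    def start_session(self, uri: str) -> str:
        if uri not in self._cache:
            raise BenchmarkNotFound(uri)
        self._cache.move_to_end(uri)
        return f"session:{uri}"


def start_session(service: ToyService, benchmark: Benchmark) -> str:
    """Frontend logic: try the URI first, upload the benchmark on a miss."""
    try:
        return service.start_session(benchmark.uri)
    except BenchmarkNotFound:
        service.add_benchmark(benchmark)
        return service.start_session(benchmark.uri)
```

The frontend never needs to know what the service has cached: a miss costs one extra round trip, and an eviction on the service side is recovered from in the same way.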

This new approach has a few key benefits:

  • By moving all of the datasets logic into the frontend, it becomes much easier for users to define their own datasets.
  • Reduces compiler service startup time as it removes the need for each service to do a recursive filesystem sweep.
  • Removes the requirement that the set of benchmarks is fully enumerable, allowing for program generators that can produce a theoretically infinite number of benchmarks.
  • Adds support for lazily-compiled datasets of programs that are generated on-demand.
  • Removes the need to download datasets ahead of time. Datasets can now be installed on-demand.
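
As an illustration of the non-enumerable and lazily generated cases, the sketch below models a dataset backed by a program generator. `GeneratedDataset`, its URI scheme, and the generated C snippet are invented for this example and are not part of the Dataset class hierarchy added in the earlier pull requests.

```python
import itertools
from typing import Iterator, Tuple


class GeneratedDataset:
    """Hypothetical dataset backed by a deterministic program generator."""

    def __init__(self, name: str):
        self.name = name
        self.generated = 0  # Number of programs actually produced so far.

    def benchmark(self, seed: int) -> Tuple[str, str]:
        """Produce the (uri, source) pair for one seed, on demand."""
        self.generated += 1
        uri = f"{self.name}/{seed}"
        source = f"int main() {{ return {seed}; }}"
        return uri, source

    def benchmark_uris(self) -> Iterator[str]:
        """An unbounded stream of URIs. No program is generated here."""
        return (f"{self.name}/{seed}" for seed in itertools.count())


dataset = GeneratedDataset("toy-generator-v0")
# Callers must bound the iteration themselves, e.g. with islice().
first_three = list(itertools.islice(dataset.benchmark_uris(), 3))
```

Because URIs are streamed and programs are only built when `benchmark()` is called, nothing has to be downloaded or enumerated up front.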

Summary of changes

  • Changes the type of env.benchmark from a string to a Benchmark instance.
  • Makes env.benchmark a mandatory attribute. If no benchmark is provided at init time, one is chosen deterministically. If you wish to select a random benchmark, use env.datasets.benchmark().
  • env.fork() no longer requires env.reset() to have been called first. It will call env.reset() if required.
  • env.benchmark = None is no longer a valid way of requesting a random benchmark. If you would like a random benchmark, you must now roll your own random picker using env.datasets.benchmark_uris() and similar.
  • Deprecates all LegacyDataset operations, changing their behavior to no-ops, and removes the class.
  • Renames cBench to cbench to be consistent with the lower-case naming convention of gym. The old cBench datasets are kept around but are marked deprecated to encourage migration.
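
For the "roll your own random picker" case mentioned above, one possible approach is single-pass reservoir sampling over an iterable of URIs. This is only a sketch: `pick_random_uri()` and its `limit` parameter are hypothetical helpers, and the function is written against a plain iterable so that it does not depend on the exact return type of `env.datasets.benchmark_uris()`.

```python
import itertools
import random
from typing import Iterable, Optional


def pick_random_uri(
    uris: Iterable[str], rng: random.Random, limit: Optional[int] = None
) -> str:
    """Pick one URI uniformly at random from the first `limit` URIs.

    Uses reservoir sampling, so the iterable is consumed once and never
    materialized as a list. Pass a `limit` for datasets that may be
    unbounded; limit=None consumes the whole iterable.
    """
    chosen = None
    for count, uri in enumerate(itertools.islice(uris, limit), start=1):
        # Replace the current choice with probability 1/count.
        if rng.randrange(count) == 0:
            chosen = uri
    if chosen is None:
        raise ValueError("no benchmark URIs to choose from")
    return chosen
```

With an environment, the iterable would presumably be `env.datasets.benchmark_uris()`, with a `limit` chosen to keep the scan cheap on large datasets.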

Migrating to the new interface

To migrate existing code to the new interface:

  1. Update references to cBench-v[01] to cbench-v1.
  2. Review code that accesses the env.benchmark property and update to env.benchmark.uri if a string name is required.
  3. Review code that calls env.reset() without first setting a benchmark. Previously, calling env.reset() would select a random benchmark. Now, env.reset() always selects the last used benchmark, or a predetermined default if none is specified.
  4. Review code that relies on env.benchmark being None to select benchmarks randomly. Now, env.benchmark is always set to the previously used benchmark, or a predetermined default benchmark if none has been provided.
  5. Remove calls to env.require_dataset().
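
Steps 1 and 2 can be partly mechanized. The helpers below are a hypothetical sketch, not part of CompilerGym: `migrate_benchmark_name()` rewrites legacy cBench dataset names inside a string, and `benchmark_uri()` returns a string name whether it is handed a legacy string or an object with a `uri` attribute. The benchmark names used in the usage lines are illustrative only.

```python
import re
from types import SimpleNamespace

# Matches cBench-v0 and cBench-v1, but not e.g. cBench-v10.
_LEGACY_CBENCH = re.compile(r"\bcBench-v[01]\b")


def migrate_benchmark_name(name: str) -> str:
    """Rewrite legacy cBench-v0 / cBench-v1 references to cbench-v1."""
    return _LEGACY_CBENCH.sub("cbench-v1", name)


def benchmark_uri(benchmark) -> str:
    """Return a string name for a legacy string or an object with .uri."""
    return str(getattr(benchmark, "uri", benchmark))


# Usage with illustrative names:
old_style = "cBench-v0/qsort"
new_style = SimpleNamespace(uri="cbench-v1/qsort")  # Stand-in for env.benchmark.
migrated = migrate_benchmark_name(benchmark_uri(old_style))
```

A helper like `benchmark_uri()` is mainly useful during a transition period in which the same code has to run against both the old and the new interface.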

Performance impact

A comparison of benchmarks/bench_test.py on current development vs this PR:

---------------------------------------------------- benchmark 'test_fork[llvm;fast-benchmark]': 2 tests -----------------------------------------------------
Name (time in ms)                                    Min            Median               Max              Mean            StdDev                 OPS          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fork[llvm;fast-benchmark] (development-)     4.7868 (1.08)     5.2404 (1.01)     8.8085 (1.21)     5.2855 (1.01)     0.5359 (1.12)     189.1979 (0.99)   
test_fork[llvm;fast-benchmark] (new-dataset-)     4.4304 (1.0)      5.2111 (1.0)      7.2661 (1.0)      5.2436 (1.0)      0.4801 (1.0)      190.7098 (1.0)    
--------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_fork[llvm;slow-benchmark]': 2 tests -------------------------------------------------------
Name (time in ms)                                     Min             Median                 Max               Mean            StdDev                OPS          
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fork[llvm;slow-benchmark] (development-)     76.2420 (1.0)      85.9890 (1.07)     125.7991 (1.36)     86.5312 (1.07)     5.5715 (2.18)     11.5565 (0.93)   
test_fork[llvm;slow-benchmark] (new-dataset-)     76.9690 (1.01)     80.1162 (1.0)       92.2011 (1.0)      80.7325 (1.0)      2.5577 (1.0)      12.3866 (1.0)    
------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_local[dummy-cc]': 2 tests -------------------------------------------------------
Name (time in ms)                                 Min              Median                 Max                Mean            StdDev               OPS          
---------------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_local[dummy-cc] (development-)     109.6519 (1.00)     110.1575 (1.0)      112.0789 (1.00)     110.2666 (1.0)      0.4347 (1.0)      9.0689 (1.0)    
test_make_local[dummy-cc] (new-dataset-)     109.5566 (1.0)      110.3070 (1.00)     111.9780 (1.0)      110.4062 (1.00)     0.4862 (1.12)     9.0575 (1.00)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_local[dummy-py]': 2 tests -------------------------------------------------------
Name (time in ms)                                 Min              Median                 Max                Mean            StdDev               OPS          
---------------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_local[dummy-py] (development-)     555.5705 (1.00)     558.3913 (1.00)     566.5027 (1.00)     559.0482 (1.00)     2.2833 (1.05)     1.7888 (1.00)   
test_make_local[dummy-py] (new-dataset-)     554.8298 (1.0)      557.6217 (1.0)      566.4247 (1.0)      558.3312 (1.0)      2.1698 (1.0)      1.7911 (1.0)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_local[llvm]': 2 tests -------------------------------------------------------
Name (time in ms)                             Min              Median                 Max                Mean            StdDev               OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_local[llvm] (development-)     113.2592 (1.01)     114.1064 (1.00)     145.3874 (1.02)     114.5627 (1.0)      3.1670 (1.0)      8.7288 (1.0)    
test_make_local[llvm] (new-dataset-)     112.6326 (1.0)      113.9273 (1.0)      142.5838 (1.0)      114.6359 (1.00)     3.9459 (1.25)     8.7233 (1.00)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_service[dummy-cc]': 2 tests -------------------------------------------------------
Name (time in ms)                                   Min              Median                 Max                Mean            StdDev               OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_service[dummy-cc] (development-)     201.4110 (1.0)      202.2620 (1.0)      203.0277 (1.0)      202.2196 (1.0)      0.3515 (1.0)      4.9451 (1.0)    
test_make_service[dummy-cc] (new-dataset-)     201.4547 (1.00)     202.3753 (1.00)     203.0454 (1.00)     202.3238 (1.00)     0.3919 (1.11)     4.9426 (1.00)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_service[dummy-py]': 2 tests -------------------------------------------------------
Name (time in ms)                                   Min              Median                 Max                Mean            StdDev               OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_service[dummy-py] (development-)     201.3660 (1.0)      202.2194 (1.00)     203.1361 (1.00)     202.2493 (1.00)     0.4470 (1.22)     4.9444 (1.00)   
test_make_service[dummy-py] (new-dataset-)     201.4377 (1.00)     202.1623 (1.0)      203.0117 (1.0)      202.1713 (1.0)      0.3653 (1.0)      4.9463 (1.0)    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------- benchmark 'test_make_service[llvm]': 2 tests -------------------------------------------------------
Name (time in ms)                               Min              Median                 Max                Mean            StdDev               OPS          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
test_make_service[llvm] (development-)     201.5461 (1.0)      202.5124 (1.00)     203.0784 (1.0)      202.5192 (1.00)     0.3206 (1.0)      4.9378 (1.00)   
test_make_service[llvm] (new-dataset-)     201.7566 (1.00)     202.4854 (1.0)      203.2073 (1.00)     202.4763 (1.0)      0.3587 (1.12)     4.9389 (1.0)    
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------- benchmark 'test_observation[dummy-cc]': 2 tests -----------------------------------------------------------
Name (time in us)                                  Min              Median                 Max                Mean            StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[dummy-cc] (development-)     111.6206 (1.0)      117.2810 (1.0)      128.3165 (1.01)     116.7909 (1.0)      3.1407 (1.0)            8.5623 (1.0)    
test_observation[dummy-cc] (new-dataset-)     111.6874 (1.00)     119.8749 (1.02)     126.6987 (1.0)      119.3415 (1.02)     3.4948 (1.11)           8.3793 (0.98)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[dummy-py]': 2 tests -----------------------------------------------------------
Name (time in us)                                  Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[dummy-py] (development-)     351.1989 (1.0)      413.2817 (1.0)      475.7067 (1.0)      401.1166 (1.0)      22.3103 (1.11)           2.4930 (1.0)    
test_observation[dummy-py] (new-dataset-)     381.2727 (1.09)     421.1713 (1.02)     492.8895 (1.04)     422.0407 (1.05)     20.0245 (1.0)            2.3694 (0.95)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;AutophaseDict]': 2 tests -----------------------------------------------------------
Name (time in us)                                            Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;AutophaseDict] (development-)     569.1370 (1.0)      602.9739 (1.03)     623.7917 (1.0)      590.6880 (1.00)     15.7981 (1.55)           1.6929 (1.00)   
test_observation[llvm;AutophaseDict] (new-dataset-)     579.2306 (1.02)     587.4920 (1.0)      624.3053 (1.00)     589.7491 (1.0)      10.1632 (1.0)            1.6956 (1.0)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;Autophase]': 2 tests -----------------------------------------------------------
Name (time in us)                                        Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Autophase] (development-)     554.1710 (1.0)      558.1171 (1.0)      593.8818 (1.0)      560.6608 (1.0)       6.8719 (1.0)            1.7836 (1.0)    
test_observation[llvm;Autophase] (new-dataset-)     563.2491 (1.02)     568.6511 (1.02)     609.7621 (1.03)     580.1250 (1.03)     16.2488 (2.36)           1.7238 (0.97)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------- benchmark 'test_observation[llvm;BitcodeFile]': 2 tests -----------------------------------------------------------
Name (time in us)                                          Min              Median                 Max                Mean            StdDev            OPS (Kops/s)          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;BitcodeFile] (development-)     908.1498 (1.01)     924.9073 (1.00)     962.8525 (1.0)      922.1114 (1.0)      8.2142 (1.0)            1.0845 (1.0)    
test_observation[llvm;BitcodeFile] (new-dataset-)     894.9384 (1.0)      923.7803 (1.0)      964.9833 (1.00)     924.0150 (1.00)     9.2621 (1.13)           1.0822 (1.00)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;CpuInfo]': 2 tests -----------------------------------------------------------
Name (time in us)                                      Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;CpuInfo] (development-)     378.5042 (1.0)      380.7586 (1.0)      425.7582 (1.0)      390.6516 (1.0)      15.7723 (1.20)           2.5598 (1.0)    
test_observation[llvm;CpuInfo] (new-dataset-)     387.6636 (1.02)     395.8439 (1.04)     428.2742 (1.01)     399.7716 (1.02)     13.1021 (1.0)            2.5014 (0.98)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_observation[llvm;Inst2vecEmbeddingIndices]': 2 tests ------------------------------------------------------
Name (time in ms)                                                      Min             Median                Max               Mean            StdDev                OPS          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Inst2vecEmbeddingIndices] (development-)     20.4026 (1.00)     20.6619 (1.01)     30.2996 (1.28)     20.9854 (1.01)     1.5501 (2.07)     47.6523 (0.99)   
test_observation[llvm;Inst2vecEmbeddingIndices] (new-dataset-)     20.3171 (1.0)      20.4712 (1.0)      23.7006 (1.0)      20.8204 (1.0)      0.7502 (1.0)      48.0299 (1.0)    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_observation[llvm;Inst2vecPreprocessedText]': 2 tests ------------------------------------------------------
Name (time in ms)                                                      Min             Median                Max               Mean            StdDev                OPS          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Inst2vecPreprocessedText] (development-)     19.9224 (1.0)      20.1620 (1.0)      29.7508 (1.28)     20.4071 (1.0)      1.3099 (1.92)     49.0026 (1.0)    
test_observation[llvm;Inst2vecPreprocessedText] (new-dataset-)     20.2605 (1.02)     20.7100 (1.03)     23.2698 (1.0)      20.9090 (1.02)     0.6836 (1.0)      47.8263 (0.98)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_observation[llvm;Inst2vec]': 2 tests ------------------------------------------------------
Name (time in ms)                                      Min             Median                Max               Mean            StdDev                OPS          
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Inst2vec] (development-)     21.4781 (1.04)     21.6170 (1.0)      48.6709 (2.18)     21.9228 (1.02)     2.7047 (6.72)     45.6146 (0.98)   
test_observation[llvm;Inst2vec] (new-dataset-)     20.7327 (1.0)      21.6944 (1.00)     22.3639 (1.0)      21.5298 (1.0)      0.4027 (1.0)      46.4472 (1.0)    
------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;InstCountDict]': 2 tests -----------------------------------------------------------
Name (time in us)                                            Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;InstCountDict] (development-)     594.2966 (1.0)      606.5216 (1.0)      663.9021 (1.00)     612.2359 (1.0)      15.9214 (1.0)            1.6334 (1.0)    
test_observation[llvm;InstCountDict] (new-dataset-)     604.3585 (1.02)     621.9485 (1.03)     661.8658 (1.0)      628.3410 (1.03)     17.6577 (1.11)           1.5915 (0.97)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;InstCountNormDict]': 2 tests -----------------------------------------------------------
Name (time in us)                                                Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;InstCountNormDict] (development-)     617.3755 (1.0)      634.6613 (1.0)      694.5362 (1.02)     640.4637 (1.0)      19.9173 (1.38)           1.5614 (1.0)    
test_observation[llvm;InstCountNormDict] (new-dataset-)     626.0326 (1.01)     660.3358 (1.04)     683.4273 (1.0)      659.4437 (1.03)     14.4769 (1.0)            1.5164 (0.97)   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;InstCountNorm]': 2 tests -----------------------------------------------------------
Name (time in us)                                            Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;InstCountNorm] (development-)     586.1992 (1.0)      596.2407 (1.0)      655.6554 (1.01)     604.1648 (1.0)      15.7914 (1.12)           1.6552 (1.0)    
test_observation[llvm;InstCountNorm] (new-dataset-)     596.5968 (1.02)     605.3129 (1.02)     652.1745 (1.0)      610.3092 (1.01)     14.1455 (1.0)            1.6385 (0.99)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;InstCount]': 2 tests -----------------------------------------------------------
Name (time in us)                                        Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;InstCount] (development-)     568.1948 (1.0)      579.1192 (1.0)      643.4028 (1.03)     585.1269 (1.0)      14.9749 (1.39)           1.7090 (1.0)    
test_observation[llvm;InstCount] (new-dataset-)     578.4702 (1.02)     586.7447 (1.01)     623.2057 (1.0)      588.3855 (1.01)     10.7792 (1.0)            1.6996 (0.99)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;IrInstructionCountO0]': 2 tests -----------------------------------------------------------
Name (time in us)                                                   Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;IrInstructionCountO0] (development-)     353.9864 (1.0)      366.6099 (1.0)      406.5381 (1.01)     375.6579 (1.0)      16.5761 (1.18)           2.6620 (1.0)    
test_observation[llvm;IrInstructionCountO0] (new-dataset-)     362.0870 (1.02)     376.0681 (1.03)     401.8593 (1.0)      379.9843 (1.01)     13.9990 (1.0)            2.6317 (0.99)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;IrInstructionCountO3]': 2 tests -----------------------------------------------------------
Name (time in us)                                                   Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;IrInstructionCountO3] (development-)     353.3580 (1.0)      355.4808 (1.0)      414.2023 (1.04)     358.6864 (1.0)       8.9728 (1.0)            2.7880 (1.0)    
test_observation[llvm;IrInstructionCountO3] (new-dataset-)     361.3955 (1.02)     367.6248 (1.03)     397.4840 (1.0)      371.4302 (1.04)     11.8482 (1.32)           2.6923 (0.97)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;IrInstructionCountOz]': 2 tests -----------------------------------------------------------
Name (time in us)                                                   Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;IrInstructionCountOz] (development-)     353.4154 (1.0)      361.9962 (1.0)      388.6243 (1.0)      363.2143 (1.0)       6.9667 (1.0)            2.7532 (1.0)    
test_observation[llvm;IrInstructionCountOz] (new-dataset-)     359.8931 (1.02)     368.2843 (1.02)     396.8973 (1.02)     373.1967 (1.03)     12.9433 (1.86)           2.6796 (0.97)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;IrInstructionCount]': 2 tests -----------------------------------------------------------
Name (time in us)                                                 Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;IrInstructionCount] (development-)     369.4690 (1.0)      376.9218 (1.0)      433.0396 (1.04)     384.4245 (1.0)      16.3143 (1.13)           2.6013 (1.0)    
test_observation[llvm;IrInstructionCount] (new-dataset-)     369.4935 (1.00)     386.3037 (1.02)     416.9821 (1.0)      391.4656 (1.02)     14.4663 (1.0)            2.5545 (0.98)   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;Ir]': 2 tests -----------------------------------------------------------
Name (time in us)                                 Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Ir] (development-)     752.2196 (1.00)     756.0559 (1.0)      807.7845 (1.0)      765.5881 (1.0)      16.2103 (1.0)            1.3062 (1.0)    
test_observation[llvm;Ir] (new-dataset-)     751.2741 (1.0)      763.9933 (1.01)     830.2374 (1.03)     771.1941 (1.01)     17.6519 (1.09)           1.2967 (0.99)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_observation[llvm;ObjectTextSizeBytes]': 2 tests ------------------------------------------------------
Name (time in ms)                                                 Min             Median                Max               Mean            StdDev                OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;ObjectTextSizeBytes] (development-)     12.6753 (1.0)      14.7683 (1.0)      16.0970 (1.0)      14.7236 (1.0)      0.6736 (1.17)     67.9181 (1.0)    
test_observation[llvm;ObjectTextSizeBytes] (new-dataset-)     13.8054 (1.09)     14.8847 (1.01)     16.1217 (1.00)     15.0045 (1.02)     0.5768 (1.0)      66.6468 (0.98)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;ObjectTextSizeO0]': 2 tests -----------------------------------------------------------
Name (time in us)                                               Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;ObjectTextSizeO0] (development-)     354.4941 (1.0)      359.3133 (1.0)      392.8660 (1.0)      360.2502 (1.0)       6.7107 (1.0)            2.7758 (1.0)    
test_observation[llvm;ObjectTextSizeO0] (new-dataset-)     362.1932 (1.02)     377.7409 (1.05)     401.6805 (1.02)     379.7996 (1.05)     12.2977 (1.83)           2.6330 (0.95)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;ObjectTextSizeO3]': 2 tests -----------------------------------------------------------
Name (time in us)                                               Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;ObjectTextSizeO3] (development-)     360.4684 (1.0)      362.6466 (1.0)      377.8102 (1.0)      363.3284 (1.0)       3.1129 (1.0)            2.7523 (1.0)    
test_observation[llvm;ObjectTextSizeO3] (new-dataset-)     360.4812 (1.00)     370.2278 (1.02)     401.3857 (1.06)     373.9617 (1.03)     11.8220 (3.80)           2.6741 (0.97)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_observation[llvm;ObjectTextSizeOz]': 2 tests -----------------------------------------------------------
Name (time in us)                                               Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;ObjectTextSizeOz] (development-)     321.5930 (1.0)      355.2196 (1.0)      399.3834 (1.0)      351.7564 (1.0)      23.1039 (1.81)           2.8429 (1.0)    
test_observation[llvm;ObjectTextSizeOz] (new-dataset-)     363.6290 (1.13)     377.2168 (1.06)     404.1628 (1.01)     380.4841 (1.08)     12.7400 (1.0)            2.6282 (0.92)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_observation[llvm;Programl]': 2 tests ------------------------------------------------------
Name (time in ms)                                      Min             Median                Max               Mean            StdDev                OPS          
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_observation[llvm;Programl] (development-)     31.1456 (1.0)      33.2769 (1.0)      51.7074 (1.17)     37.0543 (1.0)      6.5835 (3.47)     26.9874 (1.0)    
test_observation[llvm;Programl] (new-dataset-)     35.5645 (1.14)     39.9278 (1.20)     44.3395 (1.0)      40.0480 (1.08)     1.8946 (1.0)      24.9700 (0.93)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------- benchmark 'test_reset[dummy-cc]': 2 tests -----------------------------------------------------------
Name (time in us)                            Min              Median                 Max                Mean            StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reset[dummy-cc] (development-)     174.3554 (1.0)      186.7939 (1.0)      207.2290 (1.0)      188.5844 (1.0)      6.8768 (1.0)            5.3027 (1.0)    
test_reset[dummy-cc] (new-dataset-)     184.6878 (1.06)     202.2127 (1.08)     220.9869 (1.07)     201.9652 (1.07)     7.4495 (1.08)           4.9513 (0.93)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reset[dummy-py]': 2 tests -----------------------------------------------------------
Name (time in us)                            Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reset[dummy-py] (development-)     472.7083 (1.0)      531.4210 (1.0)      627.5417 (1.0)      529.1365 (1.0)      52.9727 (1.68)           1.8899 (1.0)    
test_reset[dummy-py] (new-dataset-)     506.5356 (1.07)     566.5837 (1.07)     655.9474 (1.05)     567.7660 (1.07)     31.6092 (1.0)            1.7613 (0.93)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------- benchmark 'test_reset[llvm;fast-benchmark]': 2 tests ----------------------------------------------------
Name (time in ms)                                     Min            Median               Max              Mean            StdDev                 OPS          
---------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reset[llvm;fast-benchmark] (development-)     1.1733 (1.0)      1.5876 (1.03)     2.2853 (1.40)     1.5149 (1.0)      0.2462 (6.83)     660.1302 (1.0)    
test_reset[llvm;fast-benchmark] (new-dataset-)     1.4504 (1.24)     1.5402 (1.0)      1.6347 (1.0)      1.5375 (1.01)     0.0360 (1.0)      650.3941 (0.99)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_reset[llvm;slow-benchmark]': 2 tests ------------------------------------------------------
Name (time in ms)                                      Min             Median                Max               Mean            StdDev                OPS          
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reset[llvm;slow-benchmark] (development-)     74.1077 (1.0)      80.9428 (1.0)      93.8438 (1.07)     80.9114 (1.0)      2.8289 (1.02)     12.3592 (1.0)    
test_reset[llvm;slow-benchmark] (new-dataset-)     75.0286 (1.01)     82.7532 (1.02)     87.8640 (1.0)      83.0267 (1.03)     2.7793 (1.0)      12.0443 (0.97)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[dummy-cc]': 2 tests ----------------------------------------------------------
Name (time in us)                             Min              Median                 Max                Mean            StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[dummy-cc] (development-)     104.5140 (1.0)      118.2826 (1.0)      125.9687 (1.0)      117.8680 (1.0)      5.3668 (1.85)           8.4841 (1.0)    
test_reward[dummy-cc] (new-dataset-)     116.8716 (1.12)     124.4331 (1.05)     132.5936 (1.05)     124.4742 (1.06)     2.8980 (1.0)            8.0338 (0.95)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[dummy-py]': 2 tests -----------------------------------------------------------
Name (time in us)                             Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[dummy-py] (development-)     285.5907 (1.0)      293.8494 (1.0)      362.0608 (1.0)      305.7921 (1.0)      20.2975 (1.0)            3.2702 (1.0)    
test_reward[dummy-py] (new-dataset-)     297.3947 (1.04)     334.5770 (1.14)     405.7645 (1.12)     335.1628 (1.10)     24.1988 (1.19)           2.9836 (0.91)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[llvm;IrInstructionCountNorm]': 2 tests -----------------------------------------------------------
Name (time in us)                                                Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;IrInstructionCountNorm] (development-)     383.9045 (1.01)     386.9356 (1.0)      430.3412 (1.03)     398.4357 (1.01)     14.3518 (1.10)           2.5098 (0.99)   
test_reward[llvm;IrInstructionCountNorm] (new-dataset-)     378.4555 (1.0)      387.6897 (1.00)     416.1819 (1.0)      392.7855 (1.0)      13.0213 (1.0)            2.5459 (1.0)    
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[llvm;IrInstructionCountO3]': 2 tests -----------------------------------------------------------
Name (time in us)                                              Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;IrInstructionCountO3] (development-)     347.1320 (1.0)      384.4029 (1.0)      411.6064 (1.0)      390.8936 (1.0)      17.9807 (1.45)           2.5582 (1.0)    
test_reward[llvm;IrInstructionCountO3] (new-dataset-)     380.8883 (1.10)     385.9082 (1.00)     417.5061 (1.01)     391.5075 (1.00)     12.3886 (1.0)            2.5542 (1.00)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[llvm;IrInstructionCountOz]': 2 tests -----------------------------------------------------------
Name (time in us)                                              Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;IrInstructionCountOz] (development-)     376.0134 (1.0)      377.7239 (1.0)      412.7454 (1.0)      380.8870 (1.0)       7.6893 (1.0)            2.6255 (1.0)    
test_reward[llvm;IrInstructionCountOz] (new-dataset-)     379.6122 (1.01)     387.6663 (1.03)     416.5290 (1.01)     390.8013 (1.03)     10.3978 (1.35)           2.5588 (0.97)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_reward[llvm;IrInstructionCount]': 2 tests -----------------------------------------------------------
Name (time in us)                                            Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;IrInstructionCount] (development-)     374.9172 (1.0)      409.4628 (1.05)     419.4225 (1.00)     397.4140 (1.01)     15.3283 (1.34)           2.5163 (0.99)   
test_reward[llvm;IrInstructionCount] (new-dataset-)     381.6203 (1.02)     390.0886 (1.0)      418.5234 (1.0)      393.8175 (1.0)      11.4720 (1.0)            2.5392 (1.0)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_reward[llvm;ObjectTextSizeBytes]': 2 tests ------------------------------------------------------
Name (time in ms)                                            Min             Median                Max               Mean            StdDev                OPS          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;ObjectTextSizeBytes] (development-)     12.3360 (1.0)      14.9160 (1.0)      16.4057 (1.02)     14.9060 (1.0)      0.7356 (1.34)     67.0869 (1.0)    
test_reward[llvm;ObjectTextSizeBytes] (new-dataset-)     13.6626 (1.11)     14.9567 (1.00)     16.1302 (1.0)      14.9730 (1.00)     0.5490 (1.0)      66.7868 (1.00)   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_reward[llvm;ObjectTextSizeNorm]': 2 tests ------------------------------------------------------
Name (time in ms)                                           Min             Median                Max               Mean            StdDev                OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;ObjectTextSizeNorm] (development-)     13.3458 (1.07)     14.8738 (1.01)     16.2651 (1.00)     14.8633 (1.01)     0.6330 (1.0)      67.2796 (0.99)   
test_reward[llvm;ObjectTextSizeNorm] (new-dataset-)     12.4342 (1.0)      14.7895 (1.0)      16.1935 (1.0)      14.7106 (1.0)      0.6690 (1.06)     67.9784 (1.0)    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_reward[llvm;ObjectTextSizeO3]': 2 tests ------------------------------------------------------
Name (time in ms)                                         Min             Median                Max               Mean            StdDev                OPS          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;ObjectTextSizeO3] (development-)     12.7029 (1.0)      14.8305 (1.00)     16.0186 (1.01)     14.8115 (1.0)      0.6276 (1.04)     67.5152 (1.0)    
test_reward[llvm;ObjectTextSizeO3] (new-dataset-)     13.3724 (1.05)     14.7628 (1.0)      15.8921 (1.0)      14.8179 (1.00)     0.6018 (1.0)      67.4859 (1.00)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_reward[llvm;ObjectTextSizeOz]': 2 tests ------------------------------------------------------
Name (time in ms)                                         Min             Median                Max               Mean            StdDev                OPS          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reward[llvm;ObjectTextSizeOz] (development-)     12.5970 (1.0)      15.0874 (1.02)     16.4650 (1.02)     14.9618 (1.02)     0.7000 (1.00)     66.8369 (0.98)   
test_reward[llvm;ObjectTextSizeOz] (new-dataset-)     12.9173 (1.03)     14.7451 (1.0)      16.0704 (1.0)      14.7057 (1.0)      0.6988 (1.0)      68.0010 (1.0)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_step[dummy-cc]': 2 tests ----------------------------------------------------------
Name (time in us)                           Min              Median                 Max                Mean            StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[dummy-cc] (development-)      97.4931 (1.0)      104.0608 (1.0)      108.4755 (1.0)      103.8169 (1.0)      2.9708 (1.25)           9.6323 (1.0)    
test_step[dummy-cc] (new-dataset-)     101.8252 (1.04)     105.9125 (1.02)     115.7945 (1.07)     105.9827 (1.02)     2.3805 (1.0)            9.4355 (0.98)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_step[dummy-py]': 2 tests -----------------------------------------------------------
Name (time in us)                           Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[dummy-py] (development-)     273.3084 (1.0)      304.8896 (1.0)      376.2387 (1.08)     297.6722 (1.0)      20.5929 (2.26)           3.3594 (1.0)    
test_step[dummy-py] (new-dataset-)     274.5965 (1.00)     309.1488 (1.01)     349.9082 (1.0)      309.3928 (1.04)      9.1237 (1.0)            3.2321 (0.96)   
----------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_step[llvm;fast-benchmark;fast-action]': 2 tests -----------------------------------------------------------
Name (time in us)                                                  Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[llvm;fast-benchmark;fast-action] (development-)     412.5318 (1.0)      441.8161 (1.0)      488.9862 (1.0)      442.4692 (1.0)      15.2037 (1.0)            2.2600 (1.0)    
test_step[llvm;fast-benchmark;fast-action] (new-dataset-)     413.5961 (1.00)     444.7389 (1.01)     490.6238 (1.00)     444.3464 (1.00)     15.2365 (1.00)           2.2505 (1.00)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------- benchmark 'test_step[llvm;fast-benchmark;slow-action]': 2 tests -----------------------------------------------------------
Name (time in us)                                                  Min              Median                 Max                Mean             StdDev            OPS (Kops/s)          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[llvm;fast-benchmark;slow-action] (development-)     730.8104 (1.0)      743.6683 (1.00)     843.6199 (1.08)     746.4463 (1.0)      18.1063 (1.22)           1.3397 (1.0)    
test_step[llvm;fast-benchmark;slow-action] (new-dataset-)     737.0641 (1.01)     741.1574 (1.0)      781.5783 (1.0)      752.1123 (1.01)     14.8591 (1.0)            1.3296 (0.99)   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_step[llvm;slow-benchmark;fast-action]': 2 tests ------------------------------------------------------
Name (time in ms)                                                 Min             Median                Max               Mean            StdDev                OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[llvm;slow-benchmark;fast-action] (development-)     13.5394 (1.0)      16.5735 (1.0)      21.9988 (1.26)     16.1273 (1.0)      2.3965 (8.30)     62.0069 (1.0)    
test_step[llvm;slow-benchmark;fast-action] (new-dataset-)     16.0728 (1.19)     16.7939 (1.01)     17.5153 (1.0)      16.7986 (1.04)     0.2889 (1.0)      59.5288 (0.96)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------ benchmark 'test_step[llvm;slow-benchmark;slow-action]': 2 tests ------------------------------------------------------
Name (time in ms)                                                 Min             Median                Max               Mean            StdDev                OPS          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_step[llvm;slow-benchmark;slow-action] (development-)     62.5400 (1.0)      69.3483 (1.0)      83.5913 (1.03)     69.4179 (1.0)      3.2184 (1.08)     14.4055 (1.0)    
test_step[llvm;slow-benchmark;slow-action] (new-dataset-)     63.4310 (1.01)     69.7768 (1.01)     81.4955 (1.0)      69.8222 (1.01)     2.9710 (1.0)      14.3221 (0.99)   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
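For reading the tables above: the number in parentheses after each value is that value relative to the best entry in its column (lowest for the timing columns, highest for OPS). A minimal sketch of the arithmetic, using the `test_observation[llvm;Programl]` rows as input; the helper name `relative` is made up for illustration and is not part of pytest-benchmark:

```python
def relative(value: float, best: float) -> float:
    """Ratio of a measurement to the best measurement in its column."""
    return round(value / best, 2)


# Median times in ms, copied from the Programl table above.
dev_median, new_median = 33.2769, 39.9278
# For timings the best value is the smallest one.
median_ratio = relative(new_median, min(dev_median, new_median))

# Operations per second, copied from the same table.
dev_ops, new_ops = 26.9874, 24.9700
# For OPS the best value is the largest one.
ops_ratio = relative(new_ops, max(dev_ops, new_ops))

print(median_ratio, ops_ratio)
```

The two ratios correspond to the (1.20) and (0.93) entries on the new-dataset row.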

Issue #45.

@facebook-github-bot added the CLA Signed label Apr 27, 2021

@hughleat left a comment

LGTM

force a specific benchmark to be chosen, set this property (or pass
the benchmark as an argument to :func:`reset`):
By default, a benchmark will be selected randomly by the service from
the available benchmarks on a call to :func:`reset`. To force a specific
Contributor

Is there always guaranteed to be at least one available benchmark?
What happens if someone uninstalls all the datasets?

The distribution of this changes based on what you have installed. I still think removing randomness from most of these APIs is better.

Contributor Author

Ah, these docs are stale. After our chat on Monday I took your recommendation of removing randomness from the datasets in df5dcdb. Now, an environment has a default benchmark (it defaults to the first available, `next(env.datasets.benchmarks())`). If there are no datasets, a `TypeError` is raised on the call to `reset()`.

@@ -25,10 +24,10 @@ class MinimizationError(OSError):

 # A hypothesis is a callback that accepts as input an enivornment in a given
 # state returns true if a particular hypothesis holds, else false.
-Hypothesis = Callable[[CompilerEnv], bool]
+Hypothesis = Callable[["CompilerEnv"], bool]  # noqa: F821
Contributor

I'm just curious, what do the quotes do in "CompilerEnv"?

Contributor Author

Using a string literal as a type hint defers its evaluation. Otherwise the name CompilerEnv would be evaluated here, at the point where the alias is defined. This is used for forward references or, in this case, to break a circular reference.

As of py3.7 there is a better way of handling this, but CompilerGym supports py3.6: https://www.python.org/dev/peps/pep-0563
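
As a minimal sketch of the deferred evaluation described above (the `CompilerEnv` class, its `reward` attribute, and the `positive_reward` callback here are stand-ins invented for illustration, not the real CompilerGym code), the alias can be defined before the class it names exists:

```python
from typing import Callable

# The string "CompilerEnv" is stored as a forward reference and is not looked
# up when this line runs, so the class does not need to be defined or
# imported yet. Writing the bare name CompilerEnv here would raise NameError.
Hypothesis = Callable[["CompilerEnv"], bool]


class CompilerEnv:
    """Stand-in class for illustration; not the real CompilerGym environment."""

    def __init__(self, reward: float) -> None:
        self.reward = reward


def positive_reward(env: "CompilerEnv") -> bool:
    """An example hypothesis: holds when the (hypothetical) reward is positive."""
    return env.reward > 0


def check(hypothesis: Hypothesis, env: "CompilerEnv") -> bool:
    """Evaluate a hypothesis callback against an environment."""
    return hypothesis(env)


holds = check(positive_reward, CompilerEnv(reward=1.5))
```

The same pattern works when the class lives in a module that cannot be imported at this point without creating an import cycle.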

This extends the micro-benchmark script to record and report runtimes
for operations on the example gym service implementations in C++ and
Python. The idea is that this provides a useful reference when
evaluating the measurements of other environments. Since the example
services do no compilation, the benchmark performances are close to
a roofline.
This switches over the `CompilerEnv` environment to use the new
dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol
buffer has been used to provide a serializable representation of
benchmarks that can be passed back and forth between the service and
the frontend.

Initially, it was up to the compiler service to maintain the set of
available benchmarks, exposing the available benchmarks with a
`GetBenchmarks()` RPC method, and allowing new benchmarks to be added
using an `AddBenchmarks()` method.

This was fine for the initial use case of shipping a handful of
benchmarks and allowing ad-hoc new benchmarks to be added, but for
managing larger sets of benchmarks, a *datasets* abstraction was
added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a
[Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20)
tuple was added that describes a set of programs and links to a
tarball containing those programs. The tarball is required to have a
JSON file containing metadata, and a directory containing the
benchmarks, one file per benchmark. A set of operations was added to
the frontend command line to make downloading and unpacking these
tarballs easier:

https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction** Both the environment and backend service
have to know about datasets. This means duplicated logic, and adds a
maintenance burden of keeping the C++ and Python logic in sync.

(2) **Inflexible** Only supports environments in which a single file
represents a benchmark. No support for multi-file benchmarks,
benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n*
is the total number of benchmarks. At init time, each service needs to
recursively scan a directory tree to build a list of available
benchmarks. This list must be kept in memory. This adds startup time,
and also causes cache invalidation issues when multiple environment
instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns
the benchmarks and datasets, not the service. This uses the new
`Dataset` class hierarchy that has been added in previous pull
requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead the service
simply keeps a small cache of benchmarks that it has seen. If a
session request has a benchmark URI that is not in this cache, the
service returns a "resource not found" error and the frontend logic
can then respond by sending it a copy of the benchmark as a
`Benchmark` proto. The service is free to cache this for future use,
and can empty the cache whenever it wants.
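
The exchange described above can be sketched roughly as follows. This is an illustrative model only: `Service`, `Frontend`, `ResourceNotFound`, and the eviction policy are hypothetical stand-ins, not the actual RPC interface or the real `Benchmark` proto.

```python
from typing import Dict, NamedTuple


class Benchmark(NamedTuple):
    """Stand-in for the Benchmark proto: a URI plus the program data."""

    uri: str
    program: bytes


class ResourceNotFound(Exception):
    """Hypothetical error for a benchmark URI missing from the service cache."""


class Service:
    """Hypothetical backend: knows nothing about datasets, only cached benchmarks."""

    def __init__(self, max_cache_size: int = 2) -> None:
        self.max_cache_size = max_cache_size
        self.cache: Dict[str, Benchmark] = {}

    def add_benchmark(self, benchmark: Benchmark) -> None:
        # The service may evict whenever it wants; this sketch drops the
        # oldest entry once the cache is full.
        if benchmark.uri not in self.cache and len(self.cache) >= self.max_cache_size:
            self.cache.pop(next(iter(self.cache)))
        self.cache[benchmark.uri] = benchmark

    def start_session(self, uri: str) -> str:
        if uri not in self.cache:
            raise ResourceNotFound(uri)
        return f"session on {uri}"


class Frontend:
    """Hypothetical frontend that owns the benchmarks and retries on a cache miss."""

    def __init__(self, service: Service, benchmarks: Dict[str, Benchmark]) -> None:
        self.service = service
        self.benchmarks = benchmarks

    def reset(self, uri: str) -> str:
        try:
            return self.service.start_session(uri)
        except ResourceNotFound:
            # Send the service a copy of the benchmark, then try again.
            self.service.add_benchmark(self.benchmarks[uri])
            return self.service.start_session(uri)
```

Because the frontend can always resend a benchmark, the service cache can stay small and be emptied without losing correctness.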

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes
much easier for users to define their own datasets.

(2) Reduces compiler service startup time as it removes the need for
each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully
enumerable, allowing for program generators that can produce a
theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are
generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can
now be installed on-demand.
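
To illustrate benefits (3) and (4), here is a rough sketch of a frontend-side dataset whose benchmarks are generated on demand. The `GeneratorDataset` class, its method names, the `generator://` URI layout, and the toy program it emits are assumptions made for this example rather than the actual `Dataset` class hierarchy.

```python
import itertools
from typing import Iterator


class GeneratorDataset:
    """Hypothetical dataset whose benchmarks are produced on demand from a seed."""

    def __init__(self, name: str) -> None:
        self.name = name

    def benchmark_uris(self) -> Iterator[str]:
        # Unbounded: callers must not try to materialize this as a list.
        for seed in itertools.count():
            yield f"generator://{self.name}/{seed}"

    def benchmark(self, uri: str) -> bytes:
        # Nothing is downloaded or stored ahead of time; the program is
        # derived from the seed encoded in the URI when it is requested.
        seed = int(uri.rsplit("/", 1)[1])
        return f"int main() {{ return {seed % 256}; }}".encode()


dataset = GeneratorDataset("toy-v0")
# Take a bounded prefix of the unbounded URI stream.
first_three = list(itertools.islice(dataset.benchmark_uris(), 3))
program = dataset.benchmark("generator://toy-v0/7")
```

Since the URI stream never ends, consumers take a bounded slice of it rather than enumerating the dataset.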

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark`
instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is
provided at init time, one is chosen deterministically. If you wish to
select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called
first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a
random benchmark. If you would like a random benchmark, you must now
roll your own random picker using `env.datasets.benchmark_uris()` and
similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior
to no-ops, and removes the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case
naming convention of gym. The old `cBench` datasets are kept around
but are marked deprecated to encourage migration.
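
For item (4), one possible shape for a hand-rolled random picker is sketched below. The `fake_benchmark_uris()` generator is a hypothetical stand-in for `env.datasets.benchmark_uris()`, and `pick_random_uri()` with its `limit` parameter is an illustrative helper, not part of CompilerGym. The cap is there because the URI iterator may not be fully enumerable.

```python
import itertools
import random
from typing import Iterable, Iterator, Optional


def fake_benchmark_uris() -> Iterator[str]:
    """Hypothetical stand-in for env.datasets.benchmark_uris(); never terminates."""
    for i in itertools.count():
        yield f"benchmark://example-v0/{i}"


def pick_random_uri(
    uris: Iterable[str], limit: int = 1000, rng: Optional[random.Random] = None
) -> str:
    """Pick uniformly from the first `limit` URIs of a possibly unbounded iterator."""
    rng = rng or random.Random()
    # Only a bounded prefix is read, so an endless iterator is safe to pass in.
    candidates = list(itertools.islice(uris, limit))
    if not candidates:
        raise ValueError("no benchmark URIs available")
    return rng.choice(candidates)


# Passing a seeded generator makes the choice reproducible between runs.
uri = pick_random_uri(fake_benchmark_uris(), limit=10, rng=random.Random(0))
```

Sampling only from a prefix biases the choice toward whichever URIs come first, which is a trade-off of this simple approach.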

Migrating to the new interface
------------------------------

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update
to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a
benchmark. Previously, calling `env.reset()` would select a random
benchmark. Now, `env.reset()` always selects the last used benchmark,
or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select
benchmarks randomly. Now, `env.benchmark` is always set to the
previously used benchmark, or a predetermined default benchmark if
none has been provided.

(5) Remove calls to `env.require_dataset()`.

Issue facebookresearch#45.
@ChrisCummins ChrisCummins merged commit 6f7b6ff into facebookresearch:development Apr 29, 2021
@ChrisCummins ChrisCummins deleted the new-dataset-api branch April 29, 2021 09:56
@ChrisCummins ChrisCummins mentioned this pull request Apr 30, 2021
9 tasks
ChrisCummins added a commit that referenced this pull request Apr 30, 2021
This release introduces some significant changes to the way that
benchmarks are managed, introducing a new dataset API. This enabled us
to add support for millions of new benchmarks and a more efficient
implementation for the LLVM environment, but this will require some
migrating of old code to the new interfaces (see “Migration Checklist”
below). Some of the key changes of this release are:

-   [Core API change] We have added a Python Benchmark class (#190). The
    env.benchmark attribute is now an instance of this class rather than
    a string (#222).
-   [Core behavior change] Environments will no longer select benchmarks
    randomly. Now env.reset() will always select the last-used
    benchmark, unless the benchmark argument is provided or
    env.benchmark has been set. If no benchmark is specified, a default
    is used.
-   [API deprecations] We have added a new Dataset class hierarchy
    (#191, #192). All datasets are now available without needing to be
    downloaded first, and a new Datasets class can be used to iterate
    over them (#200). We have deprecated the old dataset management
    operations and the compiler_gym.bin.datasets script, and removed the
    --dataset and --ls_benchmark flags from the command line tools.
-   [RPC interface change] The StartSession RPC endpoint now accepts a
    list of initial observations to compute. This removes the need for
    an immediate call to Step, reducing environment reset time by 15-21%
    (#189).
-   [LLVM] We have added several new datasets of benchmarks, including
    the Csmith and llvm-stress program generators (#207), a dataset of
    OpenCL kernels (#208), and a dataset of compilable C functions
    (#210). See the docs for an overview.
-   CompilerEnv now takes an optional Logger instance at construction
    time for fine-grained control over logging output (#187).
-   [LLVM] The ModuleID and source_filename of LLVM-IR modules are now
    anonymized to prevent unintentional overfitting to benchmarks by
    name (#171).
-   [docs] We have added a Feature Stability section to the
    documentation (#196).
-   Numerous bug fixes and improvements.

Please use this checklist when updating code written for the previous
CompilerGym release:

-   Review code that accesses the env.benchmark property and update to
    env.benchmark.uri if a string name is required. Setting this
    attribute by string (env.benchmark = "benchmark://a-v0/b") and
    comparison to string types (env.benchmark == "benchmark://a-v0/b")
    still work.
-   Review code that calls env.reset() without first setting a
    benchmark. Previously, calling env.reset() would select a random
    benchmark. Now, env.reset() always selects the last used benchmark,
    or a predetermined default if none is specified.
-   Review code that relies on env.benchmark being None to select
    benchmarks randomly. Now, env.benchmark is always set to the
    previously used benchmark, or a predetermined default benchmark if
    none has been specified. Setting env.benchmark = None will raise an
    error. Select a benchmark randomly by sampling from the
    env.datasets.benchmark_uris() iterator.
-   Remove calls to env.require_dataset() and related operations. These
    are no longer required.
-   Remove accesses to env.benchmarks. An iterator over available
    benchmark URIs is now available at env.datasets.benchmark_uris(),
    but the list of URIs cannot be relied on to be fully enumerable (the
    LLVM environments have over 2^32 URIs).
-   Review code that accesses env.observation_space and update to
    env.observation_space_spec where necessary (#228).
-   Update compiler service implementations to support the updated RPC
    interface by removing the deprecated GetBenchmarks RPC endpoint and
    replacing it with Dataset classes. See the example service for
    details.
-   [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
-   [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
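
The checklist items about env.benchmark can be pictured with the stand-in classes below. This is only a sketch under stated assumptions: `Benchmark` and `Env` here are hypothetical models, not the CompilerGym classes, and raising `TypeError` for `None` is an assumption, since the notes above say only that an error is raised.

```python
class Benchmark:
    """Stand-in benchmark object that carries a URI and compares equal to strings."""

    def __init__(self, uri: str) -> None:
        self.uri = uri

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Benchmark):
            return self.uri == other.uri
        if isinstance(other, str):
            return self.uri == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.uri)


class Env:
    """Stand-in environment whose benchmark attribute is always set."""

    def __init__(self, default_uri: str) -> None:
        self._benchmark = Benchmark(default_uri)

    @property
    def benchmark(self) -> Benchmark:
        return self._benchmark

    @benchmark.setter
    def benchmark(self, value) -> None:
        if value is None:
            # Assumption: the exact error type is not stated in the notes.
            raise TypeError("benchmark cannot be None")
        # Accept either a Benchmark instance or a URI string.
        self._benchmark = value if isinstance(value, Benchmark) else Benchmark(value)


env = Env("benchmark://a-v0/default")
env.benchmark = "benchmark://a-v0/b"  # Setting by string still works.
name = env.benchmark.uri              # Use .uri where a string name is required.
```

Code that previously treated env.benchmark as a string would switch to the `.uri` attribute, while equality checks against strings can stay as they are.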
@ChrisCummins ChrisCummins mentioned this pull request Apr 30, 2021
9 tasks
bwasti pushed a commit to bwasti/CompilerGym that referenced this pull request Aug 3, 2021
Labels
CLA Signed
3 participants