[datasets] Switch CompilerEnv to the new dataset API. #222
Conversation
The branch was force-pushed from 336d959 to ab01348.
LGTM
Review comment on compiler_gym/envs/compiler_env.py (outdated diff), on this docstring:

```
By default, a benchmark will be selected randomly by the service from
the available benchmarks on a call to :func:`reset`. To force a specific
benchmark to be chosen, set this property (or pass
the benchmark as an argument to :func:`reset`):
```
Is there always guaranteed to be at least one available benchmark? What happens if someone uninstalls all the datasets?

The distribution of benchmarks changes based on what you have installed. I still think removing randomness from most of these APIs is better.
Ah, these docs are stale. After our chat on Monday I took your recommendation of removing randomness from the datasets in df5dcdb. Now, an environment has a default benchmark (it defaults to the first available, `next(env.datasets.benchmarks())`). If there are no datasets, a `TypeError` is raised on the call to `reset()`.
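For example, a minimal sketch of the new behavior (assuming the LLVM environment is installed; the env id here is illustrative):

```python
import compiler_gym

env = compiler_gym.make("llvm-v0")
# No benchmark was specified, so reset() uses the deterministic default,
# equivalent to next(env.datasets.benchmarks()).
env.reset()
print(env.benchmark)  # Same benchmark every run; no randomness involved.
env.close()
```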
```diff
@@ -25,10 +24,10 @@ class MinimizationError(OSError):

 # A hypothesis is a callback that accepts as input an enivornment in a given
 # state returns true if a particular hypothesis holds, else false.
-Hypothesis = Callable[[CompilerEnv], bool]
+Hypothesis = Callable[["CompilerEnv"], bool]  # noqa: F821
```
I'm just curious, what do the quotes do in `"CompilerEnv"`?
Using a string literal type hint defers its evaluation. Otherwise, the `CompilerEnv` name would be evaluated here. This is used for forward references or, in this case, to break a circular reference. As of py3.7 there is a better way of handling this, but CompilerGym supports py3.6: https://www.python.org/dev/peps/pep-0563
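A self-contained sketch of what the quotes buy you (the names below are stand-ins, not CompilerGym's own):

```python
from typing import Callable

# "Env" is not defined yet at this point in the module, so the unquoted
# name would raise NameError at import time. The string is stored as a
# typing.ForwardRef and only resolved by static type checkers.
Hypothesis = Callable[["Env"], bool]  # noqa: F821


class Env:
    """Stand-in for a class defined later, or in a module that imports us."""

    def in_initial_state(self) -> bool:
        return True


def check(env: Env, hypothesis: Hypothesis) -> bool:
    # The alias behaves like any other type hint at call sites.
    return hypothesis(env)


assert check(Env(), lambda env: env.in_initial_state())
```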
The branch was force-pushed from 1115a10 to 70bc82c.
This extends the micro-benchmark script to record and report runtimes for operations on the example gym service implementations in C++ and Python. The idea is that this provides a useful reference when evaluating the measurements of other environments. Since the example services do no compilation, the benchmark performances are close to a roofline.
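As an illustration of the kind of measurement involved (a sketch only; the env id and iteration counts are assumptions, and this is not the actual benchmark script):

```python
import timeit

import compiler_gym

env = compiler_gym.make("llvm-v0")
env.reset()

# Average wall-clock time of reset() and of a single step with a random action.
reset_ms = timeit.timeit(env.reset, number=10) / 10 * 1e3
step_ms = timeit.timeit(lambda: env.step(env.action_space.sample()), number=100) / 100 * 1e3
print(f"reset: {reset_ms:.2f} ms, step: {step_ms:.3f} ms")
env.close()
```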
The branch was force-pushed from 70bc82c to 0b34c14.
This switches over the `CompilerEnv` environment to use the new dataset API, dropping the `LegacyDataset` class.

Disclaimer: Sorry for the large diff :-( Two reasons: (1) renaming cBench to cbench causes a lot of churn, and (2) the datasets API touches nearly everything, and the switch from the old to the new API should be atomic.

Background

Since the very first prototype of CompilerGym, a `Benchmark` protocol buffer has been used to provide a serializable representation of benchmarks that can be passed back and forth between the service and the frontend (see CompilerGym/compiler_gym/service/proto/compiler_gym_service.proto, lines 266 to 276 in 49c10d7).

Initially, it was up to the compiler service to maintain the set of available benchmarks, exposing them through a `GetBenchmarks()` RPC method and allowing new benchmarks to be added using an `AddBenchmarks()` method. This was fine for the initial use case of shipping a handful of benchmarks and allowing ad-hoc new benchmarks to be added, but for managing larger sets of benchmarks, a *datasets* abstraction was added.

Initial Datasets abstraction

To add support for managing large sets of programs, a [Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20) tuple was added that describes a set of programs, along with a link to a tarball containing those programs. The tarball is required to have a JSON file containing metadata and a directory containing the benchmarks, one file per benchmark. A set of operations were added to the frontend command line to make downloading and unpacking these tarballs easier: https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach

(1) **Leaky abstraction.** Both the environment and the backend service have to know about datasets. This means redundant, duplicated logic, and it adds the maintenance burden of keeping the C++/Python logic in sync.

(2) **Inflexible.** Only environments in which a single file represents a benchmark are supported. There is no support for multi-file benchmarks, benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n* is the total number of benchmarks. At init time, each service needs to recursively scan a directory tree to build a list of available benchmarks. This list must be kept in memory. This adds startup time, and also causes cache invalidation issues when multiple environment instances are modifying the underlying filesystem.

New Dataset API

This pull request changes the ownership model so that the *Environment* owns the benchmarks and datasets, not the service. This uses the new `Dataset` class hierarchy that was added in previous pull requests: #190, #191, #192, #200, #201.

Now, the backend has no knowledge of "datasets". Instead, the service simply keeps a small cache of benchmarks that it has seen. If a session request has a benchmark URI that is not in this cache, the service returns a "resource not found" error, and the frontend logic can respond by sending it a copy of the benchmark as a `Benchmark` proto. The service is free to cache this for future use, and can empty the cache whenever it wants.

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes much easier for users to define their own datasets.

(2) It reduces compiler service startup time, as it removes the need for each service to do a recursive filesystem sweep.

(3) It removes the requirement that the set of benchmarks be fully enumerable, allowing for program generators that can produce a theoretically infinite number of benchmarks.

(4) It adds support for lazily-compiled datasets of programs that are generated on-demand.

(5) It removes the need to download datasets ahead of time. Datasets can now be installed on-demand.

Summary of changes

(1) Changes the type of `env.benchmark` from a string to a `Benchmark` instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is provided at init time, one is chosen deterministically. If you wish to select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a random benchmark. If you would like a random benchmark, you must now roll your own random picker using `env.datasets.benchmark_uris()` and similar (see the sketch at the end of this description).

(5) Deprecates all `LegacyDataset` operations, changing their behavior to no-ops, and removes the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case naming convention of gym. The old `cBench` datasets are kept around but are marked deprecated to encourage migration.

Migrating to the new interface

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a benchmark. Previously, calling `env.reset()` would select a random benchmark. Now, `env.reset()` always selects the last used benchmark, or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select benchmarks randomly. Now, `env.benchmark` is always set to the previously used benchmark, or a predetermined default benchmark if none has been provided.

(5) Remove calls to `env.require_dataset()`.

Performance impact

A comparison of `benchmarks/bench_test.py` on current `development` vs this PR: [results table not captured in this transcript]

Issue #45.
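Since `env.benchmark = None` is gone, a random picker has to be rolled by hand. A minimal sketch, assuming the LLVM environment is installed and using an arbitrary sample-size cap (since `benchmark_uris()` may be practically infinite):

```python
import random
from itertools import islice

import compiler_gym

env = compiler_gym.make("llvm-v0")
# Draw from a bounded prefix of the URI iterator rather than trying to
# materialize a possibly-unbounded list.
uris = list(islice(env.datasets.benchmark_uris(), 1000))
env.reset(benchmark=random.choice(uris))
print(env.benchmark)
env.close()
```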
The branch was force-pushed from 0b34c14 to 0260132.
This release introduces some significant changes to the way that benchmarks are managed, introducing a new dataset API. This enabled us to add support for millions of new benchmarks and a more efficient implementation for the LLVM environment, but it will require some migration of old code to the new interfaces (see "Migration Checklist" below). Some of the key changes of this release are:

- [Core API change] We have added a Python Benchmark class (#190). The env.benchmark attribute is now an instance of this class rather than a string (#222).
- [Core behavior change] Environments will no longer select benchmarks randomly. env.reset() will now always select the last-used benchmark, unless the benchmark argument is provided or env.benchmark has been set. If no benchmark is specified, a default is used.
- [API deprecations] We have added a new Dataset class hierarchy (#191, #192). All datasets are now available without needing to be downloaded first, and a new Datasets class can be used to iterate over them (#200). We have deprecated the old dataset management operations and the compiler_gym.bin.datasets script, and removed the --dataset and --ls_benchmark flags from the command line tools.
- [RPC interface change] The StartSession RPC endpoint now accepts a list of initial observations to compute. This removes the need for an immediate call to Step, reducing environment reset time by 15-21% (#189).
- [LLVM] We have added several new datasets of benchmarks, including the Csmith and llvm-stress program generators (#207), a dataset of OpenCL kernels (#208), and a dataset of compilable C functions (#210). See the docs for an overview.
- CompilerEnv now takes an optional Logger instance at construction time for fine-grained control over logging output (#187).
- [LLVM] The ModuleID and source_filename of LLVM-IR modules are now anonymized to prevent unintentional overfitting to benchmarks by name (#171).
- [docs] We have added a Feature Stability section to the documentation (#196).
- Numerous bug fixes and improvements.

Please use this checklist when updating code for the previous CompilerGym release:

- Review code that accesses the env.benchmark property and update to env.benchmark.uri if a string name is required. Setting this attribute by string (env.benchmark = "benchmark://a-v0/b") and comparison to string types (env.benchmark == "benchmark://a-v0/b") still work.
- Review code that calls env.reset() without first setting a benchmark. Previously, calling env.reset() would select a random benchmark. Now, env.reset() always selects the last used benchmark, or a predetermined default if none is specified.
- Review code that relies on env.benchmark being None to select benchmarks randomly. Now, env.benchmark is always set to the previously used benchmark, or a predetermined default benchmark if none has been specified. Setting env.benchmark = None will raise an error. Select a benchmark randomly by sampling from the env.datasets.benchmark_uris() iterator.
- Remove calls to env.require_dataset() and related operations. These are no longer required.
- Remove accesses to env.benchmarks. An iterator over available benchmark URIs is now available at env.datasets.benchmark_uris(), but the list of URIs cannot be relied on to be fully enumerable (the LLVM environments have over 2^32 URIs).
- Review code that accesses env.observation_space and update to env.observation_space_spec where necessary (#228).
- Update compiler service implementations to support the updated RPC interface by removing the deprecated GetBenchmarks RPC endpoint and replacing it with Dataset classes. See the example service for details.
- [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
- [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
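For the env.benchmark items in the checklist above, a short migration sketch (the URI is illustrative and assumes the cbench-v1 dataset is available):

```python
import compiler_gym

env = compiler_gym.make("llvm-v0")

# Setting the property by string still works; it takes effect on the next
# reset(), and the property then holds a Benchmark instance.
env.benchmark = "benchmark://cbench-v1/crc32"
env.reset()

# Comparison against a string also still works...
assert env.benchmark == "benchmark://cbench-v1/crc32"

# ...but code that needs an actual string should use the uri attribute.
name = str(env.benchmark.uri)

env.close()
```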