Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets-overhaul #3] Add Dataset helper subclasses. #192

Merged
merged 1 commit into from
Apr 22, 2021

Conversation

ChrisCummins
Copy link
Contributor

@ChrisCummins ChrisCummins commented Apr 21, 2021

This adds three helper classes for describing datasets:

  • FilesDataset - A dataset comprising a directory tree files.
  • TarDataset - A dataset comprising a files tree stored in a tar archive.
  • TarDatasetWithManifest - A tarball-based dataset that has a manifest file which lists the URIs.

Issue #45.

@ChrisCummins ChrisCummins added this to the v0.1.8 milestone Apr 21, 2021
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 21, 2021
@ChrisCummins ChrisCummins requested a review from JD-ETH April 22, 2021 13:25
compiler_gym/datasets/files_dataset.py Show resolved Hide resolved
compiler_gym/datasets/files_dataset.py Outdated Show resolved Hide resolved
compiler_gym/datasets/files_dataset.py Outdated Show resolved Hide resolved
# There aren't enough files in this directory to bring us up to
# the target file index, so skip the whole lot.
i += len(valid_files)
dirs.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.Why sort?
2. if there isn't enough files in the dataset_root compared to n, why not throw immediately? And why would i not be equal to 0 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Sort to keep iteration order consistent (see above).
  2. os.walk() returns a list of files in the current directory, and a list of subdirectories that will be visited next. If we don't have enough files in the current directory, then we can descend into the subdirectories.

compiler_gym/datasets/tar_dataset.py Outdated Show resolved Hide resolved
elif self.manifest_compression == "gz":
with gzip.GzipFile(fileobj=manifest_data) as f:
manifest_data = f.read()
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manifest sounds like a metadata for me, and why can't it just be plain txt file?
I guess you have specific dataset example in mind here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, manifests mean "lists of stuff", but it is an overloaded term. In bazel, a manifest is a list of file paths, one per line. In python packaging, a manifest is a file that has its own special syntax and commands. I'm using the term "manifest" here the same way that bazel does. I'll add an explanation in the docstring to make it clear.

compiler_gym/datasets/tar_dataset.py Outdated Show resolved Hide resolved
tests/datasets/files_dataset_test.py Show resolved Hide resolved
compiler_gym/datasets/files_dataset.py Show resolved Hide resolved
compiler_gym/datasets/files_dataset.py Outdated Show resolved Hide resolved
@ChrisCummins ChrisCummins force-pushed the dataset branch 2 times, most recently from 3455935 to f0b28e7 Compare April 22, 2021 15:39
@ChrisCummins ChrisCummins force-pushed the dataset-classes branch 2 times, most recently from 1f6ff8b to bf94c4a Compare April 22, 2021 16:46
@ChrisCummins
Copy link
Contributor Author

Thanks for the very detailed review @JD-ETH! I have gone through and updated the PR. Lemme know if that LGTY.

Cheers,
Chris

Base automatically changed from dataset to development April 22, 2021 21:14
@ChrisCummins ChrisCummins merged commit b26f512 into development Apr 22, 2021
@ChrisCummins ChrisCummins deleted the dataset-classes branch April 22, 2021 21:18
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this pull request Apr 28, 2021
This switches over the `CompilerEnv` environment to use the new
dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol
buffer has been used to provide a serializable representation of
benchmarks that can be passed back and forth between the service and
the frontend.

Initially, it was up to the compiler service to maintain the set of
available benchmarks, exposing the available benchmarks with a
`GetBenchmarks()` RPC method, and allowing new benchmarks to be added
using an `AddBenchmarks()` method.

This was fine for the initial use case of shipping a handful of
benchmarks and allowing ad-hoc new benchmarks to be added, but for
managing larger sets of benchmarks, a *datasets* abstraction was
added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a
[Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20)
tuple was added that describes a set of programs, and a link to the a
tarball containing those programs. The tarball is required to have a
JSON file containing metadata, and a directory containing the
benchmarks, one file per benchmark. A set of operations were added to
the frontend command line to make downloading and unpacking these
tarballs easier:

https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction** Both the environment and backend service
have to know about datasets. This means redundant duplicated logic,
and adds a maintenance burden of keeping the C++/python logic in sync.

(2) **Inflexible** Only supports environments in which a single file
represents a benchmark. No support for multi-file benchmarks,
benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n*
is the total number of benchmarks. At init time, each service needs to
recursively scan a directory tree to build a list of available
benchmarks. This list must be kept in memory. This adds startup time,
and also causes cache invalidation issues when multiple environment
instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns
the benchmarks and datasets, not the service. This uses the new
`Dataset` class hierarchy that has been added in previous pull
requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead the service
simply keeps a small cache of benchmarks that it has seen. If a
session request has a benchmark URI that is not in this cache, the
service returns a "resource not found" error and the frontend logic
can then respond by sending it a copy of the benchmark as a
`Benchmark` proto. The service is free to cache this for future use,
and can empty the cache whenever it wants.

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes
much easier for users to define their own datasets.

(2) Reduces compiler service startup time as it removes the need for
each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully
enumerable, allow for program generators that can produce a
theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are
generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can
now be installed on-demand.

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark`
instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is
provided at init time, one is chosen deterministically. If you wish to
select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called
first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a
random benchmark. If you would like a random benchmark, you must now
roll your own random picker using `env.datasets.benchmark_uris()` and
similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior
to no-ops, and removing the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case
naming convention of gym. The old `cBench` datasets are kept around
but are marked deprecated to encourage migration.

Migrating to the new interface
------------------------------

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update
to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a
benchmark. Previously, calling `env.reset()` would select a random
benchmark. Now, `env.reset()` always selects the last used benchmark,
or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select
benchmarks randomly. Now, `env.benchmark` is always set to the
previously used benchmark, or a predetermined default benchmark if
none has been provided.

(5) Remove calls to `env.require_dataset()`.

Issue facebookresearch#45.
@ChrisCummins ChrisCummins mentioned this pull request Apr 30, 2021
9 tasks
ChrisCummins added a commit that referenced this pull request Apr 30, 2021
This release introduces some significant changes to the way that
benchmarks are managed, introducing a new dataset API. This enabled us
to add support for millions of new benchmarks and a more efficient
implementation for the LLVM environment, but this will require some
migrating of old code to the new interfaces (see “Migration Checklist”
below). Some of the key changes of this release are:

-   [Core API change] We have added a Python Benchmark class (#190). The
    env.benchmark attribute is now an instance of this class rather than
    a string (#222).
-   [Core behavior change] Environments will no longer select benchmarks
    randomly. Now env.reset() will now always select the last-used
    benchmark, unless the benchmark argument is provided or
    env.benchmark has been set. If no benchmark is specified, a default
    is used.
-   [API deprecations] We have added a new Dataset class hierarchy
    (#191, #192). All datasets are now available without needing to be
    downloaded first, and a new Datasets class can be used to iterate
    over them (#200). We have deprecated the old dataset management
    operations, the compiler_gym.bin.datasets script, and removed the
    --dataset and --ls_benchmark flags from the command line tools.
-   [RPC interface change] The StartSession RPC endpoint now accepts a
    list of initial observations to compute. This removes the need for
    an immediate call to Step, reducing environment reset time by 15-21%
    (#189).
-   [LLVM] We have added several new datasets of benchmarks, including
    the Csmith and llvm-stress program generators (#207), a dataset of
    OpenCL kernels (#208), and a dataset of compilable C functions
    (#210). See the docs for an overview.
-   CompilerEnv now takes an optional Logger instance at construction
    time for fine-grained control over logging output (#187).
-   [LLVM] The ModuleID and source_filename of LLVM-IR modules are now
    anonymized to prevent unintentional overfitting to benchmarks by
    name (#171).
-   [docs] We have added a Feature Stability section to the
    documentation (#196).
-   Numerous bug fixes and improvements.

Please use this checklist when updating code for the previous
CompilerGym release:

-   Review code that accesses the env.benchmark property and update to
    env.benchmark.uri if a string name is required. Setting this
    attribute by string (env.benchmark = "benchmark://a-v0/b") and
    comparison to string types (env.benchmark == "benchmark://a-v0/b")
    still work.
-   Review code that calls env.reset() without first setting a
    benchmark. Previously, calling env.reset() would select a random
    benchmark. Now, env.reset() always selects the last used benchmark,
    or a predetermined default if none is specified.
-   Review code that relies on env.benchmark being None to select
    benchmarks randomly. Now, env.benchmark is always set to the
    previously used benchmark, or a predetermined default benchmark if
    none has been specified. Setting env.benchmark = None will raise an
    error. Select a benchmark randomly by sampling from the
    env.datasets.benchmark_uris() iterator.
-   Remove calls to env.require_dataset() and related operations. These
    are no longer required.
-   Remove accesses to env.benchmarks. An iterator over available
    benchmark URIs is now available at env.datasets.benchmark_uris(),
    but the list of URIs cannot be relied on to be fully enumerable (the
    LLVM environments have over 2^32 URIs).
-   Review code that accesses env.observation_space and update to
    env.observation_space_spec where necessary (#228).
-   Update compiler service implementations to support the updated RPC
    interface by removing the deprecated GetBenchmarks RPC endpoint and
    replacing it with Dataset classes. See the example service for
    details.
-   [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
-   [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
@ChrisCummins ChrisCummins mentioned this pull request Apr 30, 2021
9 tasks
bwasti pushed a commit to bwasti/CompilerGym that referenced this pull request Aug 3, 2021
This switches over the `CompilerEnv` environment to use the new
dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol
buffer has been used to provide a serializable representation of
benchmarks that can be passed back and forth between the service and
the frontend.

Initially, it was up to the compiler service to maintain the set of
available benchmarks, exposing the available benchmarks with a
`GetBenchmarks()` RPC method, and allowing new benchmarks to be added
using an `AddBenchmarks()` method.

This was fine for the initial use case of shipping a handful of
benchmarks and allowing ad-hoc new benchmarks to be added, but for
managing larger sets of benchmarks, a *datasets* abstraction was
added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a
[Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20)
tuple was added that describes a set of programs, and a link to the a
tarball containing those programs. The tarball is required to have a
JSON file containing metadata, and a directory containing the
benchmarks, one file per benchmark. A set of operations were added to
the frontend command line to make downloading and unpacking these
tarballs easier:

https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction** Both the environment and backend service
have to know about datasets. This means redundant duplicated logic,
and adds a maintenance burden of keeping the C++/python logic in sync.

(2) **Inflexible** Only supports environments in which a single file
represents a benchmark. No support for multi-file benchmarks,
benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n*
is the total number of benchmarks. At init time, each service needs to
recursively scan a directory tree to build a list of available
benchmarks. This list must be kept in memory. This adds startup time,
and also causes cache invalidation issues when multiple environment
instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns
the benchmarks and datasets, not the service. This uses the new
`Dataset` class hierarchy that has been added in previous pull
requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead the service
simply keeps a small cache of benchmarks that it has seen. If a
session request has a benchmark URI that is not in this cache, the
service returns a "resource not found" error and the frontend logic
can then respond by sending it a copy of the benchmark as a
`Benchmark` proto. The service is free to cache this for future use,
and can empty the cache whenever it wants.

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes
much easier for users to define their own datasets.

(2) Reduces compiler service startup time as it removes the need for
each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully
enumerable, allow for program generators that can produce a
theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are
generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can
now be installed on-demand.

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark`
instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is
provided at init time, one is chosen deterministically. If you wish to
select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called
first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a
random benchmark. If you would like a random benchmark, you must now
roll your own random picker using `env.datasets.benchmark_uris()` and
similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior
to no-ops, and removing the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case
naming convention of gym. The old `cBench` datasets are kept around
but are marked deprecated to encourage migration.

Migrating to the new interface
------------------------------

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update
to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a
benchmark. Previously, calling `env.reset()` would select a random
benchmark. Now, `env.reset()` always selects the last used benchmark,
or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select
benchmarks randomly. Now, `env.benchmark` is always set to the
previously used benchmark, or a predetermined default benchmark if
none has been provided.

(5) Remove calls to `env.require_dataset()`.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this pull request Aug 3, 2021
This release introduces some significant changes to the way that
benchmarks are managed, introducing a new dataset API. This enabled us
to add support for millions of new benchmarks and a more efficient
implementation for the LLVM environment, but this will require some
migrating of old code to the new interfaces (see “Migration Checklist”
below). Some of the key changes of this release are:

-   [Core API change] We have added a Python Benchmark class (facebookresearch#190). The
    env.benchmark attribute is now an instance of this class rather than
    a string (facebookresearch#222).
-   [Core behavior change] Environments will no longer select benchmarks
    randomly. Now env.reset() will now always select the last-used
    benchmark, unless the benchmark argument is provided or
    env.benchmark has been set. If no benchmark is specified, a default
    is used.
-   [API deprecations] We have added a new Dataset class hierarchy
    (facebookresearch#191, facebookresearch#192). All datasets are now available without needing to be
    downloaded first, and a new Datasets class can be used to iterate
    over them (facebookresearch#200). We have deprecated the old dataset management
    operations, the compiler_gym.bin.datasets script, and removed the
    --dataset and --ls_benchmark flags from the command line tools.
-   [RPC interface change] The StartSession RPC endpoint now accepts a
    list of initial observations to compute. This removes the need for
    an immediate call to Step, reducing environment reset time by 15-21%
    (facebookresearch#189).
-   [LLVM] We have added several new datasets of benchmarks, including
    the Csmith and llvm-stress program generators (facebookresearch#207), a dataset of
    OpenCL kernels (facebookresearch#208), and a dataset of compilable C functions
    (facebookresearch#210). See the docs for an overview.
-   CompilerEnv now takes an optional Logger instance at construction
    time for fine-grained control over logging output (facebookresearch#187).
-   [LLVM] The ModuleID and source_filename of LLVM-IR modules are now
    anonymized to prevent unintentional overfitting to benchmarks by
    name (facebookresearch#171).
-   [docs] We have added a Feature Stability section to the
    documentation (facebookresearch#196).
-   Numerous bug fixes and improvements.

Please use this checklist when updating code for the previous
CompilerGym release:

-   Review code that accesses the env.benchmark property and update to
    env.benchmark.uri if a string name is required. Setting this
    attribute by string (env.benchmark = "benchmark://a-v0/b") and
    comparison to string types (env.benchmark == "benchmark://a-v0/b")
    still work.
-   Review code that calls env.reset() without first setting a
    benchmark. Previously, calling env.reset() would select a random
    benchmark. Now, env.reset() always selects the last used benchmark,
    or a predetermined default if none is specified.
-   Review code that relies on env.benchmark being None to select
    benchmarks randomly. Now, env.benchmark is always set to the
    previously used benchmark, or a predetermined default benchmark if
    none has been specified. Setting env.benchmark = None will raise an
    error. Select a benchmark randomly by sampling from the
    env.datasets.benchmark_uris() iterator.
-   Remove calls to env.require_dataset() and related operations. These
    are no longer required.
-   Remove accesses to env.benchmarks. An iterator over available
    benchmark URIs is now available at env.datasets.benchmark_uris(),
    but the list of URIs cannot be relied on to be fully enumerable (the
    LLVM environments have over 2^32 URIs).
-   Review code that accesses env.observation_space and update to
    env.observation_space_spec where necessary (facebookresearch#228).
-   Update compiler service implementations to support the updated RPC
    interface by removing the deprecated GetBenchmarks RPC endpoint and
    replacing it with Dataset classes. See the example service for
    details.
-   [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
-   [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants