forked from tensorflow/community
Showing 1 changed file with 20 additions and 15 deletions.
@@ -1,11 +1,11 @@
 # Distributed tf.data service

-| Status | Proposed |
+| Status | Accepted |
 | :------------ | :------------------------------------------------------ |
 | **RFC #** | [195](https://github.com/tensorflow/community/pull/195) |
 | **Author(s)** | Andrew Audibert ([email protected]) Rohan Jain ([email protected]) |
 | **Sponsor** | Jiri Simsa ([email protected]) |
-| **Updated** | 2019-01-24 |
+| **Updated** | 2019-01-30 |

 ## Objective

@@ -143,14 +143,16 @@ here to implement datasets which produce per-replica elements, enabling
 idiomatic control flow.

 ```python
-def tf.data.experimental.service.distribute(address):
+def tf.data.experimental.service.distribute(address_or_resolver):
   """Marks that a dataset should be processed by the tf.data service.

   ds = ... # dataset to distribute
-  ds = ds.apply(tf.data.experimental.service.distribute(address))
+  ds = ds.apply(
+      tf.data.experimental.service.distribute(address_or_resolver))

   Args:
-    address: The address of the tf.data service master.
+    address_or_resolver: The address of the tf.data service master, or a
+      cluster resolver that can be used to determine the master address.

   Returns:
     A function that can be passed to `dataset.apply()`.

@@ -622,22 +624,25 @@ service. We will also provide a tutorial for using the tf.data service.
 * How should we communicate that distributing a dataset will change the order
   in which elements are processed? If users' datasets rely on elements being
   processed in a certain order, they could face unpleasant surprises.
-  - Current plan is to address this through documentation.
+  - Final decision: Address this through documentation.
 * Should we support splitting `skip`, `take`, and `scan` by having them
   operate at a per-task level (e.g. skip or take the first `N` elements within
   each task)?
-  - Leaning towards supporting these operations at a per-task level. This is
-    consistent with how skip/take/scan behave today when using distribution
-    strategies to distribute a dataset.
+  - Final decision: Prohibit distributing these transformations, and tell
+    users to instead use these transformations *after* applying the
+    `distribute` transformation.
 * Is there a more user-friendly way to share iteration ids across consumers?
   Distribution strategy is well-equipped with collective ops to share the
   iteration ids, but sharing the iteration id could be a heavy burden for
   some users.
-  - Distributing iteration ids is simple in the common case where a single
-    process builds the graph. If users are advanced enough to do distributed
-    training without distribution strategies, they will likely have a
-    different mechanism available for distributing iteration ids.
+  - Final decision: It is a reasonable expectation for users to either use
+    distribution strategies, or distribute their own iteration ids.
+    TensorFlow will soon have public APIs for collective operations that
+    would make it easy to broadcast iteration ids.
 * Can `service.distribute` take a `ClusterResolver` so that the master
   hostname isn't baked into the dataset definition?
-  - We can achieve this by having the `distribute` transformation take a
-    master_address_or_resolver.
+  - Final decision: Accept `master_address_or_resolver`, and wait to resolve
+    the master address until iteration begins. The `ClusterResolver` will be
+    stored in the Python `Dataset` object. In the future, we may want C++
+    implementations of `ClusterResolver` so that we can represent the
+    resolver within the dataset graph.
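
The last decision above (store the resolver in the Python object and resolve the master address only when iteration begins) can be illustrated with a small pure-Python sketch. It does not use TensorFlow and is not the RFC's implementation: `DistributedDatasetSketch` and `FakeClusterResolver` are hypothetical names, and the only assumption about a real resolver is that it exposes a `master()` method returning an address string.

```python
class FakeClusterResolver:
    """Hypothetical stand-in for a cluster resolver (assumed to expose `master()`)."""

    def __init__(self, address):
        self.address = address
        self.calls = 0  # counts how often the address was actually resolved

    def master(self):
        self.calls += 1
        return self.address


class DistributedDatasetSketch:
    """Hypothetical stand-in for a Python `Dataset` holding an address or resolver."""

    def __init__(self, address_or_resolver):
        # Store the argument untouched; nothing is resolved at construction time.
        self._address_or_resolver = address_or_resolver

    def _resolve_master_address(self):
        target = self._address_or_resolver
        if isinstance(target, str):
            return target  # a plain address needs no resolution
        return target.master()  # assumption: resolver objects expose `master()`

    def __iter__(self):
        # Resolution is deferred until iteration begins.
        address = self._resolve_master_address()
        # A real implementation would connect to the master here; the sketch
        # simply yields the resolved address so the behaviour is observable.
        yield address


resolver = FakeClusterResolver("grpc://old-master:5000")
ds = DistributedDatasetSketch(resolver)
resolver.address = "grpc://new-master:5000"  # cluster changes before iteration
resolved = list(ds)
```

Because resolution happens in `__iter__`, the dataset definition carries no hostname, and a master that moves between definition and iteration is still found.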
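
The decision on `skip`, `take`, and `scan` is easier to see with numbers. The following pure-Python simulation is only a sketch: it assumes, for illustration, that source elements are split round-robin across tasks (the RFC does not specify this here), and the helper names are hypothetical. It contrasts a per-task `take(n)` with a `take(n)` applied after the distributed stream is merged.

```python
def shard(elements, num_tasks):
    """Split elements round-robin across tasks (an illustrative assumption)."""
    return [elements[i::num_tasks] for i in range(num_tasks)]


def take_per_task(elements, num_tasks, n):
    """Each task applies take(n) to its own split before results are merged."""
    merged = []
    for split in shard(elements, num_tasks):
        merged.extend(split[:n])
    return merged


def take_after_distribute(elements, num_tasks, n):
    """Tasks process their full splits; take(n) is applied to the merged stream."""
    merged = [e for split in shard(elements, num_tasks) for e in split]
    # The real service does not guarantee any particular merge order;
    # concatenation is used here only to keep the sketch deterministic.
    return merged[:n]


elements = list(range(10))
per_task = take_per_task(elements, num_tasks=3, n=2)
after = take_after_distribute(elements, num_tasks=3, n=2)
print(len(per_task), len(after))  # 6 2
```

With three tasks, a per-task `take(2)` produces six elements, while applying `take(2)` after `distribute` produces exactly two, which matches what a user writing `take(2)` most likely expects.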