Update doc based on design review
aaudiber committed Jan 31, 2020
1 parent 4c005e7 commit 2f698f9
Showing 1 changed file with 20 additions and 15 deletions.
35 changes: 20 additions & 15 deletions rfcs/20200113-tf-data-service.md
@@ -1,11 +1,11 @@
 # Distributed tf.data service

-| Status | Proposed |
+| Status | Accepted |
 | :------------ | :------------------------------------------------------ |
 | **RFC #** | [195](https://github.com/tensorflow/community/pull/195) |
 | **Author(s)** | Andrew Audibert ([email protected]) Rohan Jain ([email protected]) |
 | **Sponsor** | Jiri Simsa ([email protected]) |
-| **Updated** | 2019-01-24 |
+| **Updated** | 2019-01-30 |

 ## Objective

@@ -143,14 +143,16 @@ here to implement datasets which produce per-replica elements, enabling
 idiomatic control flow.

 ```python
-def tf.data.experimental.service.distribute(address):
+def tf.data.experimental.service.distribute(address_or_resolver):
   """Marks that a dataset should be processed by the tf.data service.

   ds = ... # dataset to distribute
-  ds = ds.apply(tf.data.experimental.service.distribute(address))
+  ds = ds.apply(
+      tf.data.experimental.service.distribute(address_or_resolver))

   Args:
-    address: The address of the tf.data service master.
+    address_or_resolver: The address of the tf.data service master, or a
+      cluster resolver that can be used to determine the master address.

   Returns:
     A function that can be passed to `dataset.apply()`.
@@ -622,22 +624,25 @@ service. We will also provide a tutorial for using the tf.data service.
 * How should we communicate that distributing a dataset will change the order
   in which elements are processed? If users' datasets rely on elements being
   processed in a certain order, they could face unpleasant surprises.
-  - Current plan is to address this through documentation.
+  - Final decision: Address this through documentation.
 * Should we support splitting `skip`, `take`, and `scan` by having them
   operate at a per-task level (e.g. skip or take the first `N` elements within
   each task)?
-  - Leaning towards supporting these operations at a per-task level. This is
-    consistent with how skip/take/scan behave today when using distribution
-    strategies to distribute a dataset.
+  - Final decision: Prohibit distributing these transformations, and tell
+    users to instead use these transformations *after* applying the
+    `distribute` transformation.
 * Is there a more user-friendly way to share iteration ids across consumers?
   Distribution strategy is well-equipped with collective ops to share the
   iteration ids, but sharing the iteration id could be a heavy burden for
   some users.
-  - Distributing iteration ids is simple in the common case where a single
-    process builds the graph. If users are advanced enough to do distributed
-    training without distribution strategies, they will likely have a
-    different mechanism available for distributing iteration ids.
+  - Final decision: It is a reasonable expectation for users to either use
+    distribution strategies, or distribute their own iteration ids.
+    TensorFlow will soon have public APIs for collective operations that
+    would make it easy to broadcast iteration ids.
 * Can `service.distribute` take a `ClusterResolver` so that the master
   hostname isn't baked into the dataset definition?
-  - We can achieve this by having the `distribute` transformation take a
-    master_address_or_resolver.
+  - Final decision: Accept `master_address_or_resolver`, and wait to resolve
+    the master address until iteration begins. The `ClusterResolver` will be
+    stored in the Python `Dataset` object. In the future, we may want C++
+    implementations of `ClusterResolver` so that we can represent the
+    resolver within the dataset graph.
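
The last decision in this hunk (accept an address or a resolver, and resolve the master address only when iteration begins) can be illustrated with a small TensorFlow-free sketch. Everything named here is a hypothetical stand-in: `FakeResolver`, `LazyServiceDataset`, and this `distribute` function are not the RFC's implementation or the TensorFlow API. The only behaviour taken from the text is that the resolver is stored on the Python object and consulted when iteration starts.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Union


class FakeResolver:
    """Hypothetical stand-in for a cluster resolver (not the TensorFlow class)."""

    def __init__(self, address: str) -> None:
        self._address = address
        self.calls = 0  # counts lookups so the deferral is observable

    def master(self) -> str:
        self.calls += 1
        return self._address


class LazyServiceDataset:
    """Toy dataset that keeps the address or resolver and resolves it lazily."""

    def __init__(self, elements: Iterable[int],
                 address_or_resolver: Union[str, FakeResolver]) -> None:
        self._elements: List[int] = list(elements)
        self._address_or_resolver = address_or_resolver
        self.resolved_address: Optional[str] = None

    def __iter__(self) -> Iterator[int]:
        # Assumption: resolution happens here, at the start of iteration,
        # rather than when the dataset object is constructed.
        target = self._address_or_resolver
        self.resolved_address = target if isinstance(target, str) else target.master()
        return iter(self._elements)


def distribute(
    address_or_resolver: Union[str, FakeResolver],
) -> Callable[[Iterable[int]], LazyServiceDataset]:
    """Returns a function in the style of something passed to `apply()`."""

    def apply_fn(elements: Iterable[int]) -> LazyServiceDataset:
        return LazyServiceDataset(elements, address_or_resolver)

    return apply_fn


resolver = FakeResolver("grpc://master.example:5000")  # made-up address
ds = distribute(resolver)([1, 2, 3])
not_yet_resolved = resolver.calls  # still 0: nothing has iterated yet
elements = list(ds)                # iteration triggers the lookup
```

Because the resolver is consulted inside `__iter__`, the dataset definition itself never contains the master hostname, which is the property the question in the RFC asks for.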

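The first two decisions in this hunk (element order changes under distribution, and `skip`/`take` belong *after* `distribute`) can be seen in a toy simulation. This is only a sketch under stated assumptions: the contiguous sharding in `shard` and the round-robin reads in `interleave` are invented for illustration and say nothing about how the tf.data service actually schedules tasks.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def shard(elements: List[int], num_tasks: int) -> List[List[int]]:
    """Splits elements into contiguous chunks, one per task (assumed policy)."""
    size = -(-len(elements) // num_tasks)  # ceiling division
    return [elements[i * size:(i + 1) * size] for i in range(num_tasks)]


def interleave(streams: List[Iterable[int]]) -> Iterator[int]:
    """Reads one element from each task in turn until all are exhausted."""
    iterators = [iter(s) for s in streams]
    while iterators:
        remaining = []
        for it in iterators:
            try:
                value = next(it)
            except StopIteration:
                continue  # this task is finished; drop it
            remaining.append(it)
            yield value
        iterators = remaining


source = list(range(10))
tasks = shard(source, num_tasks=2)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# Distributing changes the order in which the consumer sees elements.
distributed_order = list(interleave(tasks))

# take(3) applied after distribution yields exactly 3 elements in total.
take_after = list(islice(interleave(tasks), 3))

# take(3) applied inside each task yields 3 elements per task instead.
take_per_task = list(interleave([t[:3] for t in tasks]))
```

In this toy setup `take_after` has three elements while `take_per_task` has six, which is the kind of ambiguity the final decision avoids by requiring these transformations to be applied after `distribute`.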