RFC: tf.data Service #195
Conversation
- **RFC #:** [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)
- **Author(s):** Andrew Audibert ([email protected]), Rohan Jain ([email protected])
- **Sponsor:** Jiri Simsa ([email protected])
- **Updated:** 2019-01-13

Provide an API and implementation of a tf.data service which can process tf.data datasets in a distributed manner. The service can be run outside the TensorFlow cluster or be exported as a gRPC service by TensorFlow servers.

Goals:

- Enable horizontal scaling of dataset computation to improve performance of input-bound dataset pipelines.
- Improve tf.data integration with the tf.distribute API. In particular, support dynamic sharding of data across multiple processes.
- Provide visitation guarantees for distributed training jobs.
It is expected that the dataset contains at least one `.distribute(address)`
transformation, otherwise this method will print a warning and do nothing.

`create_iteration` will first register the dataset with the tf.data service
How does `create_iteration` find the service?
The service address is configured within the dataset by calling `dataset.apply(tf.data.experimental.service.distribute(address))`.
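For illustration, a minimal sketch of what that might look like using the RFC's proposed `distribute(address)` transformation; the address is a placeholder, and the final API may differ:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000)
dataset = dataset.map(lambda x: x * 2)  # user-defined preprocessing

# The service address is embedded in the dataset itself; `create_iteration`
# later reads it from this transformation.
dataset = dataset.apply(
    tf.data.experimental.service.distribute("grpc://data-service-master:5000"))
```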
We should consider allowing users to specify a ClusterResolver here
Agreed. A ClusterResolver would be helpful.
This section calls out caveats that users will need to be aware of when using
the tf.data service.

- Due to the nature of dataset splitting, elements will not be processed in
I think it's better to document an actual guarantee, such that the order within each split must be consistent with the original global order of the source dataset, but no promises are made around ordering across splits.
We don't have such a guarantee, because the dataset pipeline for each task processes multiple splits, so even the order within a split could differ from the order it would have if the split were processed in isolation.
Update the proposal to support exactly-once visitation even when the service is executing non-deterministically. Also, add discussion of the visitation guarantees provided when the dataset produces outputs non-deterministically.
Thanks @aaudiber, @rohan100jain and @jsimsa for this RFC!! The distributed dataset is really cool: it can speed up both training and inference and simplify the implementation of distributed data pipelines! A few comments are added above.
N input workers to feed M accelerators. The number of input workers can be
scaled up or down as needed to keep up with the accelerators.

### Distributed training requires a distribution-aware input pipeline.
The distributed dataset will be very useful for inference as well!
Thank you @feihugis for the thoughtful comments.
provides dataset elements to consumers over RPC.

**Consumer**: A machine which consumes data from the tf.data service. The
consumer may be attached to a GPU or TPU, or use data for on-CPU training.
Would it be possible to include an example configuration?
What sort of configuration are you interested in?
This looks great! I know this is not the goal of this proposal, but would it be possible (in the future) to build on this and have the training jobs consume the datasets from any gRPC service (not necessarily one generated using TensorFlow) as long as it conforms to a certain API?

@pavanky Check out tensorflow/io#206

Thanks!
Meeting notes from design review on 1/30/20 (thank you @rohan100jain for taking these!):

Changes to the doc since it was mailed out:
- Distribution may change the order of elements being produced by the dataset. How do we communicate this? Current plan is documentation.
- `skip`, `take` and `scan` might not be splittable (e.g. `take`: if done per task, we end up with 10 * num_worker elements).
  - Decision: prohibit these and ask users to chain them after the distributed part.
- Share iteration ids between tasks.
  - Decision: use collective ops to broadcast the iteration ids. They will be publicly available soon.
- `service.distribute` takes in a ClusterResolver instead of a master address.
  - Decision: allow ClusterResolver as input and change the API arg to `master_address_or_resolver`.
- azaks: is the order deterministic?
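A rough sketch of the `master_address_or_resolver` decision above; only the argument name comes from the notes, everything else is assumed for illustration:

```python
import tensorflow as tf

def distribute(master_address_or_resolver):
  """Hypothetical transformation accepting an address string or a ClusterResolver."""
  if isinstance(master_address_or_resolver, str):
    master_address = master_address_or_resolver
  else:
    # Any tf.distribute.cluster_resolver.ClusterResolver can report its master.
    master_address = master_address_or_resolver.master()

  def _apply_fn(dataset):
    # The real implementation would insert the tf.data service transformation
    # that hands `dataset` off to the master at `master_address`; this stub
    # only shows how the address would be resolved.
    return dataset

  return _apply_fn
```

Usage would then be either `dataset.apply(distribute("grpc://master:5000"))` or `dataset.apply(distribute(tf.distribute.cluster_resolver.TFConfigClusterResolver()))`.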
Congratulations on the acceptance of this RFC! Can't wait to see it implemented in TF core.
Thanks and apologies for the delayed review!
Args:
dataset: The dataset to begin iteration over.
num_consumers: The number of consumers to divide the dataset between. Set
this if you require determinism.
The wording of this suggests it is optional. Should its default value be None to imply Auto, or is this actually not optional, in which case we shouldn't mention that this needs to be set for determinism?
If the user wants determinism, they need to tell the tf.data service ahead of time how many consumers they will use for reading. Otherwise, the user doesn't need to worry about `num_consumers`. They can leave it as `None` and read with as many consumers as they want.
Should the default then be `None` (instead of 1)?
Sorry, I thought it was `None` already, but you are right! It should be `None`.
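Putting the thread above together, a hedged usage sketch of the proposed `create_iteration` API with `num_consumers` defaulting to `None`; the module path and exact signature are taken from the RFC draft and may change:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000)
dataset = dataset.apply(
    tf.data.experimental.service.distribute("grpc://data-service-master:5000"))

# No determinism needed: leave num_consumers as None and let any number of
# consumers join the iteration.
iteration = tf.data.experimental.service.create_iteration(dataset)

# Determinism needed: declare up front how many consumers will read.
deterministic_iteration = tf.data.experimental.service.create_iteration(
    dataset, num_consumers=8)
```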
producing deterministic results.
deterministic: Whether the iteration should be performed
deterministically. Fully deterministic output also requires setting
`num_tasks` to a fixed number, and that the input dataset is itself
and num_tasks?
and num_consumers
task. Normally it is best to leave this as None so that the master can
choose a reasonable number of tasks. Setting `num_tasks` is useful for
producing deterministic results.
deterministic: Whether the iteration should be performed
So setting both `num_tasks` and `num_consumers` and having a deterministic dataset does not suffice; we also need to set this?
I think I see what you're getting at; `deterministic` is superfluous since if users are setting `num_tasks` and `num_consumers`, they clearly care about determinism. I think we can improve this API by splitting `create_iteration` into `create_iteration(dataset)` and `create_deterministic_iteration(dataset, num_tasks, num_consumers)`.
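Hypothetical signatures for that split, purely to illustrate the suggestion (not part of the RFC as written):

```python
def create_iteration(dataset):
  """Begins an iteration with no determinism guarantees; the service chooses
  the number of tasks, and consumers may join freely."""
  ...


def create_deterministic_iteration(dataset, num_tasks, num_consumers):
  """Begins an iteration whose task and consumer counts are fixed up front so
  that output order and visitation are reproducible."""
  ...
```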
producing deterministic results.
deterministic: Whether the iteration should be performed
deterministically. Fully deterministic output also requires setting
`num_tasks` to a fixed number, and that the input dataset is itself
If this is set to true (and num_tasks and num_consumers are also set), can we somehow automatically detect non-determinism at the dataset level at construction time or runtime (and raise appropriate error(s))? I could see this saving someone a lot of debugging (e.g. they expect determinism but don't in fact get it).
Yes, I think that would be a good sanity check. In general we don't have a way to detect non-determinism in a dataset, but one thing we can do today is validate that the `experimental_deterministic` dataset option isn't set to `False`.
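A minimal sketch of that sanity check, assuming the `tf.data.Options` API of the time (`experimental_deterministic`); the helper name is illustrative:

```python
import tensorflow as tf

def validate_deterministic(dataset):
  """Rejects datasets that explicitly opt out of deterministic output."""
  options = dataset.options()
  if options.experimental_deterministic is False:
    raise ValueError(
        "Deterministic iteration was requested, but the dataset sets "
        "options.experimental_deterministic=False.")

validate_deterministic(tf.data.Dataset.range(10))  # passes: option not set to False
```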
* Minimize Surprises: Users write their datasets as though they will not be
split, so introducing splitting can easily lead to unexpected outcomes. To
mitigate this, we will be conservative about which dataset transformations
support splitting.
Is the proposal to have a whitelist of "splittable" transformations, and only if a Dataset consists of only splittable transformations can it be split?
Or are you thinking of a blacklist of non-splittable transformations instead?
In other distributed systems I've seen, splittability is usually a function of the source (not the transformations) and most user-defined transformations don't in fact break splittability (so a blacklist feels a bit more natural).
We're taking a conservative whitelisting approach to avoid unexpected behavior. We whitelist a dataset as splittable if processing and re-combining its splits gives almost identical results to processing the original dataset.
In most distributed systems, the pipeline writers are aware that what they are writing will be executed in a distributed fashion. With tf.data service, most users are expected to write their datasets with a single-host mental model.
```cpp
class SplitGenerator {
 public:
  virtual Status GetNext(std::unique_ptr<Split>* split,
```
Nit: StatusOr? Similarly wherever else this applies.
Worth considering, but note that `Status` is consistent with the vast majority of TensorFlow core.
#### Supported Datasets

Not all dataset sources and transformations are easily splittable. For example,
`take`, `skip`, and `scan` require a global view of the dataset to produce
It's still not clear to me why take and skip are fundamentally not splittable. I've seen them be splittable in other distributed systems.
correct results. Datasets which require multiple input datasets such as `zip`
are also difficult to support, since we don't have a good way of aligning the
splits of multiple input datasets. Users who rely on these unsupported datasets
will need to move those datasets to come after the distributed part of their
Maybe good to make this clear in the high-level API above, where we can mention that anything that comes after the `ds.apply(distribute)` is not in fact distributed?
Agreed.
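For example, a sketch of moving `take` after the distributed portion (using the RFC's proposed `distribute(address)` form; the address is a placeholder, and the shipped API may differ):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1_000_000)
dataset = dataset.map(lambda x: x * 2)  # processed by tf.data service workers
dataset = dataset.apply(
    tf.data.experimental.service.distribute("grpc://data-service-master:5000"))
# Everything from here on runs locally on the consumer, so `take` sees the
# merged stream and yields exactly 100 elements rather than 100 per task.
dataset = dataset.take(100)
```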
being strongly consistent. Users may opt to use a filesystem that doesn't
support strong consistency, but they do so at the risk of two concurrently
running masters thinking they are leader. Common filesystems such as POSIX,
HDFS, and GCS support such strong consistency, but S3 does not.
Frequently they have read-after-write consistency. So if we could remove the `list_directory` part of the protocol above (and instead communicate that information some other way), we might be able to support more filesystems.
[TFX](https://www.tensorflow.org/tfx). A framework can make leveraging the
tf.data service as simple as toggling a configuration boolean, triggering the
framework to bring up tf.data service servers and add a
`tf.data.experimental.service.distribute` transformation at the end of the
Hmm, up to now my understanding was that the tf.data service itself would do the horizontal auto-scaling. Here it's suggested that this needs to happen externally? Or am I misunderstanding?
Tooling could be built around the tf.data service so that it can automatically scale, but I wouldn't consider that part of the core tf.data service. To enable autoscaling, the tf.data service would report whether it could use more resources, and it is up to external tooling to start more tf.data servers.
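A heavily simplified sketch of that split of responsibilities; the metrics hook and worker launch script below are hypothetical and not part of the proposal:

```python
import subprocess
import time

def service_wants_more_workers(master_address):
  """Hypothetical hook: ask the master whether the pipeline is input-bound.

  The RFC does not define such an endpoint; this stands in for whatever
  signal the service would eventually report.
  """
  del master_address  # placeholder implementation
  return False

def autoscale_loop(master_address):
  """External tooling (not the tf.data service) decides to start new servers."""
  while True:
    if service_wants_more_workers(master_address):
      subprocess.Popen(["python", "start_tf_data_worker.py",
                        "--master", master_address])
    time.sleep(60)
```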
Would this also help address memory issues associated with caching large datasets? In a distributed tf.data architecture, is the dataset cache distributed without overlap across the workers?

@thisisandreeeee take a look at #193.
Comment period is open till 1/27/2020.