RFC: tf.data Snapshot #193
Either `dataset = dataset.apply(tf.data.experimental.snapshot(...))` or `dataset = dataset.snapshot(...)` (I would prefer the former until we provide a backwards-compatibility guarantee for the API).
Why `apply` --> `snapshot`? Is there ever a case where you would snapshot without the `.apply` call? If not, can we include the `apply` under the hood and let users just call `dataset.snapshot(...)`?
I think right now `.apply(...)` is the only way to expose an experimental tf.data transformation. So the plan would be to expose this via `apply` and then, in a few months, we can promote it to a `dataset.snapshot(...)` API.
Does cache provide any advantages over snapshot?
Cache can be in memory, which is useful for testing and small datasets.
Cache can be on-disk as well. Its lifetime is more confined (to a specific job), and there are no worries about crash-safety semantics. The lifetime of snapshots could be unbounded, and I guess that's why we need to expire them explicitly.
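For illustration, a minimal sketch of the two cache modes being contrasted here (the on-disk path is a hypothetical placeholder):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000).map(lambda x: x * 2)

# In-memory cache: materialized on the first pass, lives only as long as the job.
in_memory = dataset.cache()

# On-disk cache: pass a filename; still scoped to this pipeline, without the
# fingerprinting/expiry machinery snapshot needs for its unbounded lifetime.
on_disk = dataset.cache("/tmp/my_pipeline_cache")  # hypothetical path
```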
I feel like save/load fits better into existing TF paradigms and makes the behavior more explicit. Looking at the pipeline above, it's hard to tell what is going to run on an arbitrary invocation of the pipeline. You could imagine an individual pipeline doing a load first and, if there is no file to load, doing the preprocessing and save. That's explicit and doesn't require two separate runners.
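A minimal sketch of that explicit load-or-else-preprocess-and-save pattern. This borrows the `tf.data.experimental.save`/`load` APIs that shipped in later TF releases (at the time of this discussion, `save()`/`load()` are hypothetical), with a hypothetical directory and preprocessing function:

```python
import os
import tensorflow as tf

SNAPSHOT_DIR = "/tmp/preprocessed_data"  # hypothetical location

def expensive_preprocessing(x):
    # Stand-in for real per-element preprocessing (decode, augment, ...).
    return x * 2

def make_dataset():
    if os.path.isdir(SNAPSHOT_DIR):
        # Explicit load: read back previously materialized data.
        return tf.data.experimental.load(
            SNAPSHOT_DIR,
            element_spec=tf.TensorSpec(shape=(), dtype=tf.int64))
    # No materialized data yet: preprocess and save it explicitly.
    dataset = tf.data.Dataset.range(1000).map(expensive_preprocessing)
    tf.data.experimental.save(dataset, SNAPSHOT_DIR)
    return dataset
```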
The issue with that is that the logic determining whether we want to load the input pipeline or not is not as simple as whether a file exists (the control aspect of the op). Following your train of thought, I think the better alternative would be splitting it up into reader, writer, and control as separate ops (which, as I indicated above, should be the implementation for the next iteration of the op).
I think this design was chosen so that users don't have to worry about changing their existing pipelines much at all and can benefit just by dropping `snapshot` in. With `save()` and `load()`, users will have to manage saving and loading themselves, and this potentially introduces errors (e.g. users may accidentally load an old version of a snapshot without realizing it). That said, perhaps we can re-use the C++ code that implements this for more generic `Dataset.save` and `Dataset.load` ops for users who want that sort of control?
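For concreteness, a rough sketch of the drop-in usage described above, assuming the `tf.data.experimental.snapshot(...)` transformation from this RFC (the file paths and parse function are hypothetical placeholders):

```python
import tensorflow as tf

def parse_fn(record):
    # Stand-in for the expensive per-record work we want to avoid recomputing.
    return tf.io.parse_single_example(
        record, {"image": tf.io.FixedLenFeature([], tf.string)})

files = tf.data.Dataset.list_files("/data/raw/*.tfrecord")  # hypothetical files
dataset = tf.data.TFRecordDataset(files).map(parse_fn)

# The only change to the existing pipeline: drop a snapshot in after the
# expensive stages. Later runs with an unchanged pre-snapshot graph read the
# materialized data back instead of recomputing it.
dataset = dataset.apply(tf.data.experimental.snapshot("/data/snapshots"))

dataset = dataset.shuffle(10_000).batch(32)
```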
Agree with @rohan100jain that for the next internal implementation we can expose save/load/control where appropriate, while retaining this Python interface for existing users who might want a simple solution.
We should consider that the semantics of save/load are much easier to explain to people, especially when you consider the interaction of `shuffle_on_read` with other options earlier in the input pipeline.
During earlier iterations of developing this, we considered save/load-style APIs, but making them usable is quite challenging (i.e. figuring out whether to save a new snapshot or to load an existing one). The current API serves a very concrete use case (for some significant internal users) and we feel it makes sense to expose this API to serve that use case.
Your concerns about `shuffle_on_read` etc. are valid, and we'll address them by allowing users to specify a `reader_fn`.
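As an illustration of that direction, a hedged sketch of what a user-supplied `reader_fn` might look like. The parameter name and the shape of the argument (a dataset of per-shard datasets that the function shuffles and interleaves) follow the RFC draft, so treat these details as assumptions rather than a final API:

```python
import tensorflow as tf

def shuffling_reader_fn(datasets):
    # `datasets` is a dataset whose elements are the snapshot's per-shard
    # datasets. Shuffling at this level reorders shards on read, giving
    # shuffle-on-read behavior without a dedicated shuffle_on_read option.
    datasets = datasets.shuffle(buffer_size=16)
    return datasets.interleave(
        lambda shard: shard,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

dataset = tf.data.Dataset.range(1_000_000)
dataset = dataset.apply(
    tf.data.experimental.snapshot(
        "/data/snapshots",               # hypothetical path
        reader_fn=shuffling_reader_fn))  # name per the RFC draft; the shipped API may differ
```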
@alextp @rohan100jain I have updated the design doc to add a new `reader_fn` and remove the existing reader parameters. PTAL, thanks!
How does this align (or not align) with existing checkpoint + SavedModel semantics? Is there more we can reuse here?
I don't think SavedModel supports comparisons between graphs (or, in this case, parts of a graph), which is our main use case here. In general, I am not sure TensorFlow can guarantee to produce the exact same graph (i.e. with the same node names, function names, etc.), so we are utilizing the `HashGraph` function (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/dataset_utils.cc#L733), which computes a fingerprint while ignoring node names and other things that may vary from run to run.
That is true, though I meant the file structure itself and the serialized format, not strictly the fingerprint.
That's a good point, Karmel, and we considered this while first developing. Here is why we ended up choosing something of our own:
- The checkpointing API is for saving the state of a model, whereas here we're trying to store the entire output of an input pipeline, so API-wise there isn't much alignment.
- On the implementation side, checkpointing uses BundleReader / BundleWriter, which are designed for key/value access to tensors, whereas here we want to read tensors off sequentially as soon as possible. We tried BundleReader / BundleWriter at first but abandoned it for performance reasons.
At the risk of sacrilege -- can we improve the performance of BundleReader / BundleWriter instead of just building a new way of doing this?
Update: with a little bit of tweaking I managed to get the performance of tensor bundle from 4.5k images/s to 7.2k images/s using 128 threads. Writes are scattered across multiple shards, so it wouldn't be too hard to reach maximal performance. Up till now the only constraint seems to be my 10GbE NIC, and I plan to run another benchmark on a machine with a 25GbE network.
@karmel While I think unifying the BundleReader / BundleWriter file format with what we're proposing is a worthy objective, I don't want to tie the two together this early. Because the workloads are quite different, it is still early days for knowing exactly what options and functionality we'll have to build in to get maximum throughput (e.g. we've needed compression so that we don't saturate network bandwidth too quickly, different threading implementations, etc.). So my proposal would be to keep them separate for the moment, gain some experience tuning this workload, and then pull some of the learnings back into BundleReader / BundleWriter.
@rohan100jain I completely agree with you. At the end of the day we will all be better off with a format that yields the highest performance possible, and we will definitely learn how to build one by experimenting with a new format built from scratch.
That being said, I would also like to keep improving tensor bundle, as the current on-disk cache uses it as its underlying storage. @karmel would you be interested in sponsoring an RFC on that? It could serve as a primary input source (TFIndexedRecord), accelerate the on-disk CacheDataset, and enhance ShuffleDataset with an external shuffle. We could focus on the API first. If Rohan and @frankchn design a more performant format, it could be used for these APIs as well.
Thanks, @byronyi -- @rohan100jain / @frankchn, what was the conclusion from the design review as to input format + readers?
@karmel -- We are going to re-investigate TFRecordReader performance and see if we can bring that up to equivalency with our implementation.
Do you envision this as a "sink" configuration in the API?
@katsiapis Sorry, can you clarify what sink you are referring to?
I guess Gus is referring to allowing users to customize the data format written, etc. via some kind of sink configuration. As of now, we don't plan to allow that level of customization, because right now it's good to have our own reader / writer that is optimized for performance. It's definitely possible that in the future we'll let users customize this, but I don't think that will happen in the near future. Will keep this in mind though.