RFC: tf.data Snapshot #193

Merged 11 commits into tensorflow:master on Feb 11, 2020

Conversation

@frankchn (Contributor) commented Jan 7, 2020:

Comment period is open till 1/20/2020.

tf.data Snapshot

Status: Accepted
RFC #: 193
Author(s): Frank Chen ([email protected]), Rohan Jain ([email protected])
Sponsor: Jiri Simsa ([email protected])
Updated: 2020-02-10

Objective

With ever-faster accelerators available in the cloud, and hyperparameter tuning consuming
larger chunks of accelerator time, TensorFlow users are increasingly finding that they
don't have enough CPU resources to keep up with these accelerators, leaving valuable
accelerator resources idle.

To alleviate this problem, we are proposing a snapshot API within tf.data,
to allow users to transparently persist the output of their preprocessing
pipeline to disk, and materialize the pre-processed data on a different training
run.

This API enables repeated preprocessing steps to be consolidated and already-processed
data to be re-used, trading disk storage and network bandwidth for more valuable CPU
resources and accelerator compute time.

The transformation is similar to the existing tf.data.Dataset.cache, with the key difference being that, unlike cache, snapshot is intended to be re-used across different executions of the same input pipeline.

@brijk7 brijk7 added the RFC: Proposed RFC Design Document label Jan 7, 2020
dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)
dataset = dataset.parallel_interleave(TFRecordDataset)
dataset = dataset.map(my_preprocessing_fn)
dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))
Contributor:

either dataset = dataset.apply(tf.data.experimental.snapshot(...)) or dataset = dataset.snapshot(...) (I would prefer the former until we provide a backwards compatibility guarantee for the API).

Comment:

Why apply --> snapshot? Is there ever a case where you would snapshot without the .apply call? If not, can we include the apply under the hood, and let the users just dataset.snapshot(...)?

Member:

I think right now .apply(...) is the only way to support an experimental tf.data transformation. So the plan would be to expose this as apply and then in a few months, we can promote this to a dataset.snapshot API.
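
For concreteness, a minimal sketch of the two spellings being discussed (at the time of this thread only the experimental form was planned; the core method is shown as a hypothetical future API):

import tensorflow as tf

dataset = tf.data.TFRecordDataset("/raw/data/file-00000")  # illustrative input file

# Experimental transformation, applied via Dataset.apply:
dataset = dataset.apply(tf.data.experimental.snapshot("/saved/data"))

# Possible future core method once the API is promoted:
# dataset = dataset.snapshot("/saved/data")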

@byronyi (Contributor) commented Jan 8, 2020:

Thanks Frank for this RFC. We observe similar performance benefits in caching preprocessed data in multi-epoch training.

Since we already have a ds.cache API that supports a file system interface, what is the unique value proposition of another ds.snapshot API?

EDIT: saw that in the doc; might be worth mentioning it in the PR summary.

@frankchn (Contributor, Author) commented Jan 8, 2020:

Replying to @byronyi's question above about the value of ds.snapshot versus ds.cache: thanks for your comments. I've added the motivation and the differences from ds.cache to the PR summary.

@AakashKumarNain (Member):

How is this different from processing the data, saving it in TFRecord format on disk, and reading it later?

@frankchn (Contributor, Author) commented Jan 8, 2020:

Thanks for your question! You can definitely replicate what we are doing here manually, by running an input pipeline and saving the processed data to other TFRecord files.

That said, there are a few advantages of having this op over a manual process:

First, snapshots can be written and read automatically with a single additional op at the end of the input pipeline, with no extra code needed to write or read them.

Second, tf.data snapshot can detect changes in the pipeline (e.g. changing JPEG image decode sizes or pointing to a new set of files) through input pipeline fingerprinting and automatically create new snapshot files with no further intervention. This minimizes potential errors when you are experimenting and changing input pipelines.

Third, the proposed file format we have is somewhat more efficient (both in terms of the file format itself, and the ability to enable gzip/snappy compression on the file level). We also try to optimize the read/write paths as much as possible, so at the limit we believe we are faster or more space-efficient (or both) than plain TFRecord reading/writing.
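
For contrast, a rough sketch of the manual route being described, writing preprocessed elements out to TFRecords by hand (the feature names and paths here are made up for illustration):

import tensorflow as tf

def write_preprocessed(dataset, path):
    # Manually persist an already-preprocessed (image, label) dataset to a TFRecord file.
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in dataset:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[tf.io.serialize_tensor(image).numpy()])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

# Later runs must re-implement the matching parsing logic by hand, and nothing
# detects whether the preprocessing that produced the file has since changed --
# the two gaps that snapshot's single op plus fingerprinting are meant to close.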

@AakashKumarNain (Member):

Thanks @frankchn for the detailed answer. I really liked this point

Second, tf.data snapshot can detect changes in the pipeline (e.g. changing JPEG image decode sizes or pointing to a new set of files) through input pipeline fingerprinting and automatically create new snapshot files with no further intervention. This minimizes potential errors when you are experimenting and changing input pipelines.

@katsiapis (Member) left a comment:

Thanks, had a few questions and suggestions.


2. `compression`: Optional. The type of compression to apply to the snapshot
written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
None.
Member:

Should it default to AUTO? Note that suffixes can also be added to the produced files to make the encoding clear to the user (and to tools), e.g. .gz.

Contributor (Author):

Good point. We can have the default option be AUTO here and elsewhere, and gradually add tf.data autotuning to snapshots afterwards (autotuning currently starts several versions of the pipeline, which is not ideal given that we are writing to the file system).
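
For concreteness, a minimal sketch of the compression knob under discussion (assuming the parameter keeps the names proposed in the RFC, with AUTO as the default being suggested here; the range/map pipeline is purely illustrative):

import tensorflow as tf

dataset = tf.data.Dataset.range(1_000).map(lambda x: x * 2)

# Explicit GZIP compression of the persisted snapshot files; leaving the
# argument at its proposed AUTO default would let tf.data choose.
dataset = dataset.apply(
    tf.data.experimental.snapshot("/saved/data", compression="GZIP"))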

are read from the snapshot are different from the order they're written.

8. `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
reading from the snapshot. Defaults to 1. Increasing this might improve
Member:

Should it default to AUTO? Also, not sure what "1" means here.

Similarly for writer_buffer_size below.

Contributor (Author):

Changed

up the buffer before writing them out using `num_writer_threads`.

11. `shuffle_on_read`: Optional. If this is True, then the order in which
examples are produced when reading from a snapshot will be random. Defaults
Member:

random or arbitrary? Is this a local shuffle, a global shuffle, something else?

Contributor (Author):

Clarified. The behavior emulates Dataset.list_files(shuffle=True) by randomizing the order in which the snapshot files are read back.


2. In case there are multiple workers and the dataset is sharded across
workers, we assume that the number of workers remains the same from one run
to another. If the number changes, we’ll trigger another snapshot.
Member:

I guess this also affects seed semantics?

@byronyi (Contributor), Jan 9, 2020:

I would suggest reconsidering this. Elastic training reduces resource fragmentation and improves resource utilization in general. Sometimes the number of workers, along with learning rate, global batch size, etc., is an important hyperparameter to tune in its own right.

What about a collective snapshot design in the context of a distributed file system? Would that be enough to allow a simpler implementation without this assumption?

Contributor (Author):

@byronyi The current tf.data input pipeline design doesn't really allow us to coordinate between workers (e.g. there may be failure cases where K workers out of N fail and the snapshot slice is invalid).

That said, I totally understand where you are coming from, and we are planning a data pipeline service (with which snapshot can be integrated) so that everyone reads elements off a single pipeline, making different numbers of trainers less of an issue. There should be a design doc RFC on that in the next few weeks, so look out for it.

training run directory.

For the current implementation, we will store the data in chunked TFRecord
files. Eventually we may move to other more higher performance data stores or
Member:

Do you envision this as a "sink" configuration in the API?

Contributor (Author):

@katsiapis Sorry, can you clarify what sink you are referring to?

Member:

I guess Gus is referring to allowing users to customize the written data format etc. via some kind of sink configuration. As of now, we don't plan to allow that level of customization, because it's valuable to have our own reader/writer that is optimized for performance. It's definitely possible that we allow users to customize this in the future, but I don't think that will happen soon. Will keep this in mind though.

1. If the snapshot directory contains a `metadata` file, we will read the
metadata file.

1. The metadata file contains the following fields:
Member:

Does this suggest that the metadata file needs to be modifiable after the initial write? Any concurrency implications? Does it make sense to have an approach that works with immutable file artifacts?

Some enhancements that were recently made to TensorFlow Transform might be of interest here (especially the logic and rationale behind them, if not the implementation).

Contributor (Author):

An immutable metadata file definitely makes sense. We will introduce a metadata.final file to indicate when a snapshot has finished writing.


shuffle_on_read=None,
shuffle_seed=None,
mode=None,
snapshot_name=None):
Comment:

This is a lot of args, especially to start off with. Is this perhaps a sign that we should be thinking of a different design pattern here?

Member:

Agreed! How about a SnapshotConfig / Options object? Or did you have something else in mind?

Contributor (Author):

Yeah, I don't particularly like how many args there are here either. Perhaps reader/writer threads and buffer sizes can be collapsed into just threads and buffer (cc @rohan100jain)

Comment:

Is a config object better? Or just another level of indirection? I feel like users are fatigued by all our TF config objects. One option is to start with fewer, and see which knobs users actually ask for-- because this is experimental, we can easily add. Are there any of these that can be made subsequent API calls? Eg, a set_options or set_buffer_size? I obviously don't have a good solution here-- only problems ;)

Member:

Would AUTO as the default help here? Based on our experience, I think we can come up with reasonable defaults, reducing the cognitive overhead users would otherwise face in figuring out what to set.

We can also remove the buffer_size option and just set it to 2 * num_threads.

Contributor (Author):

After discussion, going to remove the buffer_size option.

@byronyi (Contributor), Jan 11, 2020:

Regarding removing the buffer_size option and setting it to 2 * num_threads: would you mind leaving it in? Buffer size is an important parameter to tune for a distributed FS such as S3 or HDFS. The appropriate buffer size is the bandwidth-delay product, not 2 * num_threads. This option is relevant if we want to maximize read/write throughput to underlying storage with both high latency and high available bandwidth.

Contributor (Author):

The buffer size configuration here is for the reading and writing thread buffers (basically the same effect as Dataset.prefetch(1)) rather than file system buffers. For the buffers you are talking about, we offer configurations at the file system level (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc#L67) that would be more general.

Contributor:

Thanks for the clarification. I am just wondering where I can pass file system buffer configuration with the proposed snapshot API (and, in general, with APIs that interact with external storage, such as cache).


11. `shuffle_on_read`: Optional. If this is True, then the order in which
examples are produced when reading from a snapshot will be random. Defaults
to False.
Comment:

Why is this an arg versus a call to dataset.shuffle() after the call to tf.data.snapshot? Coupling this + the following kwarg into the API spec here seems unwieldy.

Contributor (Author):

This emulates the behavior of Dataset.list_files(shuffle=True) and is orthogonal to using dataset.shuffle() after the snapshot op. Clarified the behavior of this flag.
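
To illustrate the distinction with the analogy used above (a sketch with illustrative paths; shuffle_on_read itself was later subsumed by a user-supplied reader_fn, so treat the flag as the RFC-era spelling):

import tensorflow as tf

# Analogy: Dataset.list_files(shuffle=True) randomizes *file* order only,
# not the order of records within each file. shuffle_on_read plays the same
# role for snapshot shard files on the read path.
files = tf.data.Dataset.list_files("/raw/data/*.tfrecord", shuffle=True)
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)

# An element-level shuffle after the snapshot op remains a separate,
# orthogonal step if records should be mixed across file boundaries:
dataset = dataset.shuffle(buffer_size=10_000)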

The downside here would be that the user would have to split the preprocessing
and training into potentially different files, and users would be forced to
select whether to train or preprocess on their own, which is not good.

Comment:

I feel like save/load fits better into existing TF paradigms, and makes the behavior more explicit. Looking at the pipeline above, it's hard to tell what is going to be run on an arbitrary run of the pipeline. You could imagine an individual pipeline doing a load first, then if no file to load, do the preproc/save. That's explicit and doesn't require two separate runners.

Member:

The issue with that is that the logic determining whether we want to load or not is not as simple as whether a file exists (the control aspect of the op). Following your train of thought, I think the better alternative would be splitting it up into reader, writer, and control as separate ops (which, as I indicated above, should be the implementation for the next iteration of the op).

Contributor (Author):

I think this design was chosen so that users don't have to change their existing pipelines much at all, and can benefit just by dropping snapshot in.

With save() and load(), users would have to manage saving and loading themselves, which potentially introduces errors (e.g. users may accidentally load an old version of a snapshot without realizing it).

That said, perhaps we can re-use the C++ code that implements this to build more generic Dataset.save and Dataset.load ops for users who want that sort of control?
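
For comparison, a sketch of what the explicit save/load pattern under discussion might look like. Dataset.save and Dataset.load are the spellings floated in this thread (they did not exist at the time of the RFC and only landed in TensorFlow later), and the paths and preprocessing are placeholders:

import os
import tensorflow as tf

SNAPSHOT_DIR = "/saved/data"  # illustrative location

def my_preprocessing_fn(record):
    # Placeholder for the user's real preprocessing.
    return record

def build_pipeline():
    ds = tf.data.Dataset.list_files("/raw/data/*")
    ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4)
    return ds.map(my_preprocessing_fn)

# Explicit control: the user, not the op, decides whether to preprocess or to
# read back -- exactly the trade-off being debated in this thread.
if os.path.isdir(SNAPSHOT_DIR):
    dataset = tf.data.Dataset.load(SNAPSHOT_DIR)
else:
    dataset = build_pipeline()
    dataset.save(SNAPSHOT_DIR)
    dataset = tf.data.Dataset.load(SNAPSHOT_DIR)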

Contributor (Author):

Agree with @rohan100jain that for the next internal implementation we can expose save/load/control where appropriate while retaining this python interface for existing users who might want a simple solution.

Contributor:

We should consider that the semantics of save/load are much easier to explain to people, especially when you consider the interaction of shuffle_on_read with other options earlier in the input pipeline.

Member:

During earlier iterations of developing this, we considered save/load style APIs, but making that usable is quite challenging (i.e. figuring out whether to save a new snapshot or to load an existing one). The current API serves a very concrete use case (for some significant internal users) and we feel it makes sense to expose this API to serve that use case.

Your concerns about shuffle_on_read etc. are valid, and we'll address them by allowing users to specify a reader_fn.

Contributor (Author):

@alextp @rohan100jain I have updated the design doc to add a new reader_fn and remove existing reader parameters. PTAL, thanks!


2. In case there are multiple workers and the dataset is sharded across
workers, we assume that the number of workers remains the same from one run
to another. If the number changes, we’ll trigger another snapshot.
Comment:

How will we trigger another snapshot exactly? Does that imply attempting a read of the data, failing, and instead re-snapshotting?

Contributor (Author):

If the number of workers changes, then (at least in the case of sharded TPU training) the num_shards parameter to Dataset.shard will change too. This results in a different graph fingerprint for each worker, and we will go into WRITE mode automatically. If there is no sharding and all the workers use exactly the same input pipeline, then this won't happen. Will add something here to clarify.
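
A small sketch of why the fingerprint changes (num_workers and worker_index are placeholders; the fingerprint itself is computed internally from the serialized dataset graph):

import tensorflow as tf

def pipeline(num_workers, worker_index):
    ds = tf.data.Dataset.list_files("/raw/data/*")
    # num_workers is baked into the dataset graph as a constant, so running
    # with a different worker count produces a different graph and therefore
    # a different snapshot fingerprint, triggering WRITE mode again.
    ds = ds.shard(num_workers, worker_index)
    return ds.interleave(tf.data.TFRecordDataset, cycle_length=4)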

* 0000001.snapshot

The `fingerprint` is a hash of the input processing graph. The `run-id` is
unique training run ID generated.
Comment:

How does this align/not align with existing ckpt + saved model semantics? Is there more we can reuse here?

Contributor (Author):

I don't think SavedModel supports comparisons between graphs (or in this case, parts of a graph), which is our main use case here. In general, I am not sure TensorFlow can guarantee to produce the exact same graph (i.e. with the same node names, function names, etc...), so we are utilizing the HashGraph function (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/dataset_utils.cc#L733), which computes a fingerprint while ignoring node names and other things that may vary run by run.

Comment:

That is true, though I meant the file structure itself and the serialized format, not strictly the fingerprint.

Member:

That's a good point, Karmel, and we considered this while first developing. Here is why we ended up choosing something of our own:

The checkpointing API is for saving the state of the model, whereas here we're trying to store the entire output of an input pipeline, so API-wise there isn't much alignment.

On the implementation side, checkpointing uses BundleReader / BundleWriter, which are designed for key/value accesses to tensors, whereas here we want to sequentially read off tensors as soon as possible. We tried Bundle Reader / Writer at first but abandoned it for performance reasons.

Comment:

At the risk of sacrilege-- can we improve the performance of Bundle Reader / Writer instead of just building a new way of doing this?

@byronyi (Contributor), Jan 16, 2020:

Update: with a little bit of tweaking I managed to get the performance of tensor bundle from 4.5k images/s to 7.2k images/s using 128 threads. Writes are scattered across multiple shards, so it wouldn't be too hard to get maximal performance. So far the only constraint seems to be my 10GbE NIC, and I plan to do another benchmark on a machine with a 25GbE network.

Member:

@karmel While I think unifying the Bundle Reader / Writer file format with what we're proposing is a worthy objective, I don't want to tie the two together this early. Since the workloads are quite different, it is still early days in knowing exactly what options and functionality we'll have to build in to get maximum throughput (e.g. we've needed compression so that we don't saturate network bandwidth too quickly, different threading implementations, etc.). So my proposal would be to keep them separate for the moment, gain some experience tuning this workload, and then pull some of the learnings into Bundle Reader / Writer.

@byronyi (Contributor), Jan 21, 2020:

@rohan100jain I completely agree with you. At the end of the day we will all be better off with a format that yields the highest possible performance, and we will definitely learn how to build one by experimenting with a new format built from scratch.

That being said, I would also like to keep improving tensor bundle, as the current on-disk cache uses it as the underlying storage. @karmel, would you be interested in sponsoring an RFC on that? It could serve as a primary input source (TFIndexedRecord), accelerate the on-disk CacheDataset, and enhance ShuffleDataset with an external shuffle. We could focus on the API first. If Rohan and @frankchn design a more performant format, it could be used for these APIs as well.

Comment:

Thanks, @byronyi -- @rohan100jain / @frankchn , what was the conclusion from the design review as to input format + readers?

Contributor (Author):

@karmel -- We are going to re-investigate TFRecordReader performance and see if we can bring that up to equivalency with our implementation.

As we can see, the end user simply has to add this transformation in order to
use this functionality. In essence, the transformation is similar to the
existing `tf.data.Dataset.cache`, with the key difference being that, unlike
`cache`, `snapshot` is intended to be re-used across different executions of the
Contributor:

Does cache provide any advantages over snapshot?

Contributor (Author):

Cache can be in memory, which is useful for testing and small datasets.

@byronyi (Contributor), Jan 11, 2020:

Cache can be on disk as well. Its lifetime is more confined (to a specific job), so there are no worries about crash-safety semantics. The lifetime of snapshots could be unbounded, and I guess that's why we need to expire them explicitly.
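
For reference, a minimal sketch of the two cache modes being contrasted here (the pipeline and path are purely illustrative):

import tensorflow as tf

dataset = tf.data.Dataset.range(1_000).map(lambda x: x * 2)

# In-memory cache: cheap and handy for tests and small datasets,
# but rebuilt every run and limited by RAM.
in_memory = dataset.cache()

# On-disk cache: persists files for this particular job, but is not designed
# to be shared across different executions the way snapshot is.
on_disk = dataset.cache("/tmp/my_cache")  # illustrative path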

reader_buffer_size=None,
num_writer_threads=None,
writer_buffer_size=None,
shuffle_on_read=None,
Contributor:

Why does snapshotting include its own shuffling?

How does the snapshot shuffling interact with the shuffling performed in existing tf.data transformations like shuffle?

The way these options are presented now means you get different behavior when loading from a snapshot than when not loading from one, which means the existing shuffling behavior is not guaranteed even without shuffle_on_read (as reading from a snapshot cannot simulate shuffling files).


@frankchn (Contributor, Author):

Hi everyone,

Thank you for commenting on this design review. We recently completed an internal round of design review meetings for this feature, and we have taken many of the comments into account and revised the design document.

Specifically, the largest changes we have made are (1) consolidating the various options into a user-specifiable reader_fn and writer_fn, where users can customize how the dataset is read and written, and (2) removing snapshot_name and snapshot_mode, and instead implementing with_snapshot_fingerprint for users to override fingerprinting on parts of their input pipeline graph. A sketch of the revised surface is shown below.
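
The following is only a sketch of the revised surface: the RFC calls the option reader_fn, while the version that later shipped as tf.data.experimental.snapshot spells the equivalent parameter reader_func, so treat exact names and signatures as provisional. The reader function receives a dataset of per-shard datasets and returns a dataset of elements; the input pipeline and paths are placeholders.

import tensorflow as tf

def reader_func(shard_datasets):
    # Shuffling the dataset of shard datasets randomizes shard order
    # (replacing the old shuffle_on_read flag); the interleave controls
    # read parallelism.
    shard_datasets = shard_datasets.shuffle(buffer_size=16)
    return shard_datasets.interleave(
        lambda shard: shard,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

dataset = tf.data.Dataset.list_files("/raw/data/*")
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
dataset = dataset.map(lambda record: record)  # stand-in for real preprocessing
dataset = dataset.apply(
    tf.data.experimental.snapshot("/saved/data", reader_func=reader_func))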

We appreciate all your comments and are looking forward to implementing this feature soon!

Thanks,

Frank

@ematejska ematejska requested review from ematejska and removed request for brijk7 February 11, 2020 00:43
@ematejska ematejska merged commit bc6bce3 into tensorflow:master Feb 11, 2020
@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Feb 11, 2020
@byronyi (Contributor) commented Feb 11, 2020:

Congratulations! Hope this makes it into 2.2.

@markemus:

I wrote a simple automatic serialization module a while back that generates a "header" file from a dataset of tensors. The header file stores the datatypes and shapes of each tensor, and there's a function to restore the tensors using the TFRecord and header files. Would that be a useful addition to Snapshot? I'm unclear about whether the current implementation supports this header feature already or not.

https://gist.github.com/markemus/74ba47d0b58f91d7aa7885341ed3b1b8

Obviously it would be ideal if the header could be added directly into the TFRecords file, but it's currently stored separately for compatibility reasons.

@frankchn (Contributor, Author):

Thanks @markemus! The implementation of Snapshot has something similar, but in C++ for higher performance. Save and load is something we are actively considering for tf.data, and we are thinking through ways of saving and loading the dtypes safely and efficiently too.

@markemus:

@frankchn cool! I'm really looking forward to snapshot, this is badly needed and it's great that you guys are working on this. If any of super_serial is useful it's all yours.

@byronyi byronyi mentioned this pull request Apr 21, 2020
@byronyi (Contributor) commented Jun 18, 2020:

@frankchn @aaudiber @rohan100jain I have a question about using snapshot in conjunction with the tf.data service. Would it be feasible to shard the snapshot across multiple tf.data service workers?

@aaudiber (Contributor):

@byronyi Absolutely! We plan to integrate snapshotting and tf.data service so that the tf.data service speeds up the first (normally slow) epoch. The idea is to temporarily use the tf.data service to write the snapshot quickly, then shut down the service (saving resource costs), and read from the snapshot for the remaining epochs.
