Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom fanout to CacheableCombineAccumulate.InitialCombineGlobally #183

Open
cyc opened this issue Jun 29, 2020 · 1 comment
Open

Add custom fanout to CacheableCombineAccumulate.InitialCombineGlobally #183

cyc opened this issue Jun 29, 2020 · 1 comment

Comments

@cyc
Copy link

cyc commented Jun 29, 2020

Recently I have been running some benchmarks on TFT using a variety of different datasets. One thing I have noticed is that CacheableCombineAccumulate seems to have a bit of a scaling issue, in particular with tft.bucketize and the QuantilesCombiner.

As a concrete example, I ran an experiment with a dataset of 20 billion rows and around 1000 dense scalar columns. I added just one analyzer: tft.bucketize(num_buckets=100, elementwise=True) and applied it to the 1000 columns concatenated together.

Dataflow job ID: 2020-06-12_11_25_12-8283309558685381372
Elapsed time: 21 hr 57 min
Total vCPU time: 11,455.279 vCPU hr

Around 19 of the 22 wall-clock hours were spent on the InitialCombineGlobally step.

To verify that this was a bottleneck on InitialCombineGlobally I forked _IntermediateAccumulateCombineImpl and added .with_fanout(1000) to beam.CombineGlobally.

Dataflow job ID: 2020-06-22_13_45_28-14720730178676069772
Elapsed time: 2 hr 43 min
Total vCPU time: 6,638.944 vCPU hr

It would be nice if we could annotate our analyzers/combiners with the desired fanout size so that we have control over this issue should it arise. Also, while the above experiment seems like a rather perversely-chosen dataset and transformation combination that is unlikely to exist in practice, this workflow is somewhat similar to certain real-world problems that I am working with so this would be very helpful to have fixed.

@zoyahav
Copy link
Member

zoyahav commented Oct 9, 2020

We've since added fanout in TFT 0.24, could you let us know if the issue has been resolved with this version?
https://github.com/tensorflow/transform/releases/tag/v0.24.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants