
TFDV on Dataflow getting OOMs frequently #190

Open
cyc opened this issue Aug 12, 2021 · 12 comments

Comments

@cyc
Contributor

cyc commented Aug 12, 2021

I am using TFDV 1.2.0 and have a problem where workers are consistently OOMing on Dataflow, even with very large instance types (e.g. n2-highmem-16 and n2-highmem-32). I've tried decreasing --number_of_worker_harness_threads to half the available number of CPUs in hopes of reducing memory usage, to no avail. The OOMs typically happen at the very end of the job, when it has finished loading the data but is still combining the stats.

The dataset itself is quite large, with 4000-5000 features spanning ~1e9 rows (and some varlen features are quite long). Do you have any tips for decreasing memory usage on the workers? I suspect the issue may come from some high-cardinality (~1e8) string features we have, since computing frequency counts for these features is probably very memory-intensive.

@andylou2

Hi Chris!

Do you have the Beam stage names where the OOM is occurring?

During the combine phase (when calculating statistics), we have a limit on the size of the record batch (which is what each Beam worker stores in memory) [1]. Would tweaking this value help?

We also have an experimental StatsOption: experimental_use_sketch_based_topk_uniques [2]. This might help because it makes top_k a mergeable statistic with a compact property (i.e. Beam may be able to compress the in-memory representation when it chooses to).

[1] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_impl.py#L602-L608

[2] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_options.py#L141-L142
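
Roughly, enabling [2] looks like the sketch below (written against the TFDV 1.x public API; the data path is a placeholder). The threshold in [1], by contrast, is a module-level constant in stats_impl.py rather than a StatsOptions field, so tweaking it means patching TFDV itself.

```python
import tensorflow_data_validation as tfdv

# Sketch: enable the experimental sketch-based top-k/uniques generators.
stats_options = tfdv.StatsOptions(
    experimental_use_sketch_based_topk_uniques=True,
)

# generate_statistics_from_tfrecord runs a Beam pipeline; pass Dataflow
# pipeline options to run it remotely instead of on the local runner.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train-*.tfrecord',  # placeholder path
    stats_options=stats_options,
)
```
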

@cyc
Contributor Author

cyc commented Aug 12, 2021

Hi @andylou2 I believe it is RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators/CombinePerKey.

I did try using experimental_use_sketch_based_topk_uniques but it still ran out of memory unfortunately. I will try running it again with some of the high-cardinality string features removed from the dataset and see if that fixes the issue; if so, that may be the root cause.

I am not sure whether changing the rb size threshold would help. Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them? Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?
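
For reference, the input batch size I'm describing is the one set through StatsOptions; a minimal sketch (the value is a placeholder, not a recommendation):

```python
import tensorflow_data_validation as tfdv

# Sketch: shrink the input batch size to reduce per-batch memory. The default
# is tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE.
stats_options = tfdv.StatsOptions(
    desired_batch_size=1024,  # placeholder value
)
```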

@andylou2

andylou2 commented Aug 12, 2021

> Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them?

When _MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD is reached, the Beam worker will compute the stats on the current batch (even if the desired batch size hasn't been reached) and subsequently clear the record batches from RAM.

> Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?

It probably isn't. But I do believe each instance can have multiple Beam workers, and this threshold only applies per worker.

I don't know as much about Dataflow as I'd like, but I think --number_of_worker_harness_threads may be the wrong flag. It looks like --max_num_workers may be the correct one: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling. Although the note there says that autoscaling is based on the number of threads, so maybe it is not so helpful.
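
For completeness, a sketch of how those knobs are typically passed when running TFDV on Dataflow (project, region, bucket, and all values are placeholders, not recommendations):

```python
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# All values are placeholders; --max_num_workers caps autoscaling, while
# --number_of_worker_harness_threads limits threads per worker process.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--machine_type=n2-highmem-16',
    '--max_num_workers=64',
    '--number_of_worker_harness_threads=8',
])

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train-*.tfrecord',
    output_path='gs://my-bucket/stats/train_stats',
    pipeline_options=pipeline_options,
)
```
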

@cyc
Contributor Author

cyc commented Aug 13, 2021

I've tried a number of different things to fix this issue on my end, but haven't had much success. The only thing that consistently works is setting StatsOptions.sample_rate=0.001, which seems pretty extreme.

The problem actually seems to be in BasicStatsGenerator, not TopKUniques. If I set StatsOptions.add_default_generators=False and then manually add just BasicStatsGenerator to the list of generators, excluding TopK, I still get OOM (even on an n1-highmem-64, which has something like 416GB of RAM). The OOM occurs after all the data has been loaded and processed, while the resulting stats are being combined by just a few workers.
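
Roughly the configuration I tested, as a sketch (BasicStatsGenerator comes from TFDV's internal generators package, so treat the import path as an assumption about the 1.x layout rather than public API):

```python
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics.generators import basic_stats_generator

# Sketch of the experiment: skip the default generators (no top-k/uniques)
# and run only the basic stats generator.
stats_options = tfdv.StatsOptions(
    add_default_generators=False,
    generators=[basic_stats_generator.BasicStatsGenerator()],
    # sample_rate=0.001,  # the only setting that has reliably avoided the OOM
)
```
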

I will open a GCP support ticket and reference the job ids of the failed dataflow runs in question since that might help debug it a little better.

@cyc
Contributor Author

cyc commented Aug 13, 2021

Support case number is 28706271 and includes some more details; I can cc people on the ticket if need be.

@cyc
Contributor Author

cyc commented Sep 1, 2021

Can you clarify whether there are any workarounds for this issue? We are pretty consistently seeing this OOM with 1.2.0. The only solution seems to be to set StatsOptions.sample_rate to a very small value, such that the num_instances counter is below 1e6. Given that I can probably load 1e6 rows entirely into memory on an n1-highmem-16, it seems a bit strange that the memory overhead for computing/combining stats would be so high.

@entrpn

entrpn commented Sep 17, 2021

I was having this same issue with my datasets and opened a GCP ticket and everything. Eventually it came down to my instances not being large enough, even though I thought they were. After experimenting I ended up on n2-highmem-8. I wonder whether, with how many features you have, you are running into the same thing.

You said you tried n2-highmem-16. As a test, have you tried spinning up much larger instances just to see if they can get through your dataset? Something exaggerated like n2-highmem-48.

@rcrowe-google

@cyc - TFDV 1.4.0 was released last week, and we expect that it will reduce the number of OOM errors. Has anyone tried it yet?

@cyc
Contributor Author

cyc commented Nov 8, 2021

Hi @rcrowe-google, thanks for the heads up. Could you point to the specific changes or commits that should help with OOM errors? Or let me know if there are any particular settings I should try adding, such as the experimental sketch-based generators? I can experiment with upgrading to 1.4.0 later this week.

@cyc
Contributor Author

cyc commented Nov 10, 2021

I tried TFDV 1.4.0 and compared it to 1.2.0 on a relatively small dataset (~100M rows, ~5K columns), and still ran into OOM issues on Dataflow with n2-highmem-64 machine types for both.

@rcrowe-google

@cyc - Thanks for taking the time to test that, and I'm very disappointed that it didn't work. I think what we need most at this point is code and data that we can use to reproduce the exact problem. Can you share that either here or in the GCP case?

@zexuan-zhou

zexuan-zhou commented Aug 21, 2024

Hi. Is there any update on this? I'm running into the exact same OOM issue. I have enabled enable_dynamic_thread_scaling. The failure is:
Workflow failed. Causes: S10:GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/GroupByKey/Read+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/Combine+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/Combine/Extract+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/Map(StripNonce)+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/WindowIntoOriginal+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/Flatten/OutputIdentity+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine/Partial+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
TFDV version is 1.14

I suspect this could be due to a hot-key issue. Is there any filter we can apply to filter out hot keys during GenerateStatistics?
