
TFDV on Dataflow getting OOMs frequently #190

Open
cyc opened this issue Aug 12, 2021 · 12 comments

Comments

@cyc
Contributor

cyc commented Aug 12, 2021

I am using TFDV 1.2.0 and have a problem where workers are consistently OOMing on Dataflow, even with very large instance types (e.g. n2-highmem-16 and n2-highmem-32). I've tried decreasing --number_of_worker_harness_threads to half the available number of CPUs in hopes of reducing memory usage, to no avail. The OOMs typically happen at the very end of the job, when it has finished loading the data but is still combining the stats.

The dataset itself is quite large, with 4000-5000 features spanning ~1e9 rows (and some varlen features are quite long). Do you have any tips for decreasing memory usage on the workers? I suspect the issue may come from some high-cardinality (~1e8) string features we have, since computing frequency counts for these features is probably very memory-intensive.

@andylou2

Hi Chris!

Do you have the Beam stage names where the OOM is occurring?

During the combine phase (when calculating statistics), we have a limit on the size of the record batch (which is what each Beam worker stores in memory) [1]. Would tweaking this value help?

We also have an experimental StatsOption: experimental_use_sketch_based_topk_uniques [2]. This might help because it makes top_k a mergeable statistic with a compact property (i.e. Beam may be able to compress the in-memory representation when it chooses to).

[1] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_impl.py#L602-L608

[2] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_options.py#L141-L142
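
Roughly, enabling [2] looks like the sketch below (written against the TFDV 1.x public API; the data path is a placeholder). The threshold in [1], by contrast, is a module-level constant in stats_impl.py rather than a StatsOptions field, so tweaking it means patching TFDV itself.

```python
import tensorflow_data_validation as tfdv

# Sketch: enable the experimental sketch-based top-k/uniques generators.
stats_options = tfdv.StatsOptions(
    experimental_use_sketch_based_topk_uniques=True,
)

# generate_statistics_from_tfrecord runs a Beam pipeline; pass Dataflow
# pipeline options to run it remotely instead of on the local runner.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train-*.tfrecord',  # placeholder path
    stats_options=stats_options,
)
```
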

@cyc
Contributor Author

cyc commented Aug 12, 2021

Hi @andylou2 I believe it is RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators/CombinePerKey.

I did try using experimental_use_sketch_based_topk_uniques but it still ran out of memory unfortunately. I will try running it again with some of the high-cardinality string features removed from the dataset and see if that fixes the issue; if so, that may be the root cause.

I am not sure whether changing the rb size threshold would help. Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them? Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?
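
For reference, the input batch size I'm describing is the one set through StatsOptions; a minimal sketch (the value is a placeholder, not a recommendation):

```python
import tensorflow_data_validation as tfdv

# Sketch: shrink the input batch size to reduce per-batch memory. The default
# is tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE.
stats_options = tfdv.StatsOptions(
    desired_batch_size=1024,  # placeholder value
)
```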

@andylou2

andylou2 commented Aug 12, 2021

> Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them?

When _MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD is reached, the Beam worker will compute the stats on the current batch (even if the desired batch size hasn't been reached) and subsequently clear the record batches from RAM.

> Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?

It probably isn't. But I do believe each instance can have multiple Beam workers, and this threshold only applies per worker.

I don't know as much about Dataflow as I'd like, but I think --number_of_worker_harness_threads may be the wrong flag. It looks like --max_num_workers may be the correct one: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling. Although the note there says that autoscaling is based on the number of threads, so maybe it is not so helpful.
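
For completeness, a sketch of how those knobs are typically passed when running TFDV on Dataflow (project, region, bucket, and all values are placeholders, not recommendations):

```python
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# All values are placeholders; --max_num_workers caps autoscaling, while
# --number_of_worker_harness_threads limits threads per worker process.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--machine_type=n2-highmem-16',
    '--max_num_workers=64',
    '--number_of_worker_harness_threads=8',
])

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train-*.tfrecord',
    output_path='gs://my-bucket/stats/train_stats',
    pipeline_options=pipeline_options,
)
```
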

@cyc
Contributor Author

cyc commented Aug 13, 2021

I've tried a number of different things to fix this issue on my end, but haven't had much success. The only thing that consistently works is setting StatsOptions.sample_rate=0.001, which seems pretty extreme.

The problem actually seems to be in BasicStatsGenerator, not TopKUniques. If I set StatsOptions.add_default_generators=False and then manually add just BasicStatsGenerator to the list of generators, excluding TopK, I still get OOM (even on an n1-highmem-64, which has something like 416GB of RAM). The OOM occurs after all the data has been loaded and processed, while the resulting stats are being combined by just a few workers.
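
Roughly the configuration I tested, as a sketch (BasicStatsGenerator comes from TFDV's internal generators package, so treat the import path as an assumption about the 1.x layout rather than public API):

```python
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics.generators import basic_stats_generator

# Sketch of the experiment: skip the default generators (no top-k/uniques)
# and run only the basic stats generator.
stats_options = tfdv.StatsOptions(
    add_default_generators=False,
    generators=[basic_stats_generator.BasicStatsGenerator()],
    # sample_rate=0.001,  # the only setting that has reliably avoided the OOM
)
```
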

I will open a GCP support ticket and reference the job ids of the failed dataflow runs in question since that might help debug it a little better.

@cyc
Contributor Author

cyc commented Aug 13, 2021

Support case number is 28706271 and includes some more details; I can cc people on the ticket if need be.

@cyc
Contributor Author

cyc commented Sep 1, 2021

Can you clarify whether there are any workarounds for this issue? We are pretty consistently seeing this OOM with 1.2.0. The only solution seems to be to set StatsOptions.sample_rate to a very small value, such that the num_instances counter is below 1e6. Given that I can probably load 1e6 rows entirely into memory on an n1-highmem-16, it seems a bit strange that the memory overhead for computing/combining stats would be so high.

@entrpn

entrpn commented Sep 17, 2021

I was having this same issue with my datasets and opened a GCP ticket and everything. Eventually it came down to my instances not being large enough, even though I thought they were. After experimenting I ended up on n2-highmem-8. I wonder whether, with how many features you have, you are running into the same thing.

You said you tried n2-highmem-16. As a test, have you tried spinning up much larger instances just to see if they can get through your dataset? Something exaggerated like n2-highmem-48.

@rcrowe-google

@cyc - TFDV 1.4.0 was released last week, and we expect that it will reduce the number of OOM errors. Has anyone tried it yet?

@cyc
Contributor Author

cyc commented Nov 8, 2021

Hi @rcrowe-google, thanks for the heads up. Could you point to the specific changes or commits that should help with OOM errors? Or let me know if there are any particular settings I should try adding, such as the experimental sketch-based generators? I can experiment with upgrading to 1.4.0 later this week.

@cyc
Contributor Author

cyc commented Nov 10, 2021

I tried TFDV 1.4.0 and compared it to 1.2.0 on a relatively small dataset (~100M rows, ~5K columns), and still ran into OOM issues on Dataflow with n2-highmem-64 machine types for both.

@rcrowe-google

@cyc - Thanks for taking the time to test that, and I'm very disappointed that it didn't work. I think what we need most at this point is code and data that we can use to reproduce the exact problem. Can you share that either here or in the GCP case?

@zexuan-zhou

zexuan-zhou commented Aug 21, 2024

Hi. Is there any update on this? I'm running into the exact same OOM issue. I have enabled enable_dynamic_thread_scaling. The failure is:
Workflow failed. Causes: S10:GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/GroupByKey/Read+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/Combine+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PreCombineFn)/Combine/Extract+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/Map(StripNonce)+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/WindowIntoOriginal+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/Flatten/OutputIdentity+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine/Partial+GenerateStatistics(TRAINING)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
TFDV version is 1.14

I suspect this could be due to a hot-key issue. Is there any filter we can apply to filter out hot keys during GenerateStatistics?
