TFDV on Dataflow getting OOMs frequently #190
Comments
Hi Chris! Do you have the Beam stage names where the OOM is occurring? During the combine phase (when calculating statistics), we have a limit on the size of the record batch (which is what is stored in memory by each Beam worker) [1]. Would tweaking this value help? We also have an experimental StatsOption:
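For reference, a minimal sketch of how a smaller batch size can be passed through `StatsOptions`, assuming `desired_batch_size` is the relevant public knob for the internal record-batch threshold mentioned above (the data path is a placeholder):

```python
import tensorflow_data_validation as tfdv

# Sketch: ask TFDV to materialize smaller record batches per worker.
# Whether desired_batch_size maps onto the internal record-batch size
# threshold mentioned above is an assumption; the path is a placeholder.
stats_options = tfdv.StatsOptions(desired_batch_size=1024)

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/data/*.tfrecord',
    stats_options=stats_options,
)
```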
Hi @andylou2, I believe it is

I did try using

I am not sure whether changing the record batch size threshold would help. Could you clarify what this does exactly? When we load the RecordBatch from the dataset it comes in a specific batch size (typically
It probably isn't. But I do believe each instance can have multiple Beam workers, and this threshold only applies per worker. I don't know as much about Dataflow as I wish I did, but I think
I've tried a bunch of different things to fix this issue on my end, but haven't had much success. The only thing that consistently seems to work is to set

The problem actually seems to be in BasicStatsGenerator, not TopKUniques. If I set

I will open a GCP support ticket and reference the job IDs of the failed Dataflow runs in question, since that might help debug it a little better.
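As a hedged workaround (not necessarily what was tried above), the heaviest columns could be excluded from stats generation entirely, assuming `StatsOptions.feature_allowlist` is available in this TFDV version; the feature names below are placeholders:

```python
import tensorflow_data_validation as tfdv

# Sketch: only compute statistics for a known-safe subset of features,
# leaving out the high-cardinality string columns suspected of causing OOMs.
# 'feature_a' and 'feature_b' are placeholder names.
stats_options = tfdv.StatsOptions(
    feature_allowlist=['feature_a', 'feature_b'],
)
```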
The support case number is 28706271 and includes some more details. I can cc people on the ticket if need be.
Can you clarify whether there are any workarounds for this issue? We are seeing this OOM issue almost constantly with 1.2.0. The only solution seems to be to change
I was having this same issue with my datasets and opened a GCP ticket and everything. Eventually it came down to my instances not being large enough, even though I thought they were. After experimenting I ended up on

You said you tried
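For anyone tuning the same knobs, here is a minimal sketch of the Dataflow worker flags discussed in this thread; the project, region, bucket, and values are illustrative only, not a recommendation:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: fewer harness threads on a larger worker, so each thread gets more
# RAM. Flag names are standard Beam/Dataflow worker options; values and
# project/bucket names are placeholders.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--machine_type=n2-highmem-16',
    '--number_of_worker_harness_threads=4',
])
```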
@cyc - TFDV 1.4.0 was released last week, and we expect that it will reduce the number of OOM errors. Has anyone tried it yet?
Hi @rcrowe-google, thanks for the heads-up. Could you point to the specific changes or commits that should help with OOM errors? Or let me know if there are any particular settings I should try adding, such as the experimental sketch-based generators? I can experiment with upgrading to 1.4.0 later this week.
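A minimal sketch of enabling the sketch-based generators mentioned above, assuming the option is still exposed as `experimental_use_sketch_based_topk_uniques` in the 1.2.x/1.4.x releases:

```python
import tensorflow_data_validation as tfdv

# Sketch: swap the exact top-k/uniques generators for sketch-based ones,
# which should bound memory for high-cardinality string features.
# The exact option name in this TFDV version is an assumption.
stats_options = tfdv.StatsOptions(
    experimental_use_sketch_based_topk_uniques=True,
)
```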
I tried TFDV 1.4.0 and compared it to 1.2.0 on a relatively small dataset (~100M rows, ~5k columns) and still ran into OOM issues on Dataflow with
@cyc - Thanks for taking the time to test that, and I'm very disappointed that it didn't work. I think what we need most at this point is code and data that we can use to reproduce the exact problem. Can you share that either here or in the GCP case?
Hi, is there any update on this? I'm running into the exact same OOM issue. I have enabled enable_dynamic_thread_scaling. I suspect this could be due to a hot-key issue. Is there any filter we can apply to drop hot keys during GenerateStatistics?
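I'm not aware of a built-in hot-key filter in GenerateStatistics, but as a rough sketch one could pre-filter batches in a hand-rolled Beam pipeline before they reach TFDV. Here `looks_like_hot_batch` is a hypothetical predicate you would need to write yourself, and all paths are placeholders:

```python
import apache_beam as beam
import tensorflow_data_validation as tfdv
from tfx_bsl.public import tfxio


def looks_like_hot_batch(record_batch) -> bool:
    """Hypothetical predicate: return True for record batches worth keeping."""
    return True  # placeholder logic; inspect the batch's columns here


example_tfxio = tfxio.TFExampleRecord(
    file_pattern='gs://my-bucket/data/*.tfrecord')  # placeholder path

with beam.Pipeline() as p:
    _ = (
        p
        | 'Read' >> example_tfxio.BeamSource(batch_size=1024)
        | 'DropHotBatches' >> beam.Filter(looks_like_hot_batch)
        | 'GenerateStatistics' >> tfdv.GenerateStatistics(tfdv.StatsOptions())
        | 'WriteStats' >> tfdv.WriteStatisticsToTFRecord(
            output_path='gs://my-bucket/stats/stats.tfrecord'))  # placeholder
```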
I am using TFDV 1.2.0 and have a problem where I am consistently getting workers OOMing on Dataflow even with very large instance types (e.g. n2-highmem-16 and n2-highmem-32). I've tried decreasing --number_of_worker_harness_threads to half the available number of CPUs in hopes of decreasing memory usage, to no avail. The OOMs typically happen at the very end of the job, when it is done loading the data but still trying to combine the stats together.

The dataset itself is quite large, with 4000-5000 features spanning ~1e9 rows (and some features are quite long varlen). Do you have any tips for how to decrease memory usage on workers? I suspect that the issue may come from some high-cardinality (~1e8) string features we have, as computing the frequency counts for these features is probably very memory-intensive.
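Roughly, the launch looks like the sketch below; paths, project, and values are placeholders, and the flags mirror the settings described above:

```python
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch of the kind of launch described above; names and values are placeholders.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--machine_type=n2-highmem-32',
    '--number_of_worker_harness_threads=16',  # half of the 32 vCPUs
])

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/data/*.tfrecord',
    output_path='gs://my-bucket/stats/stats.tfrecord',
    stats_options=tfdv.StatsOptions(),
    pipeline_options=pipeline_options,
)
```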