fix: ooms while reading large columns #332
Conversation
Walkthrough: The change adds a new configuration setting in the Spark session builder in `CatalystUtil.scala`.
piyush-zlai
left a comment
Maybe also add a comment on the rationale for future folks?
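One possible shape for that rationale comment, sketched on a minimal standalone session builder (the app name and master are placeholders and not the actual CatalystUtil code; only the `columnarReaderBatchSize` line reflects the change under review):

```scala
import org.apache.spark.sql.SparkSession

val session: SparkSession = SparkSession
  .builder()
  .appName("catalyst-util-sketch") // placeholder, not the real builder
  .master("local[*]")              // placeholder, not the real builder
  // Rationale: the default columnar reader batch size (4096) buffers that many
  // rows per batch and can OOM on very large columns; Catalyst evaluation only
  // needs a few rows at a time, so 16 keeps the per-batch buffer small.
  .config("spark.sql.parquet.columnarReaderBatchSize", "16")
  .getOrCreate()
```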
Actionable comments posted: 0
🧹 Nitpick comments (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)
66-66: Consider making the batch size configurable. Extract the value to a configuration parameter for flexibility.
- .config("spark.sql.parquet.columnarReaderBatchSize", "16") + .config("spark.sql.parquet.columnarReaderBatchSize", conf.getOrElse("COLUMNAR_READER_BATCH_SIZE", "16"))
📜 Review details
📒 Files selected for processing (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: non_spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)
66-66: Verify the batch size value. Let's confirm whether 16 is the optimal value for preventing OOMs.
✅ Verification successful
🌐 Web query:
What is the recommended value for spark.sql.parquet.columnarReaderBatchSize to prevent OOM issues?

💡 Result:

The recommended value for `spark.sql.parquet.columnarReaderBatchSize` depends on your data characteristics and memory constraints, but adjustments are often necessary when working with nested data or arrays to avoid Out-Of-Memory (OOM) errors:

1. **Default Value**: The default is **4096** rows per batch[1][6].
2. **Common Adjustments**:
   - Lower values like **32–64** are suggested for workloads involving arrays or nested structures[1].
   - In one example, reducing to **16** resolved OOM errors caused by excessive vector memory usage[1].
3. **Trade-offs**:
   - Smaller batches reduce per-task memory pressure but may slightly degrade read performance[1][4].
   - Larger batches improve throughput but risk OOM if data structures are complex[1][6].
4. **Tuning Strategy**: Start with defaults, monitor executor/driver memory during Parquet reads, and incrementally lower the batch size if OOM occurs (e.g., halve from 4096 → 2048 → 1024 → ...)[1][4]. For array-heavy datasets, begin testing at **32–128**[1][10].

Example configuration for memory-constrained scenarios:

```properties
spark.sql.parquet.columnarReaderBatchSize 32
```

This balances memory safety with performance when handling nested schemas[1][4]. Always validate through workload-specific testing.

Citations:
- 1: https://nxcals-docs.web.cern.ch/1.4.7/user-guide/spark-tuning/
- 2: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
- 3: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
- 4: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-properties.html
- 5: https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/
- 6: https://spark.apache.org/docs/3.5.0/configuration.html
- 7: https://books.japila.pl/spark-sql-internals/configuration-properties/
- 8: https://community.cloudera.com/t5/Support-Questions/OOM-issues-when-writing-into-parquet/m-p/372532/highlight/true
- 9: https://spark.apache.org/docs/latest/configuration.html
- 10: https://www.clairvoyant.ai/blog/apache-spark-out-of-memory-issue
CatalystUtil.scala batch size validation:
- The value of 16 is a valid approach, as examples exist where reducing to 16 resolved OOM issues.
- Although 32–64 is common for some workloads, 16 is acceptable if it effectively mitigates memory overflows.
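As a quick sanity check (a sketch assuming a live `SparkSession` named `spark`, e.g. the one built by CatalystUtil), the effective value can be read back at runtime:

```scala
// Read the effective setting from the running session and fail fast if it
// does not match the intended 16-row batch size.
val effectiveBatchSize: String =
  spark.conf.get("spark.sql.parquet.columnarReaderBatchSize")
assert(effectiveBatchSize == "16", s"unexpected batch size: $effectiveBatchSize")
```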
done
tchow-zlai
left a comment
LGTM
Force-pushed from 6af8716 to 3082b27
Force-pushed from 3082b27 to 2bdb1ec
Summary
The default columnar reader batch size is 4096, which reads that many rows into a memory buffer at once.
That causes OOMs on large columns. For Catalyst we only need to read one row at a time; for interactive use we set the limit to 16.
Tested on Etsy data.
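For a rough sense of scale (illustrative numbers only, not measurements from this PR): if a single large column holds values on the order of 1 MB per row, the default 4096-row batch buffers roughly 4 GB for that column alone, while a 16-row batch buffers only about 16 MB.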
Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
Summary by CodeRabbit
- New Features
  - Enhanced data processing performance by adding an optimized configuration for reading Parquet files in Spark sessions.