# perf: shuffle-free temporal join #793
New file: a driver script that runs `UnionJoinTest` under heap profiling.

```bash
#!/bin/bash

# UnionJoinTest with Manual Heap Dump Generation

cd /Users/nsimha/repos/chronon
```
**Contributor:** Maybe we add this file in platform instead of chronon?

**Contributor (Author):** hmm - we can move it later if it indeed belongs there. I am not convinced that it should be far from the code it is meant to profile.
The rest of the script:

```bash
# Build the test target
echo "Building UnionJoinTest..."
bazel build //spark:join_test_test_suite_src_test_scala_ai_chronon_spark_test_join_UnionJoinTest.scala

# Get the generated script path
BAZEL_SCRIPT="bazel-bin/spark/join_test_test_suite_src_test_scala_ai_chronon_spark_test_join_UnionJoinTest.scala"

if [[ ! -f "$BAZEL_SCRIPT" ]]; then
  echo "Error: Bazel script not found at $BAZEL_SCRIPT"
  exit 1
fi

# Set up runfiles directory for classpath resolution
RUNFILES_DIR="$(pwd)/bazel-bin/spark/join_test_test_suite_src_test_scala_ai_chronon_spark_test_join_UnionJoinTest.scala.runfiles"
export JAVA_RUNFILES="$RUNFILES_DIR"
export TEST_SRCDIR="$RUNFILES_DIR"

# Extract and expand classpath
RUNPATH="${RUNFILES_DIR}/chronon/"
RUNPATH="${RUNPATH#$PWD/}"
RAW_CLASSPATH=$(sed -n '249p' "$BAZEL_SCRIPT" | cut -d'"' -f2)
CLASSPATH=$(echo "$RAW_CLASSPATH" | sed "s|\${RUNPATH}|$RUNPATH|g")

if [[ -z "$CLASSPATH" ]]; then
  echo "Error: Failed to extract classpath"
  exit 1
fi

echo "Successfully extracted classpath (${#CLASSPATH} characters)"

# Create heap dump file path
HEAP_DUMP_FILE="/Users/nsimha/repos/chronon/.ijwb/unionjoin-heapdump-$(date +%Y%m%d-%H%M%S).hprof"
echo "Heap dump will be saved to: $HEAP_DUMP_FILE"

# JVM settings with more aggressive heap dump options
JVM_OPTS="-Xmx8g -Xms2g" # Reduced max heap to force memory pressure
MODULE_OPENS="--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED"
SYSTEM_PROPS="-DRULES_SCALA_MAIN_WS_NAME=chronon -DRULES_SCALA_ARGS_FILE=spark/join_test_test_suite_src_test_scala_ai_chronon_spark_test_join_UnionJoinTest.scala.args"

# Heap dump and GC options. -Xlog:gc* is the JDK 9+ unified-logging
# replacement for the removed -XX:+PrintGC / -XX:+PrintGCDetails flags.
PROFILER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_FILE -XX:+UseG1GC -Xlog:gc*:unionjoin-gc-$(date +%Y%m%d-%H%M%S).log:time"

echo "Starting UnionJoinTest with heap profiling..."
echo "Memory settings: $JVM_OPTS"
echo "Working directory: $(pwd)"
echo ""

# Run in background and get PID for manual heap dump
java \
  $JVM_OPTS \
  $MODULE_OPENS \
  $PROFILER_OPTS \
  $SYSTEM_PROPS \
  -classpath "$CLASSPATH" \
  io.bazel.rulesscala.scala_test.Runner &

JAVA_PID=$!
echo "Java process started with PID: $JAVA_PID"

# Wait a bit for the process to start and consume some memory
echo "Waiting 30 seconds for test to initialize..."
sleep 30

# Check if process is still running
if kill -0 $JAVA_PID 2>/dev/null; then
  echo "Generating manual heap dump..."
  MANUAL_DUMP="/Users/nsimha/repos/chronon/.ijwb/unionjoin-manual-$(date +%Y%m%d-%H%M%S).hprof"
  # Encourage collection before dumping; GC.run is the jcmd command that triggers a GC
  jcmd $JAVA_PID GC.run_finalization
  jcmd $JAVA_PID GC.run
  jcmd $JAVA_PID GC.run
  jcmd $JAVA_PID GC.run_finalization
  sleep 5
  jcmd $JAVA_PID GC.run
  echo "Creating heap dump with jcmd..."
  jcmd $JAVA_PID GC.heap_dump "$MANUAL_DUMP"
  echo "Manual heap dump saved to: $MANUAL_DUMP"
else
  echo "Process already terminated"
fi

# Wait for the test to complete
wait $JAVA_PID
EXIT_CODE=$?

echo ""
echo "Test completed with exit code: $EXIT_CODE"
echo "Checking for heap dump files..."
ls -la /Users/nsimha/repos/chronon/.ijwb/*.hprof 2>/dev/null || echo "No heap dump files found"
```
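The `sed -n '249p'` extraction is brittle: it assumes the classpath sits on exactly line 249 of the generated runner script, which changes across Bazel versions. A hedged alternative (a sketch, assuming the generated script assigns a quoted `CLASSPATH="..."` somewhere, which is not verified here) selects the line by pattern instead of by number:

```bash
#!/bin/bash
# Sketch: pull the classpath out of a generated runner script by pattern
# rather than by hardcoded line number. Assumes a line of the form
# CLASSPATH="..." exists in the script (an assumption, not verified).
extract_classpath() {
  local script="$1"
  sed -n 's/^CLASSPATH="\([^"]*\)".*/\1/p' "$script" | head -n 1
}

# Demo against a synthetic runner script.
tmp="$(mktemp)"
printf '%s\n' '#!/bin/bash' 'CLASSPATH="a.jar:b.jar"' 'exec java -cp "$CLASSPATH" Main' > "$tmp"
extract_classpath "$tmp"   # prints a.jar:b.jar
rm -f "$tmp"
```

This degrades gracefully: if the pattern is absent the function prints nothing, and the script's existing `[[ -z "$CLASSPATH" ]]` guard still catches the failure.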
`Driver.scala` (package `ai.chronon.spark`) — import changes:

```diff
@@ -17,7 +17,7 @@
 package ai.chronon.spark

 import ai.chronon.api
-import ai.chronon.api.{Constants, DateRange, ThriftJsonCodec}
+import ai.chronon.api.{Constants, DateRange, PartitionRange, ThriftJsonCodec}
 import ai.chronon.api.Constants.MetadataDataset
```
**Contributor** (on lines +20 to 21): Remove unused `DateRange` import.

```diff
-import ai.chronon.api.{Constants, DateRange, PartitionRange, ThriftJsonCodec}
+import ai.chronon.api.{Constants, PartitionRange, ThriftJsonCodec}
```
```diff
 import ai.chronon.api.Extensions.{GroupByOps, JoinPartOps, MetadataOps, SourceOps}
 import ai.chronon.api.planner.RelevantLeftForJoinPart
@@ -27,6 +27,7 @@ import ai.chronon.online.fetcher.{ConfPathOrName, FetchContext, FetcherMain, Met
 import ai.chronon.orchestration.{JoinMergeNode, JoinPartNode}
 import ai.chronon.spark.batch._
 import ai.chronon.spark.catalog.{Format, TableUtils}
+import ai.chronon.spark.join.UnionJoin
 import ai.chronon.spark.stats.{CompareBaseJob, CompareJob, ConsistencyJob}
 import ai.chronon.spark.stats.drift.{Summarizer, SummaryPacker, SummaryUploader}
 import ai.chronon.spark.streaming.JoinSourceRunner
```
In `Driver.run` (lines 276+), a skew-free backfill path is added ahead of the regular `Join`:

```scala
  def run(args: Args): Unit = {
    val tableUtils = args.buildTableUtils()

    if (tableUtils.sparkSession.conf.get("spark.chronon.join.backfill.mode.skewFree", "false").toBoolean) {
      logger.info(s" >>> Running join backfill in skew free mode <<< ")
      val startPartition = args.startPartitionOverride.toOption.getOrElse(args.joinConf.left.query.startPartition)
      val endPartition = args.endDate()

      val joinName = args.joinConf.metaData.name
      val stepDays = args.stepDays.toOption.getOrElse(1)

      logger.info(
        s"Filling partitions for join:$joinName, partitions:[$startPartition, $endPartition], steps:$stepDays")

      val partitionRange = PartitionRange(startPartition, endPartition)(tableUtils.partitionSpec)
```
**Contributor:** We should be careful here. Defaulting to do that led to this bug, because users can technically define a custom partitionSpec at the Source level. We could do something like: cc @tchow-zlai

**Contributor:** do we know if all of their tables follow a consistent partition spec now? I remember we did ask them to staging query that one additional table. so maybe we're good here now

**Contributor (Author):** this is supposed to be the output partition spec. while reading left & right data we translate to output partition spec.
```scala
      val partitionSteps = partitionRange.steps(stepDays)

      partitionSteps.zipWithIndex.foreach { case (stepRange, idx) =>
        logger.info(s"Processing range $stepRange (${idx + 1}/${partitionSteps.length})")
        UnionJoin.computeJoinAndSave(args.joinConf, stepRange)(tableUtils)
        logger.info(s"Wrote range $stepRange (${idx + 1}/${partitionSteps.length})")
      }

      return
    }

    val join = new Join(
      args.joinConf,
      args.endDate(),
```
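`PartitionRange.steps(stepDays)` is not shown in this diff; assuming it splits the inclusive `[startPartition, endPartition]` range into `stepDays`-sized sub-ranges, the loop above walks chunks like this (a shell sketch, assumes GNU `date` and `yyyy-MM-dd` partitions):

```bash
#!/bin/bash
# Hypothetical analogue of partitionRange.steps(stepDays): print the
# inclusive [start, end] sub-ranges the driver would backfill one by one.
steps() {
  local cur="$1" end="$2" step_days="$3" step_end
  while [[ ! "$cur" > "$end" ]]; do   # ISO dates compare lexicographically
    step_end="$(date -d "$cur + $((step_days - 1)) days" +%F)"
    if [[ "$step_end" > "$end" ]]; then step_end="$end"; fi
    echo "[$cur, $step_end]"
    cur="$(date -d "$step_end + 1 day" +%F)"
  done
}

steps 2024-01-01 2024-01-05 2
# [2024-01-01, 2024-01-02]
# [2024-01-03, 2024-01-04]
# [2024-01-05, 2024-01-05]
```

Note the last step is clamped to the end partition, which is why a 5-day range with `stepDays = 2` yields a final single-day chunk.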
`GroupBy.scala` — new `inputDf` helper added in `object GroupBy` (lines 483+):

```scala
  def inputDf(groupByConfOld: api.GroupBy,
              queryRange: PartitionRange,
              tableUtils: TableUtils,
              computeDependency: Boolean = false): DataFrame = {

    logger.info(s"\n----[Processing GroupBy: ${groupByConfOld.metaData.name}]----")

    val groupByConf = replaceJoinSource(groupByConfOld, queryRange, tableUtils, computeDependency)

    val inputDf = groupByConf.sources.toScala
      .map { source =>
        sourceDf(groupByConf,
                 source,
                 groupByConf.getKeyColumns.toScala,
                 queryRange,
                 tableUtils,
                 groupByConf.maxWindow,
                 groupByConf.inferredAccuracy)
      }
      .reduce { (df1, df2) =>
        // align the columns by name - when one source has select * the ordering might not be aligned
        val columns1 = df1.schema.fields.map(_.name)
        df1.union(df2.selectExpr(columns1: _*))
      }

    def doesNotNeedTime = !Option(groupByConf.getAggregations).exists(_.toScala.needsTimestamp)
    def hasValidTimeColumn = inputDf.schema.find(_.name == Constants.TimeColumn).exists(_.dataType == LongType)

    require(
      doesNotNeedTime || hasValidTimeColumn,
      s"Time column, ts doesn't exist (or is not a LONG type) for groupBy ${groupByConf.metaData.name}, but you either have windowed aggregation(s) or time based aggregation(s) like: " +
        "first, last, firstK, lastK. \n" +
        "Please note that for the entities case, \"ts\" needs to be explicitly specified in the selects."
    )

    // at least one of the keys should be present in the row.
    val nullFilterClause = groupByConf.keyColumns.toScala.map(key => s"($key IS NOT NULL)").mkString(" OR ")
    inputDf.filter(nullFilterClause)
  }

  def from(groupByConfOld: api.GroupBy,
           queryRange: PartitionRange,
           tableUtils: TableUtils,
```

**Contributor** (on the `inputDf` signature): looks similar to …

**Contributor (Author):** didn't want to touch that - yet. It is very much in the critical path for everything.
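The null filter at the end of `inputDf` keeps a row when at least one key column is non-null. The predicate string it builds can be sketched outside Spark (a shell stand-in for the Scala `mkString(" OR ")`; the column names are illustrative, not from the PR):

```bash
#!/bin/bash
# Build "(k1 IS NOT NULL) OR (k2 IS NOT NULL) ..." from a key list,
# mirroring keyColumns.map(key => s"($key IS NOT NULL)").mkString(" OR ").
keys=(user_id device_id)
clause=""
for k in "${keys[@]}"; do
  # ${clause:+ OR } expands to " OR " only once clause is non-empty
  clause+="${clause:+ OR }($k IS NOT NULL)"
done
echo "$clause"   # (user_id IS NOT NULL) OR (device_id IS NOT NULL)
```

Because the keys are OR-ed rather than AND-ed, a row survives as long as any single key is populated, which matches the comment in the Scala code.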
**Review** (on the shell script):

🛠️ Refactor suggestion: Script contains hardcoded user-specific paths. The script is tied to a specific user's directory structure, limiting reusability. Make the script more portable. (Also applies to lines 20-21, 38, 73.)

Shellcheck 0.10.0, SC2164 (warning, line 5): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

**Review:** Add error handling for directory change. The `cd` command should handle failure cases.
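The committable suggestion bodies are not captured above; a sketch of the kind of preamble the two review comments imply (the `CHRONON_ROOT` override and the script-relative default are assumptions, not the reviewer's exact patch):

```bash
#!/bin/bash
# Resolve the repo root from the script's own location (or an env
# override) instead of a hardcoded /Users/nsimha path, and guard the
# cd per Shellcheck SC2164 so later commands never run from the wrong
# directory.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
REPO_ROOT="${CHRONON_ROOT:-$SCRIPT_DIR}"

cd "$REPO_ROOT" || { echo "Error: cannot cd to $REPO_ROOT" >&2; exit 1; }

# Derive output locations from the resolved root rather than a user home.
DUMP_DIR="$REPO_ROOT/.ijwb"
mkdir -p "$DUMP_DIR"
echo "Heap dumps will be written under: $DUMP_DIR"
```

With this shape, the heap-dump and `ls` paths later in the script would reference `$DUMP_DIR` instead of the absolute `/Users/nsimha/repos/chronon/.ijwb` prefix.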