
Commit dbc6f40

Revert Flink 1 task slot per TM and bump parallelism (#565)
## Summary
We see that our existing Flink jobs (beacon listing actions) are just a touch overscaled. This seems to work for absorbing event spikes, but it can be problematic when we're catching up after the job has been down for some time. This PR bumps our parallelism and reverts the setting where we ran with 1 task slot per TM. We don't need that anymore, as we've patched our catalyst code to handle generate exec nodes in the plan, so we can go back to running with multiple task slots per TM. We'll need the same resources as prior to this PR but get 2x the parallelism, which allows us to catch up quicker.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit
- **Chores**
  - Enhanced resource management and processing parallelism to improve performance under load.
  - Adjusted data scaling for more efficient and responsive streaming operations.
1 parent 9f3ac8d commit dbc6f40
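As a rough sanity check of the "same resources, 2x the parallelism" claim, here is a back-of-envelope sketch. The per-TM memory, slots-per-TM, and scale factors come from the diffs below; the 64-partition topic and the slot-to-TM accounting (TaskManagers requested ≈ ceil(parallelism / slots per TM)) are illustrative assumptions, not code from this repo.

```scala
// Back-of-envelope sizing sketch (illustrative only, not code from this repo).
// Per-TM memory, slots-per-TM, and the scale factors come from this commit;
// the 64-partition topic and the TM accounting are assumptions for illustration.
object ParallelismSizing extends App {
  final case class Sizing(slotsPerTm: Int, tmMemoryGb: Int, scaleFactor: Double)

  val before = Sizing(slotsPerTm = 1, tmMemoryGb = 32, scaleFactor = 0.125)
  val after  = Sizing(slotsPerTm = 4, tmMemoryGb = 64, scaleFactor = 0.25)

  val partitions = 64 // hypothetical Kafka topic size

  def report(label: String, s: Sizing): Unit = {
    // parallelism as derived in KafkaFlinkSource: ceil(partitions * scaleFactor)
    val parallelism = math.ceil(partitions * s.scaleFactor).toInt
    // Flink spreads the subtasks across slots, so it needs roughly
    // ceil(parallelism / slotsPerTm) TaskManagers.
    val taskManagers = math.ceil(parallelism.toDouble / s.slotsPerTm).toInt
    val totalMemGb   = taskManagers * s.tmMemoryGb
    val perSlotGb    = s.tmMemoryGb / s.slotsPerTm
    println(f"$label%-7s parallelism=$parallelism%3d  TMs=$taskManagers%3d  total=${totalMemGb}G  per-slot=${perSlotGb}G")
  }

  report("before", before) // parallelism=8,  TMs=8, total=256G, per-slot=32G
  report("after", after)   // parallelism=16, TMs=4, total=256G, per-slot=16G
}
```

Under these assumptions the total memory footprint stays the same while parallelism doubles; the trade-off is that per-slot memory drops from 32G to 16G.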

File tree

2 files changed: +8 -7 lines changed

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala

Lines changed: 7 additions & 6 deletions
@@ -121,17 +121,18 @@ class DataprocSubmitter(jobControllerClient: JobControllerClient, conf: Submitte
     val envProps =
       Map(
         "jobmanager.memory.process.size" -> "4G",
-        "taskmanager.memory.process.size" -> "32G",
-        "taskmanager.memory.network.min" -> "512m",
-        "taskmanager.memory.network.max" -> "1G",
+        "taskmanager.memory.process.size" -> "64G",
+        "taskmanager.memory.network.min" -> "1G",
+        "taskmanager.memory.network.max" -> "2G",
         // explicitly set the number of task slots as otherwise it defaults to the number of cores
-        // we go with one task slot per TM as we do see issues with Spark setting updates not being respected when there's multiple slots/TM
-        "taskmanager.numberOfTaskSlots" -> "1",
+        // we go with multiple slots per TM as it allows us to squeeze more parallelism out of our resources
+        // this is something we can revisit if we update Spark settings in CatalystUtil as we occasionally see them being overridden
+        "taskmanager.numberOfTaskSlots" -> "4",
         "taskmanager.memory.managed.fraction" -> "0.5f",
         // default is 256m, we seem to be close to the limit so we give ourselves some headroom
         "taskmanager.memory.jvm-metaspace.size" -> "512m",
         // bump this a bit as Kafka and KV stores often need direct buffers
-        "taskmanager.memory.task.off-heap.size" -> "512m",
+        "taskmanager.memory.task.off-heap.size" -> "1G",
         "yarn.classpath.include-user-jar" -> "FIRST",
         "state.savepoints.dir" -> flinkStateUri,
         "state.checkpoints.dir" -> flinkStateUri,

flink/src/main/scala/ai/chronon/flink/KafkaFlinkSource.scala

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ class BaseKafkaFlinkSource[T](kafkaBootstrap: Option[String],
     TopicChecker.topicShouldExist(topicInfo.name, bootstrap, topicInfo.params)

     // we use a small scale factor as topics are often over partitioned. We can make this configurable via topicInfo
-    val scaleFactor = 0.125
+    val scaleFactor = 0.25

     implicit val parallelism: Int = {
       math.ceil(TopicChecker.getPartitions(topicInfo.name, bootstrap, topicInfo.params) * scaleFactor).toInt
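For concreteness, a minimal standalone sketch of the parallelism derivation in the hunk above; the partition counts here are hypothetical, while the real code reads them from Kafka via TopicChecker.getPartitions.

```scala
// Sketch of the parallelism derivation above. Partition counts are hypothetical;
// the production code gets them from Kafka via TopicChecker.getPartitions.
def parallelismFor(partitions: Int, scaleFactor: Double = 0.25): Int =
  math.ceil(partitions * scaleFactor).toInt

println(parallelismFor(96)) // 24 (would have been 12 with the old 0.125 factor)
println(parallelismFor(3))  // 1  (the ceil keeps small topics at parallelism >= 1)
```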
