[GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion #3693

Merged
ZihanLi58 merged 5 commits into apache:master from ZihanLi58:GOBBLIN-1830
May 19, 2023

@ZihanLi58
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Currently, we rely on the flush event to indicate when a container transition occurs. However, if ingestion fails before the data is flushed successfully, that event is never emitted, so we cannot tell which container was responsible. This becomes a problem when the pipeline restarts: identifying the root cause means manually reviewing the logs of thousands of containers. To avoid this, we need a more reliable way to track container transitions that does not depend solely on the flush event.

This PR emits the event when the extractor is first initialized.
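
As a rough illustration of the idea (not the code in this PR), the sketch below emits a single tracking event from an extractor's constructor, before any record is read or flushed. Every name here (`TrackingEventSink`, `InMemorySink`, `SketchStreamingExtractor`, the `ExtractorInitialized` event name) is hypothetical and stands in for whatever the real Gobblin classes do. The metadata keys are borrowed from the log output in the Tests section.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an event submitter; this is NOT Gobblin's actual API. */
interface TrackingEventSink {
  void submit(String eventName, Map<String, String> metadata);
}

/** Keeps events in memory so the sketch runs without a metrics backend. */
class InMemorySink implements TrackingEventSink {
  final List<Map<String, String>> events = new ArrayList<>();

  @Override
  public void submit(String eventName, Map<String, String> metadata) {
    Map<String, String> event = new LinkedHashMap<>(metadata);
    event.put("eventName", eventName);
    events.add(event);
  }
}

/** Hypothetical streaming extractor that announces itself once, at construction time. */
class SketchStreamingExtractor {
  static final String INIT_EVENT = "ExtractorInitialized"; // assumed event name

  private long recordsRead = 0;

  SketchStreamingExtractor(TrackingEventSink sink, String containerId, String topic, int partition) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("containerId", containerId);
    metadata.put("topic", topic);
    metadata.put("partition", Integer.toString(partition));
    // Emitted before any read or flush, so a container that later
    // fails to flush can still be tied to this topic partition.
    sink.submit(INIT_EVENT, metadata);
  }

  /** Reading records does not emit further initialization events. */
  void readRecord() {
    recordsRead++;
  }

  long getRecordsRead() {
    return recordsRead;
  }
}

class ExtractorInitDemo {
  public static void main(String[] args) {
    InMemorySink sink = new InMemorySink();
    SketchStreamingExtractor extractor = new SketchStreamingExtractor(
        sink, "container_e32_1683322902150_274389_01_000003",
        "PremiumInsightsNotableAlumniImpressionEvent", 0);
    extractor.readRecord();
    // Prints the single event recorded at initialization.
    System.out.println(sink.events);
  }
}
```

Because the event is tied to construction rather than to a successful flush, it is recorded even if the container fails before its first flush, which is the gap described above.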

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Covered by a unit test; the emitted event information is printed below:
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: jobName
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: KafkaHdfsStreamingTrackingOrcTest
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: helixInstance
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: GobblinYarnTaskRunner_1
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: taskAttemptId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: GobblinYarnTaskRunner_1
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: kafkaTopic
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: PremiumInsightsNotableAlumniImpressionEvent
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: dataset.urn
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value:
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: clusterIdentifier
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: holdem
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: construct
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: Extractor
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: metricContextName
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: org.apache.gobblin.prototype.kafka.KafkaAvroBinaryStreamingExtractor.510352353
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: jobId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: job_KafkaHdfsStreamingTrackingOrcTest_1683594035964
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: helixTaskId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: b5ef4c8b-7b3d-47fb-9380-d92c11509050
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: partition
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: 0
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: metricContextID
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: 7355f230-dad9-450f-b8e4-4e4c6a0e0b87
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: helixJobId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: job_KafkaHdfsStreamingTrackingOrcTest_1683594035964_job_KafkaHdfsStreamingTrackingOrcTest_1683594035964
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: topic
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: PremiumInsightsNotableAlumniImpressionEvent
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: containerNode
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: xxxx.xxx.xx.xx (redacted because it is internal information)
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: containerId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: container_e32_1683322902150_274389_01_000003
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: class
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: org.apache.gobblin.prototype.kafka.KafkaAvroBinaryStreamingExtractor
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$ key: taskId
    2023-05-08 18:03:11 PDT INFO [TaskStateModelFactory-task_thread-0] org.apache.gobblin.metrics.MetricContext - $$$value: task_KafkaHdfsStreamingTrackingOrcTest_1683594035964_0

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter

codecov-commenter commented May 9, 2023

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 46.99%. Comparing base (0c014ee) to head (11c93f3).
⚠️ Report is 432 commits behind head on master.

Files with missing lines | Patch % | Lines
...tractor/extract/kafka/KafkaStreamingExtractor.java | 75.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3693      +/-   ##
============================================
- Coverage     49.48%   46.99%   -2.50%     
- Complexity     9309    10793    +1484     
============================================
  Files          1756     2138     +382     
  Lines         68394    84077   +15683     
  Branches       7794     9342    +1548     
============================================
+ Hits          33846    39511    +5665     
- Misses        31416    40971    +9555     
- Partials       3132     3595     +463     


@ZihanLi58 ZihanLi58 merged commit 85b0a1e into apache:master May 19, 2023
phet added a commit to phet/gobblin that referenced this pull request Aug 15, 2023
* upstream/master:
  Fix bug with total count watermark whitelist (apache#3724)
  [GOBBLIN-1858] Fix logs relating to multi-active lease arbiter (apache#3720)
  [GOBBLIN-1838] Introduce total count based completion watermark (apache#3701)
  Correct num of failures (apache#3722)
  [GOBBLIN- 1856] Add flow trigger handler leasing metrics (apache#3717)
  [GOBBLIN-1857] Add override flag to force generate a job execution id based on gobbl… (apache#3719)
  [GOBBLIN-1855] Metadata writer tests do not work in isolation after upgrading to Iceberg 1.2.0 (apache#3718)
  Remove unused ORC writer code (apache#3710)
  [GOBBLIN-1853] Reduce # of Hive calls during schema related updates (apache#3716)
  [GOBBLIN-1851] Unit tests for MysqlMultiActiveLeaseArbiter with Single Participant (apache#3715)
  [GOBBLIN-1848] Add tags to dagmanager metrics for extensibility (apache#3712)
  [GOBBLIN-1849] Add Flow Group & Name to Job Config for Job Scheduler (apache#3713)
  [GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup (apache#3708)
  [GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time (apache#3704)
  [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched (apache#3711)
  [GOBBLIN-1842] Add timers to GobblinMCEWriter (apache#3703)
  [GOBBLIN-1844] Ignore workflows marked for deletion when calculating container count (apache#3709)
  [GOBBLIN-1846] Validate Multi-active Scheduler with Logs (apache#3707)
  [GOBBLIN-1845] Changes parallelstream to stream in DatasetsFinderFilteringDecorator  to avoid classloader issues in spark (apache#3706)
  [GOBBLIN-1843] Utility for detecting non optional unions should convert dataset urn to hive compatible format (apache#3705)
  [GOBBLIN-1837] Implement multi-active, non blocking for leader host (apache#3700)
  [GOBBLIN-1835]Upgrade Iceberg Version from 0.11.1 to 1.2.0 (apache#3697)
  Update CHANGELOG to reflect changes in 0.17.0
  Reserving 0.18.0 version for next release
  [GOBBLIN-1836] Ensuring Task Reliability: Handling Job Cancellation and Graceful Exits for Error-Free Completion (apache#3699)
  [GOBBLIN-1805] Check watermark for the most recent hour for quiet topics (apache#3698)
  [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail (apache#3687)
  [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (apache#3692)
  [GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion (apache#3693)
  [GOBBLIN-1833]Emit Completeness watermark information in snapshotCommitEvent (apache#3696)