
Conversation

@nsivabalan (Contributor) commented Nov 10, 2021

What is the purpose of the pull request

We might have to re-think enabling timeline-server-based markers as the default.
In the case of structured streaming, once the source stream completes, it shuts down the timeline server, and so some writes to Hudi fail while creating (timeline-server-based) markers. I need to understand this flow better to determine whether we need to fix structured streaming or whether this is an inherent constraint.

#3950 is a prerequisite to land this PR; if not, I might have to duplicate the test fixes. So I will first land #3950 and then attempt to fix the remaining test failures. I have already done one pass; except for structured streaming, I don't see any other major issues.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan (Contributor, Author) commented Nov 12, 2021

@vinothchandar : wrt structured streaming and the timeline server being closed ahead of time, this is what I see. At the end of the first micro-batch, the write client is closed, which triggers closure of the timeline service. Subsequent micro-batches do succeed, though.

I added some logs for testStructuredStreaming (using direct markers):

1833 [main] WARN  org.apache.spark.util.Utils  - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
streaming starting
10085 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - writing to HoodieSparkSqlWriter
10381 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieClient  - Constructor 
10381 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieClient  - Starting ETL server
10381 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieClient  - Creating embedded timeline server 
10387 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.embedded.EmbeddedTimelineService  - Starting timeline server :: 
10625 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieWriteClient  - Constructor 
13093 [pool-18-thread-2] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152238/2016/03/15/f743c5c8-6b42-4d4d-93df-2a34d7a3e9d8-0_0-26-248_20211112152238.parquet.marker.CREATE
13802 [pool-20-thread-2] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152238/2015/03/16/ff7dc47f-3c79-4472-98b0-09b716a5698a-0_1-32-249_20211112152238.parquet.marker.CREATE
13802 [pool-19-thread-2] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152238/2015/03/17/e9193d74-3e04-4096-8e05-a3b872df0a0e-0_2-32-250_20211112152238.parquet.marker.CREATE
14231 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Returning marker files 
14430 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieWriteClient  - Committing 20211112152238 commit
14431 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieWriteClient  - Close() 
14431 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.embedded.EmbeddedTimelineService  - Trying to close timeline server
14431 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.embedded.EmbeddedTimelineService  - Closing Timeline server
14446 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - Micro batch id=0 succeeded for commit=20211112152238
14446 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - Micro batch id=0 succeeded
14494 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor  - Current batch is falling behind. The trigger interval is 100 milliseconds, but spent 4614 milliseconds
15031 [ForkJoinPool-1-worker-11] WARN  org.apache.spark.util.Utils  - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
15571 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - writing to HoodieSparkSqlWriter
17992 [Executor task launch worker for task 315] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152243/2016/03/15/f743c5c8-6b42-4d4d-93df-2a34d7a3e9d8-0_0-67-315_20211112152243.parquet.marker.MERGE
18353 [Executor task launch worker for task 316] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152243/2015/03/16/ff7dc47f-3c79-4472-98b0-09b716a5698a-0_1-73-316_20211112152243.parquet.marker.MERGE
18353 [Executor task launch worker for task 317] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Creating marker file /var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/.temp/20211112152243/2015/03/17/e9193d74-3e04-4096-8e05-a3b872df0a0e-0_2-73-317_20211112152243.parquet.marker.MERGE
18790 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.table.marker.DirectWriteMarkers  - Returning marker files 
18918 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieWriteClient  - Committing 20211112152243 commit
18918 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.client.AbstractHoodieWriteClient  - Close() 
18918 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - Micro batch id=1 succeeded for commit=20211112152243
18918 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.hudi.HoodieStreamingSink  - Micro batch id=1 succeeded
18952 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor  - Current batch is falling behind. The trigger interval is 100 milliseconds, but spent 3472 milliseconds
streaming ends
20593 [ForkJoinPool-1-worker-11] WARN  org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not found at path file:/var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/metadata
22315 [ForkJoinPool-1-worker-11] WARN  org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not found at path file:/var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/dest/.hoodie/metadata
23465 [main] WARN  org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system instance used in previous test-run
23482 [stream execution thread for [id = 41aab979-66f5-4a58-82bc-751838639d48, runId = e3a570e9-9cb2-4216-af1d-791f563be9ba]] WARN  org.apache.spark.sql.execution.datasources.InMemoryFileIndex  - The directory file:/var/folders/ym/8yjkm3n90kq8tk4gfmvk7y140000gn/T/junit3225050705126728730/dataset/source was not found. Was it deleted very recently?

The test was using direct markers for the purpose of collecting these logs. If I switch to timeline-server-based markers (the intent of this patch), the test fails since the marker creations of the second batch fail.
Do you think we need to revisit the closure of the write client in the structured streaming code?
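The lifecycle visible in the logs above can be sketched as a small, self-contained simulation (this is hypothetical illustration code, not actual Hudi classes): the embedded timeline server is started with the first write client and shut down when that client is closed at the end of micro-batch 0, and it is not restarted for micro-batch 1, so only marker types that do not depend on the server keep working.

```java
// Hypothetical sketch (not Hudi code): models why per-micro-batch write client
// closure breaks timeline-server-based markers but leaves direct markers working.
public class MarkerLifecycleSketch {
    // Tracks whether the embedded timeline server is up.
    static boolean serverRunning = false;

    // Direct markers write files straight to storage; timeline-server-based
    // markers need the embedded server to be reachable.
    static boolean canCreateMarker(boolean timelineServerBased) {
        return !timelineServerBased || serverRunning;
    }

    public static void main(String[] args) {
        serverRunning = true;   // first write client starts the server
        System.out.println("batch 0, timeline-server marker ok: " + canCreateMarker(true));
        serverRunning = false;  // client.close() after batch 0 ("Closing Timeline server" in the log)
        System.out.println("batch 1, direct marker ok: " + canCreateMarker(false));
        System.out.println("batch 1, timeline-server marker ok: " + canCreateMarker(true));
    }
}
```

Under this model, fixing the test requires either keeping the timeline server alive across micro-batches or restarting it with each new write client.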

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Nov 15, 2021
@nsivabalan nsivabalan force-pushed the enableRollbackUsingMarkers branch from 0bb1d33 to 09f87e9 Compare November 17, 2021 10:43
@nsivabalan nsivabalan force-pushed the enableRollbackUsingMarkers branch from b897418 to 0ab4fdb Compare November 18, 2021 19:48
@nsivabalan (Contributor, Author) commented:

@hudi-bot azure run

@nsivabalan (Contributor, Author) commented:

@vinothchandar : I am thinking: for users who explicitly disable the timeline server, should we fall back to direct-style markers?

@hudi-bot (Collaborator) commented:

CI report:

  • b897418a9be2a1bbf4efd3d2a191d1d1f96f1a1e UNKNOWN
  • 6f83276 Azure: FAILURE
Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure : re-run the last Azure build

@nsivabalan (Contributor, Author) commented:

@hudi-bot azure run

@nsivabalan nsivabalan changed the title [HUDI-2151] Part 4 Enabling timeline server based marker as default [HUDI-2767] Enabling timeline server based marker as default Nov 19, 2021
@leesf (Contributor) commented Nov 21, 2021

@vinothchandar : I am thinking: for users who explicitly disable the timeline server, should we fall back to direct-style markers?

@nsivabalan We have jobs that disable the timeline server in our production environment; for backward compatibility, we should fall back to using direct-style markers.
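The fallback discussed here could look like the minimal sketch below. This is illustrative code only: the class and method names are hypothetical, not actual Hudi APIs, though the `DIRECT` / `TIMELINE_SERVER_BASED` marker type names mirror the `MarkerType` enum used in the snippet further down.

```java
// Hypothetical sketch of the proposed fallback: if the user explicitly
// disables the embedded timeline server, resolve timeline-server-based
// markers back to direct markers so existing jobs keep working.
public class MarkerFallbackSketch {
    enum MarkerType { DIRECT, TIMELINE_SERVER_BASED }

    static MarkerType resolveMarkerType(MarkerType configured, boolean timelineServerEnabled) {
        if (configured == MarkerType.TIMELINE_SERVER_BASED && !timelineServerEnabled) {
            // Backward compatibility: jobs that run with the timeline server
            // disabled silently fall back to direct markers.
            return MarkerType.DIRECT;
        }
        return configured;
    }
}
```

The design choice here is to degrade silently rather than fail fast, which matches leesf's backward-compatibility concern for existing production jobs.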

.withEngineType(EngineType.JAVA)
.withPath(basePath)
.withSchema(schema.toString())
.withMarkersType(MarkerType.DIRECT.name())
Review comment (Contributor) on the snippet above:
Any plan/ticket to convert all these tests to timeline-server-based markers? Or do these have to stay the direct type by design?

@yihua (Contributor) commented Nov 25, 2021

I'm putting up another PR to flip the default and fix things: #4112. Closing this one.
