[SPARK-23202][SQL] Add new API in DataSourceWriter: onDataWriterCommit #20454
Test build #86871 has finished for PR 20454 at commit
Test build #86875 has finished for PR 20454 at commit
Test build #86877 has finished for PR 20454 at commit
retest this please.
 *
 * If this method fails (by throwing an exception), this writing job is considered to have been
 * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
 * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
What does it mean that "the state of the destination is undefined"? I think it is sufficient to say that abort will be called and the contract for aborting commits applies.
makes sense, let's remove the last sentence.
+1 I'd rather not add features without a known use case, but this implementation looks good to me.
Test build #86884 has finished for PR 20454 at commit
Test build #86914 has finished for PR 20454 at commit
Adding a default method to a Java interface is binary compatible. I'm merging this to master only, following @rxin's suggestion about not adding new stuff to 2.3. Thanks!
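The compatibility point can be illustrated with a minimal sketch. `Writer`, `Message`, and `LegacyWriter` are hypothetical stand-ins rather than Spark's classes, and a single source file can only show that an implementer which never overrides the new method still satisfies the interface; it does not by itself demonstrate linking against a previously compiled binary.

```java
public class DefaultMethodSketch {
    /** Hypothetical stand-in for a commit message type. */
    interface Message {}

    /** Hypothetical writer interface, not Spark's DataSourceWriter. */
    interface Writer {
        void commit(Message[] messages);

        // Added later. Because it has a default body, implementers
        // written before it existed do not have to override it.
        default void onDataWriterCommit(Message message) {}
    }

    /** Written against the older interface shape: it only implements commit. */
    static final class LegacyWriter implements Writer {
        int committedCount = 0;

        @Override
        public void commit(Message[] messages) {
            committedCount = messages.length;
        }
    }

    /** Drives a LegacyWriter through n per-writer callbacks and a final commit. */
    static int run(int n) {
        LegacyWriter writer = new LegacyWriter();
        Message[] messages = new Message[n];
        for (int i = 0; i < n; i++) {
            messages[i] = new Message() {};
            writer.onDataWriterCommit(messages[i]); // inherited no-op
        }
        writer.commit(messages);
        return writer.committedCount;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```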
The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting them into files) before all the messages are collected.
The proposal is to add a new API: `add(WriterCommitMessage message)`: Handles a commit message on receiving it from a successful data writer. This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.
There was another radical attempt in apache#20386. This one should be more reasonable.
Unit test
Author: Wang Gengliang <[email protected]>
Closes apache#20454 from gengliangwang/write_api.
(cherry picked from commit 9907bcfa045f96fb23822dc10eb3a2a42a6832d4)
Conflicts: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
What changes were proposed in this pull request?
The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving a commit message, the driver can start processing messages (e.g. persisting them into files) before all the messages are collected.
The proposal is to add a new API:
`add(WriterCommitMessage message)`: Handles a commit message on receiving it from a successful data writer.
This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.
There was another radical attempt in #20386. This one should be more reasonable.
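A minimal sketch of the idea, assuming the method name from the PR title (`onDataWriterCommit`; this description calls it `add`). The interfaces below are simplified stand-ins declared locally, not Spark's real `DataSourceWriter` or `WriterCommitMessage`, and `FileCommitMessage`, `TrackingWriter`, and `runJob` are hypothetical names used only for illustration. The driver loop is simulated in-process.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalCommitSketch {
    /** Stand-in for a commit message marker interface. */
    interface WriterCommitMessage {}

    /** Simplified stand-in for the writer API discussed in this PR. */
    interface DataSourceWriter {
        // Called on the driver each time one data writer commits successfully.
        // The default is a no-op, so existing implementations need no change.
        default void onDataWriterCommit(WriterCommitMessage message) {}

        void commit(WriterCommitMessage[] messages);

        void abort(WriterCommitMessage[] messages);
    }

    /** Hypothetical message carrying the path a task wrote. */
    static final class FileCommitMessage implements WriterCommitMessage {
        final String path;

        FileCommitMessage(String path) {
            this.path = path;
        }
    }

    /** Hypothetical writer that records each message as soon as it arrives. */
    static final class TrackingWriter implements DataSourceWriter {
        final List<String> persisted = new ArrayList<>();
        boolean committed = false;

        @Override
        public void onDataWriterCommit(WriterCommitMessage message) {
            // Process the message immediately instead of waiting for the full array.
            persisted.add(((FileCommitMessage) message).path);
        }

        @Override
        public void commit(WriterCommitMessage[] messages) {
            committed = true;
        }

        @Override
        public void abort(WriterCommitMessage[] messages) {
            persisted.clear();
        }
    }

    /** Simulates a driver that forwards each task's message, then commits the job. */
    static TrackingWriter runJob(String... paths) {
        TrackingWriter writer = new TrackingWriter();
        WriterCommitMessage[] messages = new WriterCommitMessage[paths.length];
        for (int i = 0; i < paths.length; i++) {
            messages[i] = new FileCommitMessage(paths[i]);
            writer.onDataWriterCommit(messages[i]);
        }
        writer.commit(messages);
        return writer;
    }

    public static void main(String[] args) {
        TrackingWriter writer = runJob("part-00000", "part-00001");
        System.out.println(writer.persisted + " committed=" + writer.committed);
    }
}
```

With a per-message callback, work such as persisting commit metadata can overlap with tasks that are still running, rather than happening only inside the final `commit` call.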
How was this patch tested?
Unit test