Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jan 31, 2018

What changes were proposed in this pull request?

The current DataSourceWriter API makes it hard to implement onTaskCommit(taskCommit: TaskCommitMessage) in FileCommitProtocol.
In general, on receiving commit message, driver can start processing messages(e.g. persist messages into files) before all the messages are collected.

The proposal to add a new API:
add(WriterCommitMessage message): Handles a commit message on receiving from a successful data writer.

This should make the whole API of DataSourceWriter compatible with FileCommitProtocol, and more flexible.

There was another radical attempt in #20386. This one should be more reasonable.

How was this patch tested?

Unit test

@gengliangwang gengliangwang changed the title [SPARK-23202][SQL] Add new DataSourceWriter API: onDataWriterCommit [SPARK-23202][SQL] Add new API in DataSourceWriter: onDataWriterCommit Jan 31, 2018
@gengliangwang
Copy link
Member Author

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86871 has finished for PR 20454 at commit 89776ec.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86875 has finished for PR 20454 at commit 479694b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86877 has finished for PR 20454 at commit 5c873a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

*
* If this method fails (by throwing an exception), this writing job is considered to to have been
* failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
* is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean that "the state of the destination is undefined"? I think it is sufficient to say that abort will be called and the contract for aborting commits applies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, let's remove the last sentence.

@rdblue
Copy link
Contributor

rdblue commented Jan 31, 2018

+1

I'd rather not add features without a known use case, but this implementation looks good to me.

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86884 has finished for PR 20454 at commit 5c873a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 1, 2018

Test build #86914 has finished for PR 20454 at commit 4ae9b5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

cloud-fan commented Feb 1, 2018

adding a default method to a java interface is binary compatible, I'm merging this to master only, to follow @rxin 's suggestion about not adding new stuff to 2.3, thanks!

@asfgit asfgit closed this in ffbca84 Feb 1, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving commit message, driver can start processing messages(e.g. persist messages into files) before all the messages are collected.

The proposal to add a new API:
`add(WriterCommitMessage message)`:  Handles a commit message on receiving from a successful data writer.

This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.

There was another radical attempt in apache#20386.  This one should be more reasonable.

Unit test

Author: Wang Gengliang <[email protected]>

Closes apache#20454 from gengliangwang/write_api.

(cherry picked from commit 9907bcfa045f96fb23822dc10eb3a2a42a6832d4)

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving commit message, driver can start processing messages(e.g. persist messages into files) before all the messages are collected.

The proposal to add a new API:
`add(WriterCommitMessage message)`:  Handles a commit message on receiving from a successful data writer.

This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.

There was another radical attempt in apache#20386.  This one should be more reasonable.

Unit test

Author: Wang Gengliang <[email protected]>

Closes apache#20454 from gengliangwang/write_api.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019
The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving commit message, driver can start processing messages(e.g. persist messages into files) before all the messages are collected.

The proposal to add a new API:
`add(WriterCommitMessage message)`:  Handles a commit message on receiving from a successful data writer.

This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.

There was another radical attempt in apache#20386.  This one should be more reasonable.

Unit test

Author: Wang Gengliang <[email protected]>

Closes apache#20454 from gengliangwang/write_api.

(cherry picked from commit 9907bcfa045f96fb23822dc10eb3a2a42a6832d4)

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…nDataWriterCommit

The current DataSourceWriter API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving commit message, driver can start processing messages(e.g. persist messages into files) before all the messages are collected.

The proposal to add a new API:
`add(WriterCommitMessage message)`:  Handles a commit message on receiving from a successful data writer.

This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.

There was another radical attempt in apache#20386.  This one should be more reasonable.

Unit test

Author: Wang Gengliang <[email protected]>

Closes apache#20454 from gengliangwang/write_api.

RB=1824728
A=
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants