-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5023] Switching default Write Executor type to SIMPLE
#7476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@hudi-bot run azure |
SIMPLESIMPLE
|
@hudi-bot run azure |
| public Builder withWriteBufferSize(int size) { | ||
| writeConfig.setValue(WRITE_DISRUPTOR_BUFFER_SIZE, String.valueOf(size)); | ||
| writeConfig.setValue(WRITE_EXECUTOR_DISRUPTOR_BUFFER_SIZE, String.valueOf(size)); | ||
| return this; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If this is not intended to use as a public API, could you remove this and only keep Builder#withWriteExecutorDisruptorWriteBufferSize which has the same functionality?
| } | ||
|
|
||
| @VisibleForTesting | ||
| public HoodieWriteConfig build(boolean shouldValidate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this method use the default scope/visibility instead of public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is that it's being used in tests in different packages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. We should think about how to hide this, so it is not used by users accidentally.
| private final String instantTime = HoodieActiveTimeline.createNewInstantTime(); | ||
|
|
||
|
|
||
| private final HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (not required in this PR) the tests using different types of write executors should be generalized in a base class, like TestWriteMarkersBase, instead of duplicating the code. This can be refactored later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Created a task to follow-up HUDI-5622
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor(hoodieRecords.iterator(), consumer, Function.identity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still use getTransformer instead of Function.identity() for the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't need to clone in that case
|
@alexeykudinkin are we claiming that there is no downside to this? esp on large file size? does this still provide same performance. Love to understand benchmarks |
4c401fc to
5935539
Compare
5935539 to
9956563
Compare
| .withDocumentation("Size of in-memory buffer used for parallelizing network reads and lake storage writes."); | ||
|
|
||
| public static final ConfigProperty<String> WRITE_DISRUPTOR_BUFFER_SIZE = ConfigProperty | ||
| public static final ConfigProperty<String> WRITE_EXECUTOR_DISRUPTOR_BUFFER_SIZE = ConfigProperty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Let's take this in a separate PR) What is the unit of this config (B, KB, or MB)? We should mention the unit in the config naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: will follow-up w/ separate PR addressing this
| } | ||
|
|
||
| @VisibleForTesting | ||
| public HoodieWriteConfig build(boolean shouldValidate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. We should think about how to hide this, so it is not used by users accidentally.
| exec = new DisruptorExecutor(hoodieWriteConfig.getDisruptorWriteBufferSize(), hoodieRecords.iterator(), consumer, | ||
| getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA), Option.of(WaitStrategyFactory.DEFAULT_STRATEGY), getPreExecuteRunnable()); | ||
| exec = new DisruptorExecutor<>(writeConfig.getWriteExecutorDisruptorWriteBufferSize(), hoodieRecords.iterator(), consumer, | ||
| Function.identity(), WaitStrategyFactory.DEFAULT_STRATEGY, getPreExecuteRunnable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also use getTransformer(HoodieTestDataGenerator.AVRO_SCHEMA, writeConfig)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason to over-complicate this test -- it's goal is to test the Executor not how it's being used
| getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA), Option.of(WaitStrategyFactory.DEFAULT_STRATEGY), getPreExecuteRunnable()); | ||
| DisruptorExecutor<HoodieRecord, HoodieRecord, Integer> | ||
| executor = new DisruptorExecutor<>(1024, hoodieRecords.iterator(), consumer, | ||
| Function.identity(), WaitStrategyFactory.DEFAULT_STRATEGY, getPreExecuteRunnable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor<>(hoodieRecords.iterator(), consumer, Function.identity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here for getTransformer. The production code passes the transfer in by calling getTransformer so let's follow the same in the test code.
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor<>(hoodieRecords.iterator(), consumer, Function.identity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here
| SimpleHoodieExecutor<HoodieRecord, Tuple2<HoodieRecord, Option<IndexedRecord>>, Integer> exec = | ||
| new SimpleHoodieExecutor(iterator, consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| SimpleExecutor<HoodieRecord, HoodieRecord, Integer> exec = | ||
| new SimpleExecutor<>(iterator, consumer, Function.identity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here
|
@alexeykudinkin let's make sure we cover different scenarios (including the different base file sizes) in the benchmarking, and that there is no regression in common use cases before landing this PR. |
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
…#7476) This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
…#7476) This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
Change Logs
This change switches default Write Executor to be
SIMPLEie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one).This should considerably trim down on
Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
Impact
Users w/ upsert-/insert-heavy payloads should see considerable boost in writing performance.
Risk level (write none, low medium or high below)
Low
Documentation Update
N/A
Contributor's checklist