[HUDI-3719] High performance costs of AvroSerizlizer in DataSource wr… #5137
@xushiyan @YannByron @leesf @alexeykudinkin it's a serious bug
xushiyan
left a comment
LGTM. thanks for making the patch!
alexeykudinkin
left a comment
Thanks for fixing this, @xiarixiaoyao!
 * @return converter accepting Avro payload and transforming it into a Catalyst one (in the form of [[InternalRow]])
 */
def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] =
  record => sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
@xiarixiaoyao that's a very sneaky issue.
Please leave a note in a comment explaining what the issue was and how you've addressed it.
However, I'd suggest simply pulling the SparkAdapter call out of the closure and keeping the API of this method intact: we don't want to push casting (to InternalRow/GenericRecord) onto the users
def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] = {
  // Build the deserializer once, outside the closure, so it is not re-created for every record
  val deserializer = sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
  record => deserializer.deserialize(record).map(_.asInstanceOf[InternalRow])
}
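For anyone following along, here is a minimal, self-contained sketch of the pattern being discussed. It does not use Hudi or Spark: `ExpensiveDeserializer`, `constructed`, `perRecordConverter`, `hoistedConverter` and `countConstructions` are hypothetical names made up for illustration, and the counter only stands in for the real construction cost. It assumes, as the diff above suggests, that the original closure built a new deserializer for every record.

```scala
object ClosureHoisting {
  // Counts constructor calls; a stand-in for the real construction cost (assumption).
  var constructed: Int = 0

  // Hypothetical stand-in for an expensive-to-build deserializer; not a Hudi or Spark class.
  final class ExpensiveDeserializer(schema: String) {
    constructed += 1
    def deserialize(record: String): Option[String] = Some(s"$schema:$record")
  }

  // Problematic shape: the constructor call sits inside the lambda body,
  // so a new deserializer is built for every record.
  def perRecordConverter(schema: String): String => Option[String] =
    record => new ExpensiveDeserializer(schema).deserialize(record)

  // Suggested shape: build once when the converter is created;
  // the lambda only captures the already-built instance.
  def hoistedConverter(schema: String): String => Option[String] = {
    val deserializer = new ExpensiveDeserializer(schema)
    record => deserializer.deserialize(record)
  }

  // Applies a converter to n records and reports how many deserializers were built.
  def countConstructions(makeConverter: String => (String => Option[String]), n: Int): Int = {
    constructed = 0
    val convert = makeConverter("schema")
    (1 to n).foreach(i => convert(s"r$i"))
    constructed
  }
}
```

With this sketch, `countConstructions(perRecordConverter, 1000)` builds one deserializer per record, while `countConstructions(hoistedConverter, 1000)` builds a single one. Both converters keep the same function type, which is the point of keeping the method's API intact.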
fixed
@xiarixiaoyao also, please update the PR description so that whoever reviews this PR will not need to go to JIRA to understand what it is about
@xiarixiaoyao @xushiyan let's also think about how we can prevent such regressions in the future. Ideally, we should check in the test that you used to validate it as a smoke test (checking that it completes within, say, 10s). Alternatively, we can reshape and commit it as a benchmark (also committing its output as a reference) so that we can at least verify it manually by running it periodically and comparing against the baseline.
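A rough sketch of what such a time-budgeted smoke check could look like, using only the Scala standard library. `SmokeTiming`, `timed` and `assertWithin` are hypothetical helpers invented for this illustration, not part of Hudi or of any test framework, and a real test would run the actual write path inside the block.

```scala
object SmokeTiming {
  // Runs `body` once and returns its result together with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Runs `body` and throws if it took longer than the given budget.
  def assertWithin[A](budgetMillis: Long)(body: => A): A = {
    val (result, elapsed) = timed(body)
    if (elapsed > budgetMillis) {
      throw new AssertionError(s"took $elapsed ms, budget was $budgetMillis ms")
    }
    result
  }
}
```

Usage would be along the lines of `SmokeTiming.assertWithin(10000L) { runWorkload() }`, where `runWorkload` is a placeholder for the workload under test. A wall-clock budget like this is sensitive to machine load, so it only catches large regressions; the committed-benchmark alternative trades automation for a more careful comparison.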
d85c848 to dcec725
dcec725 to eaf08d6
@xiarixiaoyao thanks for fixing this! Can you please also add the test that you've used as a benchmark (based on JMH)?
I will add a benchmark to cover it, but not a JMH-based one.
LGTM
add avroSerDerBenchmark
@alexeykudinkin I think we should add more benchmarks for hot code paths to prevent regressions
Thanks for finding the regression and fixing it. Good job!
apache#5137)
* [HUDI-3719] High performance costs of AvroSerizlizer in DataSource writing
* add benchmark framework which modify from spark add avroSerDerBenchmark