[HUDI-3719] High performance costs of AvroSerizlizer in DataSource writing #5137

xiarixiaoyao · 2022-03-26T14:48:36Z

xiarixiaoyao · 2022-03-26T15:03:09Z

@xushiyan @YannByron @leesf @alexeykudinkin
Could you please help review this PR? Thanks.

It's a serious bug.
Before patch: 295553 ms
After patch: 5279 ms

    import org.apache.hudi.{AvroConversionUtils, HoodieSparkUtils}
    import org.apache.spark.sql.functions.lit

    // Build a 50M-row DataFrame with a few constant columns
    val dfx = spark.range(0, 50000000).toDF("id")
      .withColumn("c1", lit("dsfsdfsafsasdfa"))
      .withColumn("c2", lit(12.99d))
      .withColumn("c3", lit(1))

    val avroSchemax = AvroConversionUtils.convertStructTypeToAvroSchema(dfx.schema, "record", "my")
    val sparkSchema = dfx.schema
    spark.sparkContext.getConf.registerAvroSchemas(avroSchemax)

    val testRDD = HoodieSparkUtils.createRdd(dfx, "record", "my", Some(avroSchemax))

    // warm up
    dfx.count()
    // time the DataFrame-to-Avro conversion
    spark.time(testRDD.foreach(f => f))

xushiyan

LGTM. thanks for making the patch!

alexeykudinkin

Thanks for fixing this @xiarixiaoyao !

alexeykudinkin · 2022-03-26T16:46:23Z

hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala

   * @return converter accepting Avro payload and transforming it into a Catalyst one (in the form of [[InternalRow]])
   */
-  def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] =
-    record => sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)


@xiarixiaoyao that's a very sneaky issue.

Please leave a note in a comment explaining what the issue was and how you've addressed it.

However, I'd suggest just pulling SparkAdapter out of the closure and keeping the API of this method intact: we don't want to push casting (to InternalRow/GenericRecord) onto the users.

    def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] = {
      val deserializer = sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
      record => deserializer.deserialize(record).map(_.asInstanceOf[InternalRow])
    }
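To illustrate why hoisting matters, here is a minimal, self-contained sketch. It does not use Hudi or Spark: `FakeDeserializer`, `perRecordConverter`, `hoistedConverter` and `constructionsFor` are hypothetical names made up for this example, and the fake class only counts how often it is constructed. The assumption is that building the real deserializer is expensive, so constructing it once per record dominates the runtime.

```scala
object ClosureHoisting {
  // Counts constructions of the (hypothetical) expensive object.
  var constructions = 0

  // Stand-in for an expensive-to-build deserializer; not Hudi's actual class.
  final class FakeDeserializer(schema: String) {
    constructions += 1
    def deserialize(record: String): Option[String] = Some(s"$schema:$record")
  }

  // Anti-pattern: the deserializer is rebuilt inside the closure, once per record.
  def perRecordConverter(schema: String): String => Option[String] =
    record => new FakeDeserializer(schema).deserialize(record)

  // Hoisted: the deserializer is built once, outside the closure, and reused.
  def hoistedConverter(schema: String): String => Option[String] = {
    val deserializer = new FakeDeserializer(schema)
    record => deserializer.deserialize(record)
  }

  // Builds a converter, applies it to every record, and reports how many
  // deserializers were constructed along the way.
  def constructionsFor(makeConverter: String => (String => Option[String]),
                       records: Seq[String]): Int = {
    constructions = 0
    val convert = makeConverter("schema")
    records.foreach(convert)
    constructions
  }

  def main(args: Array[String]): Unit = {
    val records = Seq("a", "b", "c")
    println(constructionsFor(perRecordConverter, records)) // 3: one per record
    println(constructionsFor(hoistedConverter, records))   // 1: built once
  }
}
```

Both converters have the same function type, so callers cannot tell them apart by signature; only the construction count (or a profiler) reveals the difference.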

alexeykudinkin · 2022-03-26T16:57:01Z

@xiarixiaoyao also, please update the PR description so that whoever will be reviewing this PR will not need to go to JIRA to understand what it is about

alexeykudinkin · 2022-03-26T17:23:20Z

@xiarixiaoyao @xushiyan let's also think about how we can prevent such regressions in the future. Ideally, we should check in the test that you used to validate the fix as a smoke test (asserting that it completes within, say, 10s).

Alternatively, we can also reshape and commit it as a benchmark (also committing its output as a reference) so that we can at least verify it manually by running it periodically and comparing against the baseline.
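A time-bounded smoke check of the kind described above could look roughly like the following sketch. It is plain Scala with no test framework; `TimedSmokeCheck`, `timed` and `assertCompletesWithin` are hypothetical helper names, not part of Hudi or Spark, and the 10-second budget is only the figure floated in this thread.

```scala
import scala.concurrent.duration._

object TimedSmokeCheck {
  // Runs `body` once and returns its result with the elapsed wall-clock time.
  def timed[A](body: => A): (A, FiniteDuration) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start).nanos)
  }

  // Runs `body` and throws if it took longer than `budget`.
  def assertCompletesWithin[A](budget: FiniteDuration)(body: => A): A = {
    val (result, elapsed) = timed(body)
    if (elapsed > budget) {
      throw new IllegalStateException(
        s"took ${elapsed.toMillis} ms, budget was ${budget.toMillis} ms")
    }
    result
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload; a real smoke test would run the conversion here.
    val total = assertCompletesWithin(10.seconds)((1 to 1000).sum)
    println(total)
  }
}
```

A wall-clock bound like this is sensitive to the machine it runs on, so the budget would need generous headroom to avoid flaky CI runs; that sensitivity is one reason a manually compared benchmark baseline may be the more robust option.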

alexeykudinkin · 2022-03-27T04:05:41Z

@xiarixiaoyao thanks for fixing this!

Can you please also add the test that you've used as a benchmark (based on JMH)?

xiarixiaoyao · 2022-03-27T04:44:42Z

> @xiarixiaoyao thanks for fixing this!
>
> Can you please also add the test that you've used as a benchmark (based on JMH)?

I will add a benchmark to cover it, but not a JMH-based one.

YannByron · 2022-03-27T04:52:02Z

LGTM

add avroSerDerBenchmark

xiarixiaoyao · 2022-03-27T07:09:58Z

@alexeykudinkin
Added a Hudi benchmark framework adapted from Spark's (different Spark versions ship different benchmark frameworks, so we cannot reference Spark's directly).
Added a benchmark for Avro serialization/deserialization (avroSerDerBenchmark).

I think we should add more benchmarks for hot code paths to prevent regressions.

hudi-bot · 2022-03-27T09:06:31Z

CI report:

dcec725 UNKNOWN
88ccdb5 Azure: SUCCESS

nsivabalan · 2022-03-28T14:37:41Z

thanks for finding the regression and fixing it. good job!

apache#5137) * [HUDI-3719] High performance costs of AvroSerizlizer in DataSource writing * add benchmark framework which modify from spark add avroSerDerBenchmark

xushiyan approved these changes Mar 26, 2022

xushiyan added the priority:critical Production degraded; pipelines stalled label Mar 26, 2022

xushiyan linked an issue Mar 26, 2022 that may be closed by this pull request

[SUPPORT] High performance costs of AvroSerializer in Datasource writing #5107

Closed

alexeykudinkin reviewed Mar 26, 2022

xiarixiaoyao force-pushed the regression branch 2 times, most recently from d85c848 to dcec725 Compare March 27, 2022 03:16

[HUDI-3719] High performance costs of AvroSerizlizer in DataSource writing

eaf08d6

xiarixiaoyao force-pushed the regression branch from dcec725 to eaf08d6 Compare March 27, 2022 03:18

alexeykudinkin approved these changes Mar 27, 2022

add benchmark framework which modify from spark

88ccdb5

add avroSerDerBenchmark

xushiyan merged commit 9da2dd4 into apache:master Mar 27, 2022
