
[HUDI-5193] Enhancing spark-ds write tests for some of the core user flows #7179

Merged
xushiyan merged 6 commits into apache:master from nsivabalan:spark-ds-tests-enhance
May 29, 2023

Conversation


@nsivabalan nsivabalan commented Nov 10, 2022

Change Logs

We realized we don't have good test coverage for some of the core user flows with Spark datasource writes, so this PR enhances the tests to cover those scenarios.

Test coverage added for Spark datasource writes:

COW and MOR (each w/ and w/o metadata table)
    Partitioned (BLOOM, SIMPLE, GLOBAL_BLOOM) and non-partitioned (GLOBAL_BLOOM):
        Immutable data: pure bulk_insert row writing.
        Immutable data w/ file sizing: pure inserts.
        Initial bulk ingest, followed by updates.
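The flows above can be sketched as option maps for the Hudi datasource writer. This is a hypothetical illustration using standard Hudi config keys, not the exact code added by this PR:

```scala
// Bulk-insert with the row-writer path enabled (the "pure bulk_insert
// row writing" flow): COW table with a BLOOM index and metadata table on.
val bulkInsertOpts = Map(
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.datasource.write.row.writer.enable" -> "true",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.index.type" -> "BLOOM",
  "hoodie.metadata.enable" -> "true"
)

// Follow-up "updates" flow: same table, switching the operation to upsert
// (the row writer only applies to bulk_insert, so that flag is dropped).
val upsertOpts = bulkInsertOpts +
  ("hoodie.datasource.write.operation" -> "upsert") -
  "hoodie.datasource.write.row.writer.enable"

// A test would then write roughly like:
//   inputDf.write.format("hudi").options(bulkInsertOpts).mode("Overwrite").save(basePath)
//   updateDf.write.format("hudi").options(upsertOpts).mode("Append").save(basePath)
```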

Impact

Will help catch any bugs around core user flows upfront.

Risk level (write none, low, medium or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan changed the title [WIP] enhancing spark-ds tests [HUDI-5193] Enhancing spark-ds write tests for some of the core user flows Nov 10, 2022
@nsivabalan nsivabalan marked this pull request as ready for review November 10, 2022 20:11
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Nov 10, 2022
@xushiyan xushiyan self-assigned this Nov 13, 2022

jonvex commented Nov 16, 2022

def compareUpdateDfWithHudiDf(inputDf: Dataset[Row], hudiDf: Dataset[Row],
                              beforeDf: Dataset[Row], colsToCompare: String): Unit = {
    // Drop the Hudi metadata columns so only data columns are compared.
    val hudiWithoutMeta = hudiDf.drop(
      HoodieRecord.RECORD_KEY_METADATA_FIELD,
      HoodieRecord.PARTITION_PATH_METADATA_FIELD,
      HoodieRecord.COMMIT_SEQNO_METADATA_FIELD,
      HoodieRecord.COMMIT_TIME_METADATA_FIELD,
      HoodieRecord.FILENAME_METADATA_FIELD)
    // createOrReplaceTempView replaces the deprecated registerTempTable.
    hudiWithoutMeta.createOrReplaceTempView("hudiTbl")
    inputDf.createOrReplaceTempView("inputTbl")
    beforeDf.createOrReplaceTempView("beforeTbl")
    val hudiDfToCompare = spark.sql("select " + colsToCompare + " from hudiTbl")
    val inputDfToCompare = spark.sql("select " + colsToCompare + " from inputTbl")
    val beforeDfToCompare = spark.sql("select " + colsToCompare + " from beforeTbl")

    // Every updated input row must be present in the Hudi table...
    assertEquals(hudiDfToCompare.intersect(inputDfToCompare).count, inputDfToCompare.count)
    // ...and all remaining Hudi rows must come from the pre-update snapshot.
    assertEquals(hudiDfToCompare.except(inputDfToCompare).except(beforeDfToCompare).count, 0)
  }


jonvex commented Nov 16, 2022

In commonOpts you have

DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",

but then in options you set

(DataSourceWriteOptions.RECORDKEY_FIELD.key() -> recordKeys) +
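To illustrate the mismatch flagged here: in Scala, appending an entry to an immutable Map replaces any earlier binding for the same key, so the recordKeys value silently overrides the one from commonOpts. A plain-Scala sketch (the config string is what DataSourceWriteOptions.RECORDKEY_FIELD.key resolves to; no Spark needed):

```scala
// The key behind DataSourceWriteOptions.RECORDKEY_FIELD.
val recordKeyField = "hoodie.datasource.write.recordkey.field"

// commonOpts binds the record key to "_row_key"...
val commonOpts = Map(recordKeyField -> "_row_key")

// ...but the test later appends another binding for the same key,
// and the later binding wins in the combined options map.
val opts = commonOpts + (recordKeyField -> "recordKeys")
```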


jonvex commented Nov 16, 2022

I think doMORReadOptimizedQquery should call compareEntireInputDfWithHudiDf instead of compareUpdateDfWithHudiDf, but I'm less confident about this than my first two suggestions.


jonvex commented Nov 16, 2022

The incremental read query doesn't set .option(HoodieMetadataConfig.ENABLE.key, isMetadataEnabledOnRead)


jonvex commented Nov 16, 2022

I think it's supposed to be records4, not records3:

var inputDF4 = spark.read.json(spark.sparkContext.parallelize(records3, 2))


jonvex commented Nov 17, 2022

doMORReadOptimizedQquery has two Q's


jonvex commented Nov 18, 2022

Inconsistent use of DF and Df


jonvex commented Nov 21, 2022

Changes look good to me

@nsivabalan nsivabalan force-pushed the spark-ds-tests-enhance branch 2 times, most recently from c5c2e42 to d4e392c Compare November 23, 2022 00:18
@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope added test-coverage and removed release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@xushiyan xushiyan force-pushed the spark-ds-tests-enhance branch 2 times, most recently from 51623ae to a508f40 Compare May 28, 2023 10:49
@xushiyan xushiyan force-pushed the spark-ds-tests-enhance branch from a508f40 to c92b4ad Compare May 29, 2023 03:26
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


Labels

priority:critical (Production degraded; pipelines stalled) · release-0.14.0


6 participants