Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Jan 20, 2023

Change Logs

Recently we have more flakiness in our CI runs. So, taking a stab at fixing some of the high frequent tests.

Tests that are fixed:
TestHoodieDeltaStreamerWithMultiWriter.* (all tests)
TestHoodieClientOnMergeOnReadStorage ( testReadingMORTableWithoutBaseFile, testCompactionOnMORTable,
testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

  1. we generate only 10 inserts in our tests and it does not guarantee we have records for all 3 partitions(HoodieTestDataGenerator).
  2. in MOR table tests, due to small file handling, we end up expanding existing file group to pack both updates and inserts and hence no log files were generated. but tests expects compaction to happen after 5 commits which fails the test in such cases.

Fixes:

  • HoodieTestDataGenerator was choosing random partition among list of partitions while generating insert records. Fixed that to do round robin. Also, bumped up the num of records inserted in some of the flaky tests to 100 from 10.
  • Fixed respective MOR tests to disable small file handling.

Impact

NA

Risk level

None

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Jan 20, 2023
@nsivabalan
Copy link
Contributor Author

@hudi-bot run azure

// round robin to ensure we generate inserts for all partition paths
String partitionToUse = partitionPaths[partitionIndex.get()];
partitionIndex.set((partitionIndex.get() + 1) % partitionPaths.length);
return partitionToUse;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so with this change we can guarantee all target partitions have records. then we don't need to bump 10 records to 100? so we can make it faster. We just need to make sure num records > num partitions here

Copy link
Contributor Author

@nsivabalan nsivabalan Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but I don't expect any savings from 100 to 10 records. Also most of our tests are doing 100 recs. so just to keep it in parity.
yes, we have only 3 partitions in TestDataGenerator. so, a minimum of 3 is required to ensure we generate for all 3 partitions.

Copy link
Member

@xushiyan xushiyan Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is minor: 1) from UT perspective, 100 records shouldn't be different from 10 records. the question is: if 10 works, why bump to 100? 2) we should ensure num records > num partitions without knowing there are always 3 partitions so some check arg will help prevent misuse leading to unexpected results

tableBasePath = basePath + "/testtable_" + tableType;
prepareInitialConfigs(fs(), basePath, "foo");
TypedProperties props = prepareMultiWriterProps(fs(), basePath, propsFilePath);
props.setProperty(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key(), "0");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be applicable to some cases in TestHoodieDeltaStreamer ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I did not want to boil the ocean just yet. so just focussing on most frequently failing tests. will file a jira to track other similar tests. https://issues.apache.org/jira/browse/HUDI-5595

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's land this without doing another round of CI. we should standardize data generation for testing once for all in some later PRs

@alexeykudinkin alexeykudinkin added priority:critical Production degraded; pipelines stalled and removed priority:blocker Production down; release blocker labels Jan 25, 2023
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor Author

CI failed due to flaky tests.

@nsivabalan nsivabalan merged commit e969a4c into apache:master Jan 27, 2023
yihua pushed a commit that referenced this pull request Jan 30, 2023
Recently we have more flakiness in our CI runs. So, taking a stab at fixing some of the high frequent tests.

Tests that are fixed:
TestHoodieClientOnMergeOnReadStorage ( testReadingMORTableWithoutBaseFile, testCompactionOnMORTable,
testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

we generate only 10 inserts in our tests and it does not guarantee we have records for all 3 partitions(HoodieTestDataGenerator).

Fixes:

HoodieTestDataGenerator was choosing random partition among list of partitions while generating insert records. Fixed that to do round robin. Also, bumped up the num of records inserted in some of the flaky tests to 100 from 10.
Fixed respective MOR tests to disable small file handling.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
Recently we have more flakiness in our CI runs. So, taking a stab at fixing some of the high frequent tests.

Tests that are fixed:
TestHoodieClientOnMergeOnReadStorage ( testReadingMORTableWithoutBaseFile, testCompactionOnMORTable,
testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

we generate only 10 inserts in our tests and it does not guarantee we have records for all 3 partitions(HoodieTestDataGenerator).

Fixes:

HoodieTestDataGenerator was choosing random partition among list of partitions while generating insert records. Fixed that to do round robin. Also, bumped up the num of records inserted in some of the flaky tests to 100 from 10.
Fixed respective MOR tests to disable small file handling.
@yihua yihua added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Feb 5, 2023
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
Recently we have more flakiness in our CI runs. So, taking a stab at fixing some of the high frequent tests.

Tests that are fixed:
TestHoodieClientOnMergeOnReadStorage ( testReadingMORTableWithoutBaseFile, testCompactionOnMORTable,
testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

we generate only 10 inserts in our tests and it does not guarantee we have records for all 3 partitions(HoodieTestDataGenerator).

Fixes:

HoodieTestDataGenerator was choosing random partition among list of partitions while generating insert records. Fixed that to do round robin. Also, bumped up the num of records inserted in some of the flaky tests to 100 from 10.
Fixed respective MOR tests to disable small file handling.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Recently we have more flakiness in our CI runs. So, taking a stab at fixing some of the high frequent tests.

Tests that are fixed:
TestHoodieClientOnMergeOnReadStorage ( testReadingMORTableWithoutBaseFile, testCompactionOnMORTable,
testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

we generate only 10 inserts in our tests and it does not guarantee we have records for all 3 partitions(HoodieTestDataGenerator).

Fixes:

HoodieTestDataGenerator was choosing random partition among list of partitions while generating insert records. Fixed that to do round robin. Also, bumped up the num of records inserted in some of the flaky tests to 100 from 10.
Fixed respective MOR tests to disable small file handling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants