Skip to content

[HUDI-2395] Rewrite metadata tests using HoodieTestTable#3595

Closed
nsivabalan wants to merge 1 commit intoapache:masterfrom
nsivabalan:metadataTestsRewrite
Closed

[HUDI-2395] Rewrite metadata tests using HoodieTestTable#3595
nsivabalan wants to merge 1 commit intoapache:masterfrom
nsivabalan:metadataTestsRewrite

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Sep 3, 2021

What is the purpose of the pull request

Adding tests to Metadata table based on HoodieTestTable. Objective is to make the tests lean and consistent. Especially contents of data files does not matter for metadata, we have an opportunity to make it simpler.

Brief change log

  • Added few building blocks to HoodieTestTable to assist in this testing. Especially metadata sync relies on commitMetadata and hence all such supporting blocks are added to HoodieTestTable.
  • Added tests covering bootstrap, regular writes, clean, compaction, rollback, inflight operation mid timeline, etc.
  • Tests based on HoodieTestTable tests just the metadata table and hence sync has to be called explicitly simulating actual sequence of sync calls. So, for now, leaving existing tests in TestHoodieBackedMetadata as is. Will think through more on this and come up with a strategy on how to cover those by tests.
  • As of now, focussed on COW. Yet to add log files related tests to MOR.
  • Will be adding more tests to TestHoodieBackedMetadata in the next few days. And eventual goal is to remove all SparkRddWriteClient based tests for metadata.

Verify this pull request

Change itself is just around tests.

  • Added tests to TestHoodieBackedMetadata

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Sep 3, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to Reviewer: This is an almost replica of existing ValidateMetadata except that this uses HoodieTestTable as source table for validation. As mentioned in the description, will be adding more tests and eventually will remove direct SparkRDDClient based tests.

@nsivabalan nsivabalan marked this pull request as ready for review September 3, 2021 15:24
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping the test shrinks in size, not expand? do you plan to do another pass to remove the redundant tests? any improvments in runtime.

In general, I want to make sure the new methods added to HoodieTestTable are in line with its design. @xushiyan any comments on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not belong here? in FileCreateUtils? its not really creating anything

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there exists some functions too like getXXX and deleteXXX. Maybe it's time to rename this to FileCRUDUtils?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be less verbose, like calling .compare on the getModificationTime()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this map needs to be made nicer to read?

@xushiyan
Copy link
Member

xushiyan commented Sep 7, 2021

In general, I want to make sure the new methods added to HoodieTestTable are in line with its design. @xushiyan any comments on that?

@vinothchandar @nsivabalan At the beginning we were thinking make HoodieTestTable provide concise APIs doing basic table operations with empty files and HoodieWriteableTestTable writing actual data. FileCreateUtils is used by HoodieTestTable at lower level interfacing with the actual files. When more and more APIs are introduced, we have to re-write many logics and re-implement action-to-file-change translations. This resulted in the testutils having cumbersome code and learning hurdle, also error-prone. I felt some re-design is needed.

Some thoughts on the re-design: developers are familiar with HoodieXXXClient so we would need HoodieTestTable make use of some client with similar public APIs, say HoodieTestTableClient.

  • HoodieTestTable owns a HoodieTestTableClient and exposes the client's APIs for dev to prep their own data
  • HoodieTestTableClient should manipulate timeline and metadata as normal
  • For UTs don't need real data, the client writes out empty log and base files.
  • For FTs, the test table and client should be aware of the EngineContext set by XXXFunctionalTestHarness and act like the actual clients
  • Eventually dev only learn APIs provided by HoodieTestTableClient to prep their tests. It can include high-level APIs like .write100RecordsIn3Partitions() as such.

@vinothchandar vinothchandar self-assigned this Sep 7, 2021
Comment on lines +1145 to +1146
tableType = HoodieTableType.COPY_ON_WRITE;
init(tableType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally all test cases should parameterize with table type

  @ParameterizedTest
  @EnumSource(HoodieTableType.class)

testBootstrap(testTable,true);
}

private void testBootstrap(HoodieTestTable testTable, boolean addRollback) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've seen this pattern in many classes: create a private method doing all the test steps with a variable control different scenarios while different testing methods invoke it with the variable. We should start avoiding this, for reasons

  • control flow is an anti-pattern in test code. Each testcase just follows a simple flow: prep -> execute -> verify. Any varying part can be moved to a different test method to explicitly show a different scenario
  • I can see the use of control flow is mainly to reuse some code in the original flow. It's a sign that the original flow's code itself is not concise enough to be repeated. I think repeating some code across testcase is acceptable and even preferred: testcases should be isolated and people wants to read the flow as is without jumping back and forth btw methods. Repeating concise test prep and verification logic makes the scenario more readable and manageable in 1 place. This requires the test utils classes properly refactored and doing heavy liftings.

private void testBootstrap(HoodieTestTable testTable, boolean addRollback) throws Exception {

// bootstrap w/ 3 or 5 commits
testTable.doWriteOperation(testTable, "001", WriteOperationType.INSERT, Arrays.asList("p1", "p2"), Arrays.asList("p1", "p2"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try making use of varargs instead of List for test util APIs. varargs gives more flexibility and does not require caller to build a list (less code)

Comment on lines +1207 to +1212
/**
* 1. Enable metadata to sync and validate.
* 2. Disable metadata and add few writes to table.
* 3. Enable back again to sync and validate.
* @throws Exception
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@throws Exception looks redundant here. most of the time we just let exception throw and investigate the failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if test logic is encapsulate in well-design util APIs, we may not need extra javadoc to explain the flow. Some inline comments might still be helpful but ideally code itself should be able to explain it pretty well

assertEquals(fsStatuses.length, partitionToFilesMap.get(basePath + "/" + partition).length);

// File sizes should be valid
Arrays.stream(metaStatuses).forEach(s -> assertTrue(s.getLen() > 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should prefer for-loop over lambda in test code when there is exception to avoid try-catch block. Just declare exception all the way up we can anyway capture it when test failed.

}

public HoodieCommitMetadata createCommitMetadata(WriteOperationType operationType, String commitTime,
Map<String, List<Pair<String, Integer>>> partitionToFileIdMap) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should try encapsulate data structure like partitionToFileIdMap within HoodieTestState and make it invisible to users. It's not easy to grasp and keep recalling what info is kept in the Map. And more friction of using it in an API

return testTable.addCompaction(commitTime, commitMetadata);
}

public Pair<HoodieCommitMetadata, PartitionFileInfoMap> doWriteOperation(HoodieTestTable testTable, String commitTime, WriteOperationType operationType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an instance method, it does not need user to pass in a testTable. Unless you want this to be static?

import java.util.List;
import java.util.Map;

public class PartitionDeleteFileList {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, we can start creating a HoodieTestState and encapsulate it there.

import java.util.Map;
import java.util.UUID;

public class PartitionFileInfoMap {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

public Map<String, List<Pair<String, Integer>>> getPartitionToFileIdMap(String commitTime) {
return this.partitionToFileIdMap.get(commitTime);
}
} No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should fix the IDE setting to auto fix EOL problem

@codope codope mentioned this pull request Sep 21, 2021
5 tasks
@nsivabalan
Copy link
Contributor Author

Closing in favor of #3695

@nsivabalan nsivabalan closed this Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants