[HUDI-2395] Rewrite metadata tests using HoodieTestTable #3595
nsivabalan wants to merge 1 commit into apache:master
0183d68 to 031fe67
Note to Reviewer: This is almost a replica of the existing ValidateMetadata, except that it uses HoodieTestTable as the source table for validation. As mentioned in the description, I will add more tests and eventually remove the direct SparkRDDClient-based tests.
vinothchandar left a comment
I was hoping the test would shrink in size, not expand. Do you plan to do another pass to remove the redundant tests? Any improvements in runtime?
In general, I want to make sure the new methods added to HoodieTestTable are in line with its design. @xushiyan any comments on that?
this does not belong here? in FileCreateUtils? it's not really creating anything
there are already some functions like getXXX and deleteXXX too. Maybe it's time to rename this to FileCRUDUtils?
can this be less verbose, like calling .compare on the getModificationTime()?
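The code under review is not shown in this conversation, so the following is only a sketch of the kind of simplification the comment seems to suggest. `FileInfoSketch` is a hypothetical stand-in for whatever object exposes `getModificationTime()`, and the "verbose" comparator is an assumed example of hand-written branching, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a file status object; only the fields used here are modeled.
final class FileInfoSketch {
  private final String name;
  private final long modificationTime;

  FileInfoSketch(String name, long modificationTime) {
    this.name = name;
    this.modificationTime = modificationTime;
  }

  String getName() {
    return name;
  }

  long getModificationTime() {
    return modificationTime;
  }
}

final class ModificationTimeOrder {
  // Verbose form (assumed): explicit branching on the two timestamps.
  static final Comparator<FileInfoSketch> VERBOSE = (a, b) -> {
    if (a.getModificationTime() < b.getModificationTime()) {
      return -1;
    }
    if (a.getModificationTime() > b.getModificationTime()) {
      return 1;
    }
    return 0;
  };

  // Compact form: a key extractor, which compares the long values for us.
  static final Comparator<FileInfoSketch> COMPACT =
      Comparator.comparingLong(FileInfoSketch::getModificationTime);

  // Returns the file names ordered by the given comparator, leaving the input untouched.
  static List<String> sortedNames(List<FileInfoSketch> files, Comparator<FileInfoSketch> order) {
    List<FileInfoSketch> copy = new ArrayList<>(files);
    copy.sort(order);
    List<String> names = new ArrayList<>();
    for (FileInfoSketch file : copy) {
      names.add(file.getName());
    }
    return names;
  }
}
```

Both comparators produce the same ordering; the compact one simply leaves less room for a sign mistake.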
this map needs to be made nicer to read?
031fe67 to 37a2c3e
@vinothchandar @nsivabalan At the beginning we were thinking make Some thoughts on the re-design: developers are familiar with HoodieXXXClient so we would need
    tableType = HoodieTableType.COPY_ON_WRITE;
    init(tableType);
ideally all test cases should be parameterized with table type
@ParameterizedTest
@EnumSource(HoodieTableType.class)
      testBootstrap(testTable,true);
    }

    private void testBootstrap(HoodieTestTable testTable, boolean addRollback) throws Exception {
I've seen this pattern in many classes: a private method performs all the test steps, with a variable controlling the different scenarios, and several test methods invoke it with different values. We should start avoiding this, for these reasons:
- control flow is an anti-pattern in test code. Each test case should follow a simple flow: prep -> execute -> verify. Any varying part can be moved to a different test method to explicitly show a different scenario.
- I can see that control flow is used mainly to reuse code from the original flow. That is a sign that the original flow's code is not concise enough to be repeated. I think repeating some code across test cases is acceptable and even preferred: test cases should be isolated, and people want to read the flow as is without jumping back and forth between methods. Repeating concise test prep and verification logic makes the scenario more readable and manageable in one place. This requires the test util classes to be properly refactored so that they do the heavy lifting.
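A minimal sketch of the alternative described in the two points above: one test method per scenario, each reading straight through prep, execute, verify, with no boolean flag. `TimelineSketch` and its methods are hypothetical names for illustration only; they are not HoodieTestTable APIs, and the assertions use a plain helper instead of a test framework so the block stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for a test table; it only records instant names.
final class TimelineSketch {
  private final List<String> instants = new ArrayList<>();

  TimelineSketch addCommit(String time) {
    instants.add(time + ".commit");
    return this;
  }

  TimelineSketch addRollback(String time) {
    instants.add(time + ".rollback");
    return this;
  }

  List<String> instants() {
    return new ArrayList<>(instants);
  }
}

final class ExplicitScenarioTests {
  // Scenario 1: commits only.
  static void bootstrapWithCommitsOnly() {
    // prep
    TimelineSketch table = new TimelineSketch().addCommit("001").addCommit("002");
    // execute
    List<String> actual = table.instants();
    // verify
    check(actual.equals(Arrays.asList("001.commit", "002.commit")), "commits only");
  }

  // Scenario 2: commits followed by a rollback. The prep is repeated on purpose
  // so the scenario can be read top to bottom without a flag.
  static void bootstrapWithRollback() {
    // prep
    TimelineSketch table = new TimelineSketch().addCommit("001").addCommit("002").addRollback("003");
    // execute
    List<String> actual = table.instants();
    // verify
    check(actual.equals(Arrays.asList("001.commit", "002.commit", "003.rollback")), "with rollback");
  }

  private static void check(boolean condition, String scenario) {
    if (!condition) {
      throw new AssertionError("unexpected timeline for scenario: " + scenario);
    }
  }
}
```

The repetition stays cheap only because the fluent helper keeps the prep to a single line, which is the refactoring the comment asks of the test utils.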
    private void testBootstrap(HoodieTestTable testTable, boolean addRollback) throws Exception {

      // bootstrap w/ 3 or 5 commits
      testTable.doWriteOperation(testTable, "001", WriteOperationType.INSERT, Arrays.asList("p1", "p2"), Arrays.asList("p1", "p2"),
try making use of varargs instead of List for test util APIs. varargs gives more flexibility and does not require caller to build a list (less code)
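To illustrate the difference at the call site, here is a small sketch. The method names are hypothetical and unrelated to the actual HoodieTestTable signatures.

```java
import java.util.Arrays;
import java.util.List;

final class VarargsSketch {
  // List-based signature: every caller has to build a list first.
  static int countPartitionsFromList(List<String> partitions) {
    return partitions.size();
  }

  // Varargs signature: callers pass values directly; an existing String[] also works.
  static int countPartitions(String... partitions) {
    return countPartitionsFromList(Arrays.asList(partitions));
  }
}
```

With the list form a caller writes `countPartitionsFromList(Arrays.asList("p1", "p2"))`; with varargs the same call is `countPartitions("p1", "p2")`. One constraint to keep in mind is that Java allows only one varargs parameter per method and it must come last, so a method taking two partition lists cannot turn both into varargs.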
    /**
     * 1. Enable metadata to sync and validate.
     * 2. Disable metadata and add few writes to table.
     * 3. Enable back again to sync and validate.
     * @throws Exception
     */
@throws Exception looks redundant here. most of the time we just let exception throw and investigate the failure.
if test logic is encapsulated in well-designed util APIs, we may not need extra javadoc to explain the flow. Some inline comments might still be helpful, but ideally the code itself should explain it pretty well
    assertEquals(fsStatuses.length, partitionToFilesMap.get(basePath + "/" + partition).length);

    // File sizes should be valid
    Arrays.stream(metaStatuses).forEach(s -> assertTrue(s.getLen() > 0));
we should prefer a for-loop over a lambda in test code when an exception can be thrown, to avoid a try-catch block. Just declare the exception all the way up; we can capture it anyway when the test fails.
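The `getLen()` call quoted above does not itself throw a checked exception, so the sketch below uses `java.nio.file.Files.size`, which throws `IOException`, as an assumed stand-in to show where the preference matters. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LoopOverLambdaSketch {
  // Files.size throws the checked IOException. Inside a forEach lambda that would
  // force a try-catch, because the lambda's functional interface declares no
  // checked exceptions. A plain for-loop lets the exception propagate to the
  // caller, which simply declares it.
  static void assertAllNonEmpty(List<Path> files) throws IOException {
    for (Path file : files) {
      if (Files.size(file) <= 0) {
        throw new AssertionError("empty file: " + file);
      }
    }
  }
}
```

A test method that calls this helper only needs `throws Exception` in its signature; a failure surfaces with the original stack trace instead of a wrapped runtime exception.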
    }

    public HoodieCommitMetadata createCommitMetadata(WriteOperationType operationType, String commitTime,
        Map<String, List<Pair<String, Integer>>> partitionToFileIdMap) {
should try to encapsulate data structures like partitionToFileIdMap within HoodieTestState and make them invisible to users. It's not easy to grasp, or to keep recalling, what info is kept in the Map, and exposing it adds friction to using the API
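One way to read this suggestion is sketched below. `TestStateSketch` is a hypothetical class, not the HoodieTestState discussed in the review (which does not exist in this PR), and the sketch assumes the `Integer` in each pair is a file length, which the conversation does not confirm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the nested map is private, and callers use named methods.
final class TestStateSketch {
  // partition -> (fileId -> file length); assumed meaning, hidden from callers.
  private final Map<String, Map<String, Integer>> filesByPartition = new LinkedHashMap<>();

  TestStateSketch addFile(String partition, String fileId, int length) {
    filesByPartition.computeIfAbsent(partition, p -> new LinkedHashMap<>()).put(fileId, length);
    return this;
  }

  // File IDs in insertion order; empty for an unknown partition.
  List<String> fileIds(String partition) {
    return new ArrayList<>(
        filesByPartition.getOrDefault(partition, Collections.emptyMap()).keySet());
  }

  int fileLength(String partition, String fileId) {
    Integer length =
        filesByPartition.getOrDefault(partition, Collections.emptyMap()).get(fileId);
    if (length == null) {
      throw new IllegalArgumentException("unknown file " + fileId + " in partition " + partition);
    }
    return length;
  }
}
```

Method names such as `fileIds` and `fileLength` document what the structure holds, so a caller never has to remember what the second element of a `Pair<String, Integer>` means.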
      return testTable.addCompaction(commitTime, commitMetadata);
    }

    public Pair<HoodieCommitMetadata, PartitionFileInfoMap> doWriteOperation(HoodieTestTable testTable, String commitTime, WriteOperationType operationType,
this is an instance method, it does not need user to pass in a testTable. Unless you want this to be static?
    import java.util.List;
    import java.util.Map;

    public class PartitionDeleteFileList {
as discussed, we can start creating a HoodieTestState and encapsulate it there.
    import java.util.Map;
    import java.util.UUID;

    public class PartitionFileInfoMap {

      public Map<String, List<Pair<String, Integer>>> getPartitionToFileIdMap(String commitTime) {
        return this.partitionToFileIdMap.get(commitTime);
      }
    }
    No newline at end of file
should fix the IDE setting to auto fix EOL problem
Closing in favor of #3695
What is the purpose of the pull request
Adds tests for the metadata table based on HoodieTestTable. The objective is to make the tests lean and consistent. Since the contents of data files do not matter for metadata, we have an opportunity to make the tests simpler.
Brief change log
Verify this pull request
Change itself is just around tests.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.