-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds
#4866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c104012 to
062da23
Compare
HoodieTestDataGenerator to provide for reproducible Builds
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Thanks for the improvement! I left a few moments.
| bytes[6] &= 0x0f; | ||
| bytes[6] |= 0x40; | ||
| bytes[8] &= 0x3f; | ||
| bytes[8] |= 0x80; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason for having this logic? trying to bound the value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some standard v4 bits clearing/setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this method deserves some usage guide in javadoc?
| public static final Schema FLATTENED_AVRO_SCHEMA = new Schema.Parser().parse(TRIP_FLATTENED_SCHEMA); | ||
|
|
||
| private static final Random RAND = new Random(46474747); | ||
| private final Random r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to rand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-)
Why do you think rand is better than r?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried to search the variable and see if anything is missed. r gives me hard time :) Besides, I usually use rand and try to avoid single character variable. More like a style thing, not strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough
| String fileId2 = UUID.randomUUID().toString(); | ||
| FileSystem fs = FSUtils.getFs(basePath(), hadoopConf()); | ||
| HoodieTestDataGenerator.writePartitionMetadata(fs, HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, tablePath); | ||
| HoodieTestDataGenerator.writePartitionMetadataDeprecated(fs, HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, tablePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these going to be cleaned up in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, we can let those die down naturally
| public static final TypeDescription ORC_TRIP_SCHEMA = AvroOrcUtils.createOrcSchema(new Schema.Parser().parse(TRIP_SCHEMA)); | ||
| public static final Schema FLATTENED_AVRO_SCHEMA = new Schema.Parser().parse(TRIP_FLATTENED_SCHEMA); | ||
|
|
||
| private static final Random RAND = new Random(46474747); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this always generates the same seq of "random" numbers before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
| } | ||
|
|
||
| public HoodieTestDataGenerator(long seed, String[] partitionPaths, Map<Integer, KeyPartition> keyPartitionMap) { | ||
| this.r = new Random(seed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the seed be printed in logs here, so that if CI runs fail, the seed can be used for reproducing locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you fix this somewhere else?
| () -> partitionPaths[RAND.nextInt(partitionPaths.length)], | ||
| () -> UUID.randomUUID().toString()); | ||
| () -> partitionPaths[r.nextInt(partitionPaths.length)], | ||
| () -> genPseudoRandomUUID(r).toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be fixed as well:
public List<HoodieRecord> generateInsertsForPartition(String instantTime, Integer n, String partition) {
return generateInsertsStream(instantTime, n, false, TRIP_EXAMPLE_SCHEMA, false, () -> partition, () -> UUID.randomUUID().toString()).collect(Collectors.toList());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Weirdly my text search skipped through these
| for (int i = 0; i < limit; i++) { | ||
| String partitionPath = partitionPaths[RAND.nextInt(partitionPaths.length)]; | ||
| String partitionPath = partitionPaths[r.nextInt(partitionPaths.length)]; | ||
| HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), partitionPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here for UUID.randomUUID().
| public List<GenericRecord> generateGenericRecords(int numRecords) { | ||
| List<GenericRecord> list = new ArrayList<>(); | ||
| IntStream.range(0, numRecords).forEach(i -> { | ||
| list.add(generateGenericRecord(UUID.randomUUID().toString(), "0", UUID.randomUUID().toString(), UUID.randomUUID() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here regarding UUID.randomUUID() if full reproducibility is the goal.
| rec.put("nation", ByteBuffer.wrap(bytes)); | ||
| long currentTimeMillis = System.currentTimeMillis(); | ||
| Date date = new Date(currentTimeMillis); | ||
| long randomMillis = genRandomTimeMillis(r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit for thought. Previously, each time generateGenericRecord() is called, the timestamp monotonically increases. If some tests depend on this assumption, the new time millis random gen should also honor that in a way, e.g., start with a random millis, and increment it by another positive random millis for each subsequent call, to guarantee the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is an assumption that tests should be able to make. At the end of the day there's no such contract provided by the Generator, and tests should not depend on such implementation detail.
f1eafce to
e6627d1
Compare
xushiyan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| long currentTimeMillis = System.currentTimeMillis(); | ||
| Date date = new Date(currentTimeMillis); | ||
| long randomMillis = genRandomTimeMillis(r); | ||
| Date date = new Date(randomMillis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: while you're on this, it may worth migrating away from Date to java.time APIs, for standardizing datetime usage in the codebase and be thread-safe where applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, no problem. Which one are you referring to? There's no such thing as DateTime in java.time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably LocalDate we need
| bytes[6] &= 0x0f; | ||
| bytes[6] |= 0x40; | ||
| bytes[8] &= 0x3f; | ||
| bytes[8] |= 0x80; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this method deserves some usage guide in javadoc?
2c85afe to
ff1ccbd
Compare
95c0c7b to
972b494
Compare
|
@hudi-bot run azure |
1 similar comment
|
@hudi-bot run azure |
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Tips
What is the purpose of the pull request
Refactoring HoodieTestDataGenerator to provide for reproducible Builds: currently,
HoodieTestDataGeneratorrelies on static state which make its state shared across all of the tests making data generation dependent on the order of execution.Instead we should properly abstract
HoodieTestDataGeneratorto hold all of the state w/in individual instances so that individual Tests can:UUID.randomUUID(),System.currentTimeMillis(), etc)Brief change log
See above
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.