@alexeykudinkin
Contributor

Tips

What is the purpose of the pull request

Refactoring HoodieTestDataGenerator to provide for reproducible builds: currently, HoodieTestDataGenerator relies on static state, which is shared across all of the tests and makes data generation dependent on the order of execution.

Instead we should properly abstract HoodieTestDataGenerator to hold all of the state w/in individual instances so that individual Tests can:

  1. Create their own isolated instance (which won't be affected by other Tests)
  2. Pass "seed" value to DataGenerator to init its PRNG w/ it, so that it always produces the same (pseudo-)random sequence (for a given seed)
  3. Be certain that all of the data produced by DataGenerator will be 100% reproducible w/ the same seed (meaning that all of the DataGenerator operations w/in it only rely on such internal PRNG and don't rely on any external sources, such as UUID.randomUUID(), System.currentTimeMillis(), etc)
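
The three points above can be illustrated with a minimal sketch. It is not the actual HoodieTestDataGenerator API: the class name `SeededGeneratorSketch`, the method `generateKeys`, and the partition values are hypothetical and only show a per-instance PRNG being the single source of randomness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a test data generator; names are illustrative only.
final class SeededGeneratorSketch {
  private final Random rand; // per-instance PRNG, no static state
  private final String[] partitionPaths;

  SeededGeneratorSketch(long seed, String[] partitionPaths) {
    this.rand = new Random(seed);
    this.partitionPaths = partitionPaths.clone();
  }

  // Every random choice is drawn from the instance PRNG only
  // (no UUID.randomUUID(), no System.currentTimeMillis()).
  List<String> generateKeys(int n) {
    List<String> keys = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      String partition = partitionPaths[rand.nextInt(partitionPaths.length)];
      keys.add(partition + "/" + Long.toHexString(rand.nextLong()));
    }
    return keys;
  }

  public static void main(String[] args) {
    String[] partitions = {"p0", "p1", "p2"};
    SeededGeneratorSketch first = new SeededGeneratorSketch(42L, partitions);
    SeededGeneratorSketch second = new SeededGeneratorSketch(42L, partitions);
    // Same seed, separate instances: both produce the same keys,
    // no matter which other generators ran before them.
    System.out.println(first.generateKeys(3).equals(second.generateKeys(3)));
  }
}
```

Because each instance owns its `Random`, one test consuming values cannot shift the sequence seen by another test.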

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds Feb 21, 2022
Contributor

@yihua yihua left a comment

Overall LGTM. Thanks for the improvement! I left a few comments.

Comment on lines +899 to +904
bytes[6] &= 0x0f;
bytes[6] |= 0x40;
bytes[8] &= 0x3f;
bytes[8] |= 0x80;
Contributor

what's the reason for having this logic? trying to bound the value?

Contributor Author

This is the standard UUID v4 bit clearing/setting: it sets the version nibble to 4 and the variant bits to the IETF (RFC 4122) variant.
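
For context, here is a hedged sketch of how a seeded v4 UUID could be built around those four bit operations. Only the four masking lines and the method name `genPseudoRandomUUID` appear in this PR; the rest of the body and the wrapper class `PseudoRandomUuidSketch` are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.UUID;

// Illustrative wrapper class; not part of the PR.
final class PseudoRandomUuidSketch {
  // Builds a version-4 UUID whose random bits come from the supplied PRNG.
  static UUID genPseudoRandomUUID(Random rand) {
    byte[] bytes = new byte[16];
    rand.nextBytes(bytes);
    bytes[6] &= 0x0f; // clear the 4 version bits
    bytes[6] |= 0x40; // set version to 4 (random-based)
    bytes[8] &= 0x3f; // clear the 2 variant bits
    bytes[8] |= 0x80; // set variant to IETF (RFC 4122)
    ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
    return new UUID(buf.getLong(), buf.getLong());
  }

  public static void main(String[] args) {
    UUID a = genPseudoRandomUUID(new Random(42L));
    UUID b = genPseudoRandomUUID(new Random(42L));
    // Same seed gives the same UUID, and it still reports version 4.
    System.out.println(a.equals(b) + " version=" + a.version());
  }
}
```

Unlike `UUID.randomUUID()`, which draws from a `SecureRandom`, the result here is fully determined by the seed of the `Random` passed in.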

Member

maybe this method deserves some usage guide in javadoc?

public static final Schema FLATTENED_AVRO_SCHEMA = new Schema.Parser().parse(TRIP_FLATTENED_SCHEMA);

- private static final Random RAND = new Random(46474747);
+ private final Random r;
Contributor

nit: rename to rand?

Contributor Author

:-)

Why do you think rand is better than r?

Contributor

I actually tried to search for the variable to see if anything was missed, and r gave me a hard time :) Besides, I usually use rand and try to avoid single-character variable names. It's more of a style thing, not a strong opinion.

Contributor Author

Fair enough

String fileId2 = UUID.randomUUID().toString();
FileSystem fs = FSUtils.getFs(basePath(), hadoopConf());
HoodieTestDataGenerator.writePartitionMetadata(fs, HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, tablePath);
HoodieTestDataGenerator.writePartitionMetadataDeprecated(fs, HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, tablePath);
Contributor

Are these going to be cleaned up in a separate PR?

Contributor Author

Nah, we can let those die down naturally

public static final TypeDescription ORC_TRIP_SCHEMA = AvroOrcUtils.createOrcSchema(new Schema.Parser().parse(TRIP_SCHEMA));
public static final Schema FLATTENED_AVRO_SCHEMA = new Schema.Parser().parse(TRIP_FLATTENED_SCHEMA);

private static final Random RAND = new Random(46474747);
Contributor

Looks like this always generated the same sequence of "random" numbers before?

Contributor Author

Correct
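
A fixed seed alone does not make individual tests reproducible when the PRNG is shared. The sketch below is an illustration only: the class `SharedSeedSketch` and its method are hypothetical, and only the seed value comes from the diff. It simulates two tests drawing from one shared `Random` and shows that what one test sees depends on whether the other ran first.

```java
import java.util.Random;

// Hypothetical simulation of two tests sharing one PRNG.
final class SharedSeedSketch {
  // Returns the value "test A" draws, depending on whether it runs
  // before or after "test B" consumes a value from the shared PRNG.
  static int valueSeenByTestA(boolean testARunsFirst) {
    Random shared = new Random(46474747L); // same fixed seed as the old static RAND
    if (!testARunsFirst) {
      shared.nextInt(); // test B consumes the first value
    }
    return shared.nextInt();
  }

  public static void main(String[] args) {
    // The overall sequence is fixed, but test A gets a different
    // element of it depending on execution order.
    System.out.println(valueSeenByTestA(true));
    System.out.println(valueSeenByTestA(false));
  }
}
```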

}

public HoodieTestDataGenerator(long seed, String[] partitionPaths, Map<Integer, KeyPartition> keyPartitionMap) {
this.r = new Random(seed);
Contributor

Should the seed be printed in logs here, so that if CI runs fail, the seed can be used for reproducing locally?

Contributor Author

Good call!

Contributor

Did you fix this somewhere else?
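
The seed-logging suggestion in this thread might look roughly like the sketch below. It is an assumption, not the merged code: the class `SeedLoggingSketch`, the accessor `getSeed`, and the log message are hypothetical, and `java.util.logging` is used only to keep the example self-contained (the project's own logger would be the natural choice).

```java
import java.util.Random;
import java.util.logging.Logger;

// Hypothetical generator that records and logs the seed it was built with.
final class SeedLoggingSketch {
  private static final Logger LOG = Logger.getLogger(SeedLoggingSketch.class.getName());

  private final long seed;
  private final Random rand;

  SeedLoggingSketch(long seed) {
    this.seed = seed;
    this.rand = new Random(seed);
    // Logged once per instance so a failing CI run can be replayed locally.
    LOG.info("Test data generator initialized with seed: " + seed);
  }

  long getSeed() {
    return seed;
  }

  int nextInt(int bound) {
    return rand.nextInt(bound);
  }

  public static void main(String[] args) {
    SeedLoggingSketch gen = new SeedLoggingSketch(12345L);
    System.out.println(gen.nextInt(100));
  }
}
```

Keeping the seed in a field as well as in the log lets a test include it in an assertion message.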

- () -> partitionPaths[RAND.nextInt(partitionPaths.length)],
- () -> UUID.randomUUID().toString());
+ () -> partitionPaths[r.nextInt(partitionPaths.length)],
+ () -> genPseudoRandomUUID(r).toString());
Contributor

Should this be fixed as well:

public List<HoodieRecord> generateInsertsForPartition(String instantTime, Integer n, String partition) {
  return generateInsertsStream(instantTime, n, false, TRIP_EXAMPLE_SCHEMA, false, () -> partition, () -> UUID.randomUUID().toString()).collect(Collectors.toList());
}

Contributor Author

Good catch! Weirdly, my text search skipped over these

for (int i = 0; i < limit; i++) {
- String partitionPath = partitionPaths[RAND.nextInt(partitionPaths.length)];
+ String partitionPath = partitionPaths[r.nextInt(partitionPaths.length)];
HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), partitionPath);
Contributor

Same here for UUID.randomUUID().

public List<GenericRecord> generateGenericRecords(int numRecords) {
List<GenericRecord> list = new ArrayList<>();
IntStream.range(0, numRecords).forEach(i -> {
list.add(generateGenericRecord(UUID.randomUUID().toString(), "0", UUID.randomUUID().toString(), UUID.randomUUID()
Contributor

same here regarding UUID.randomUUID() if full reproducibility is the goal.

rec.put("nation", ByteBuffer.wrap(bytes));
- long currentTimeMillis = System.currentTimeMillis();
- Date date = new Date(currentTimeMillis);
+ long randomMillis = genRandomTimeMillis(r);
Contributor

One nit for thought. Previously, each time generateGenericRecord() was called, the timestamp increased monotonically. If some tests depend on this assumption, the new random time-millis generation should also honor it in some way, e.g., start with a random millis value and increment it by another positive random millis value on each subsequent call, to guarantee the order.

Contributor Author

I don't think this is an assumption that tests should be able to make. At the end of the day there's no such contract provided by the Generator, and tests should not depend on such implementation detail.
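
For reference, the reviewer's alternative (a random start plus strictly positive random increments) could be sketched as follows. This illustrates the suggestion only; it is not something the PR adopted. The class `MonotonicMillisSketch` and both bounds are hypothetical.

```java
import java.util.Random;

// Hypothetical generator of strictly increasing, seed-determined timestamps.
final class MonotonicMillisSketch {
  private static final long MAX_START_MILLIS = 1_600_000_000_000L; // arbitrary illustrative bound
  private static final int MAX_STEP_MILLIS = 1_000;                // arbitrary illustrative bound

  private final Random rand;
  private long lastMillis;

  MonotonicMillisSketch(long seed) {
    this.rand = new Random(seed);
    // Random starting point in [0, MAX_START_MILLIS).
    this.lastMillis = (long) (rand.nextDouble() * MAX_START_MILLIS);
  }

  // Each call advances by a step in [1, MAX_STEP_MILLIS], so values never repeat or decrease.
  long nextMillis() {
    lastMillis += 1 + rand.nextInt(MAX_STEP_MILLIS);
    return lastMillis;
  }

  public static void main(String[] args) {
    MonotonicMillisSketch gen = new MonotonicMillisSketch(42L);
    long first = gen.nextMillis();
    long second = gen.nextMillis();
    System.out.println(second > first);
  }
}
```

The trade-off is the one raised in the reply above: ordering becomes an implicit contract of the generator that tests may start to rely on.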

@yihua yihua self-assigned this Feb 21, 2022
Member

@xushiyan xushiyan left a comment

LGTM

- long currentTimeMillis = System.currentTimeMillis();
- Date date = new Date(currentTimeMillis);
+ long randomMillis = genRandomTimeMillis(r);
+ Date date = new Date(randomMillis);
Member

optional: while you're on this, it may be worth migrating away from Date to java.time APIs, to standardize datetime usage in the codebase and to be thread-safe where applicable.

Contributor Author

Yeah, no problem. Which one are you referring to? There's no such thing as DateTime in java.time

Member

It's probably LocalDate we need
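
A possible shape for that migration, as a sketch only: the helper `toUtcLocalDate` and the class `JavaTimeSketch` are hypothetical, and pinning the zone to UTC is an assumption made so the result does not depend on the machine's default time zone.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper; not part of the PR.
final class JavaTimeSketch {
  // Converts epoch milliseconds to a calendar date in UTC.
  static LocalDate toUtcLocalDate(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
  }

  public static void main(String[] args) {
    System.out.println(toUtcLocalDate(0L)); // prints 1970-01-01
  }
}
```

`Instant` and `LocalDate` are immutable, which covers the thread-safety point raised above.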

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Mar 2, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Contributor

@yihua yihua left a comment

LGTM

@yihua yihua merged commit 85f47b5 into apache:master Mar 2, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
4 participants