Conversation

@XuQianJin-Stars
Contributor

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@XuQianJin-Stars changed the title from "[HUDI-3088] Use Spark 3.2 as default Spark version" to "[WIP][HUDI-3088] Use Spark 3.2 as default Spark version" on Feb 6, 2022
@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label on Feb 8, 2022
@nsivabalan added the priority:blocker (Production down; release blocker) label and removed the priority:critical (Production degraded; pipelines stalled) label on Feb 8, 2022
@xushiyan
Member

xushiyan commented Feb 8, 2022

@hudi-bot run azure

@apache deleted a comment from hudi-bot on Feb 8, 2022
return records.stream().map(Helpers::toJsonString).toArray(String[]::new);
}

public static Tuple2<String, String>[] jsonifyRecordsByPartitions(List<HoodieRecord> records, int partitions) {
Member

java class org.apache.hudi.common.util.collection.Pair is preferred here over Tuple2

public static Tuple2<String, String>[] jsonifyRecordsByPartitions(List<HoodieRecord> records, int partitions) {
Tuple2<String, String>[] data = new Tuple2[records.size()];
for (int i = 0; i < records.size(); i++) {
int key = i % partitions;
Member

shouldn't it be better with org.apache.hudi.common.model.HoodieKey#recordKey from each HoodieRecord ?
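
Read together, the two suggestions on this method could look roughly like the sketch below: pair each record's own key with its JSON instead of deriving a key from `i % partitions`. This is only an illustration under stated assumptions. `DemoRecord` and `KeyedJson` are hypothetical stand-ins, and `Map.Entry` stands in for `org.apache.hudi.common.util.collection.Pair` so the snippet compiles without Hudi on the classpath.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HoodieRecord: just a record key and a JSON payload.
final class DemoRecord {
  private final String recordKey;
  private final String json;

  DemoRecord(String recordKey, String json) {
    this.recordKey = recordKey;
    this.json = json;
  }

  String getRecordKey() {
    return recordKey;
  }

  String toJsonString() {
    return json;
  }
}

final class KeyedJson {
  private KeyedJson() {
  }

  // Pairs each record's own key with its JSON, preserving input order.
  // Map.Entry is an assumption standing in for Hudi's Pair type.
  static List<Map.Entry<String, String>> jsonifyByRecordKey(List<DemoRecord> records) {
    List<Map.Entry<String, String>> out = new ArrayList<>(records.size());
    for (DemoRecord r : records) {
      out.add(new SimpleImmutableEntry<>(r.getRecordKey(), r.toJsonString()));
    }
    return out;
  }
}
```

A key taken from the record stays the same for that record across runs, whereas an index-derived key changes whenever the list order does.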

@xushiyan
Member

xushiyan commented Feb 10, 2022

the first job is passing, nice progress. there was a recent fix that might help pass some tests in Spark 3.2: d971974

// So when returnNullIfNotFound is true, catch this exception.
if (!returnNullIfNotFound) {
throw e;
LOG.warn("Failed to get nested field Value ", e);
Contributor

why change this?

Contributor Author

That should not be changed.

@huberylee
Contributor

@hudi-bot run azure

MAVEN_CACHE_FOLDER: $(Pipeline.Workspace)/.m2/repository
MAVEN_OPTS: '-Dmaven.repo.local=$(MAVEN_CACHE_FOLDER) -Dcheckstyle.skip=true -Drat.skip=true -Djacoco.skip=true'
-SPARK_VERSION: '2.4.4'
+SPARK_VERSION: '3.2.0'
Member

this needs to be upgraded to 3.2.1 to align with the version on master

<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
Member

do we need this? I thought it inherits the same version from the root pom

* Helper class to do common stuff across Avro.
*/
public class HoodieAvroUtils {
private static final Logger LOG = LogManager.getLogger(HoodieAvroUtils.class);
Member

LOG is not used. can we revert?

Comment on lines 79 to 94
-Object value = record.get(DebeziumConstants.FLATTENED_OP_COL_NAME);
+Object value;
+if (record.getSchema().getField(DebeziumConstants.FLATTENED_OP_COL_NAME) == null) {
+  value = null;
+} else {
+  value = record.get(DebeziumConstants.FLATTENED_OP_COL_NAME);
+}
Member

the new check does not look clean; it looks like whenever we try to get a column's value, we need to check its nullability? are we able to retrieve the value without this kind of check?
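
If the lookup does have to be guarded, one option is to move the check into a single helper so that each call site stays a one-liner. The following is a minimal sketch, assuming the record's `get` throws for a field the schema lacks. `StrictRecord` and `RecordFields.getOrNull` are hypothetical names used for illustration, not Hudi or Avro APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a schema-backed record whose get() rejects unknown fields.
final class StrictRecord {
  private final Map<String, Object> values;

  StrictRecord(Map<String, Object> values) {
    this.values = Collections.unmodifiableMap(new HashMap<>(values));
  }

  boolean hasField(String name) {
    return values.containsKey(name);
  }

  Object get(String name) {
    if (!hasField(name)) {
      throw new IllegalArgumentException("Not a valid schema field: " + name);
    }
    return values.get(name);
  }
}

final class RecordFields {
  private RecordFields() {
  }

  // Returns the field's value, or null when the record has no such field.
  static Object getOrNull(StrictRecord record, String fieldName) {
    return record.hasField(fieldName) ? record.get(fieldName) : null;
  }
}
```

With a real Avro record, the guard inside such a helper would be the `record.getSchema().getField(name) == null` test from the diff, written once instead of at every lookup.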

Comment on lines 134 to 136
-new Schema.Field("id", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-new Schema.Field("name", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-new Schema.Field("age", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-new Schema.Field("job", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE)
+new Schema.Field("id", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+new Schema.Field("name", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+new Schema.Field("age", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+new Schema.Field("job", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null)
Member

we should prefer to use org.apache.avro.Schema.Field#NULL_DEFAULT_VALUE whenever we want to set default as null

Comment on lines 173 to 176
@Disabled
@Test
public void testMetadataBootstrapNonpartitionedCOW() throws Exception {
testBootstrapCommon(false, false, EffectiveMode.METADATA_BOOTSTRAP_MODE);
}
Member

you can disable for the whole class instead of annotating each method, and add a note in the annotation like @Disabled("<describe the reason>")

Comment on lines 45 to 46
-scala.collection.immutable.List.empty(),
+JavaConverters.collectionAsScalaIterableConverter(new ArrayList<String>()).asScala().toSeq(),
Member

so you need a Seq; why not scala.collection.immutable.List.empty().toSeq()?

Comment on lines 197 to 202
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
<scope>provided</scope>
</dependency>
Member

concern about the impact here: why do we set an old version just for utilities?

cleanupResources();
}

@Disabled
Member

it would be helpful to set a reason

<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
Member

same avro version question

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua added the priority:critical (Production degraded; pipelines stalled) label and removed the priority:blocker (Production down; release blocker) label on Mar 31, 2022
@yihua
Contributor

yihua commented Apr 27, 2022

Closing this one, in favor of #5402 which is going to have all changes for Hadoop/Hive/Spark 3.x upgrades.

Labels

priority:critical Production degraded; pipelines stalled

7 participants