@yihua yihua commented Oct 29, 2021

  • Make HoodieRecord abstract and use HoodieAvroRecord for the implementation for extensibility
  • Make HoodieIndex independent of HoodieRecordPayload
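A minimal sketch of the shape these two bullets describe, using simplified stand-in classes rather than the real Hudi types. All `Sketch*` names, fields, and the map-backed lookup are assumptions made for illustration: the base record no longer bounds its type parameter by a payload interface, the Avro-style subclass keeps that bound, and the index only looks at record keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for an abstract record type.
// Note that T has no payload bound here.
abstract class SketchRecord<T> {
  private final String key;
  private final T data;
  private String location; // null until an index tags the record

  SketchRecord(String key, T data) {
    this.key = key;
    this.data = data;
  }

  String getKey() { return key; }
  T getData() { return data; }
  String getLocation() { return location; }
  void setLocation(String location) { this.location = location; }
}

// Hypothetical marker standing in for a payload interface.
interface SketchPayload { }

// The Avro-flavoured implementation keeps the payload bound on T.
class SketchAvroRecord<T extends SketchPayload> extends SketchRecord<T> {
  SketchAvroRecord(String key, T data) { super(key, data); }
}

// A record whose data is not a payload at all (for example a row).
class SketchRowRecord extends SketchRecord<Object[]> {
  SketchRowRecord(String key, Object[] row) { super(key, row); }
}

// The index depends only on keys, so its method is generic in any R.
class SketchIndex {
  private final Map<String, String> keyToFile = new HashMap<>();

  void register(String key, String fileId) { keyToFile.put(key, fileId); }

  <R> List<SketchRecord<R>> tagLocation(List<SketchRecord<R>> records) {
    List<SketchRecord<R>> tagged = new ArrayList<>();
    for (SketchRecord<R> record : records) {
      // Unknown keys keep a null location, i.e. they are treated as inserts.
      record.setLocation(keyToFile.get(record.getKey()));
      tagged.add(record);
    }
    return tagged;
  }
}
```

With this shape, the same `tagLocation` call accepts payload-backed and row-backed records alike, which is the kind of flexibility the PR title refers to.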

@yihua yihua force-pushed the HUDI-2656-generalize-hoodie-index branch 3 times, most recently from b7989f4 to 47eaf38 Compare October 29, 2021 17:03
@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Nov 3, 2021
@xushiyan xushiyan self-assigned this Nov 7, 2021
@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Nov 15, 2021

package org.apache.hudi.common.model;

public class HoodieAvroRecord<T extends HoodieRecordPayload> extends HoodieRecord<T> {
Contributor

Should be HoodieRecordPayload<T>

Contributor Author

Right. This is more like a retrofit for Java 7 style and users extending old HoodieRecord. This should be fixed in one shot with new APIs.

Contributor

@yihua sorry, not sure i understood your point. Can you elaborate?

Why do we want to extend HoodieAvroRecord with format-specific impl?

Contributor

Discussed offline: this will be an intermediate state until RFC-46 is fully implemented

Contributor Author

As discussed, this is more of an intermediate solution for row writer, before RFC-46 revamps it completely.

import java.io.Serializable;
import java.util.Properties;

public interface HoodieRecordPayloadDelegate<T> extends Serializable {
Contributor

Can you please add a java-doc to explain purpose of this class?

Contributor Author

The idea behind HoodieRecordPayloadDelegate is the same as the record combining/merging API: decouple the merging logic from the record payload by adding an independent interface for merging, which takes two type-generic instances and returns the resulting instance.
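A rough sketch of what such a standalone merge interface could look like. The body of HoodieRecordPayloadDelegate is not shown in this conversation, so the interface name `SketchMerger`, its single `merge` method, and the "latest wins" policy below are all assumptions for illustration.

```java
import java.io.Serializable;

// Hypothetical merge interface: two instances of the same generic type in,
// one merged instance out. It knows nothing about record payload classes.
interface SketchMerger<T> extends Serializable {
  T merge(T older, T newer);
}

final class LatestWins {
  private LatestWins() { }

  // A trivial policy used only for illustration:
  // keep the newer value unless it is null.
  static <T> SketchMerger<T> create() {
    return (older, newer) -> newer != null ? newer : older;
  }
}
```

Because the merge policy lives outside the record class, the same record type can be combined in different ways without subclassing it.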

Member

Since this class is not used in this PR and will probably be replaced with a new API like Mergeable, I'll remove it from this patch.

@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch 2 times, most recently from 8aff109 to af9fb3c Compare January 20, 2022 21:33
@apache apache deleted a comment from hudi-bot Jan 20, 2022
@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch from af9fb3c to fd6f671 Compare January 20, 2022 23:55
@apache apache deleted a comment from hudi-bot Jan 21, 2022
@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch from a9c7cf6 to ff7e9f7 Compare January 22, 2022 06:10
@xushiyan xushiyan changed the title [WIP][HUDI-2656] Generalize HoodieIndex for flexible record data type [HUDI-2656] Generalize HoodieIndex for flexible record data type Jan 22, 2022
@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch from ff7e9f7 to 70ef5f9 Compare January 22, 2022 19:01
Comment on lines -388 to +386
- public Map<String, Integer> mapFileWithInsertsToUniquePartition(JavaRDD<WriteStatus> writeStatusRDD) {
+ Map<String, Integer> mapFileWithInsertsToUniquePartition(JavaRDD<WriteStatus> writeStatusRDD) {
Member

This method is only used in tests; changing it to package-private access because it has an RDD in its signature.

Contributor

Since we don't want to depend on Guava, we can add our own annotation with the same purpose as @VisibleForTesting.
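A sketch of what a project-local annotation might look like. The name `SketchVisibleForTesting`, the chosen targets, and the `SketchHBaseIndex` class are hypothetical. Runtime retention is used here only so the example can be checked by reflection; a purely documentary annotation could use a weaker retention.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: marks members whose visibility was widened
// (for example to package-private) only so that tests can reach them.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.CONSTRUCTOR, ElementType.FIELD, ElementType.TYPE})
@interface SketchVisibleForTesting { }

// Hypothetical class showing the annotation on a package-private method.
class SketchHBaseIndex {
  @SketchVisibleForTesting
  int countPartitions(int files) {
    // Always use at least one partition.
    return Math.max(1, files);
  }
}
```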

Comment on lines -469 to +467
- public Tuple2<Long, Integer> getHBasePutAccessParallelism(final JavaRDD<WriteStatus> writeStatusRDD) {
+ Tuple2<Long, Integer> getHBasePutAccessParallelism(final JavaRDD<WriteStatus> writeStatusRDD) {
Member

ditto

*/

- package org.apache.hudi.client.functional;
+ package org.apache.hudi.index.hbase;
Member

moved to the same package as SparkHoodieHBaseIndex so that some methods with RDD in the signature can be changed to package access

@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch from 70ef5f9 to b466262 Compare January 22, 2022 19:19
HoodieWriteConfig config, HoodieEngineContext context, HoodieTable hoodieTable,
HoodiePairData<String, String> partitionRecordKeyPairs,
- HoodieData<ImmutablePair<String, HoodieKey>> fileComparisonPairs,
+ HoodieData<Pair<String, HoodieKey>> fileComparisonPairs,
Member

hide implementation from public API

Comment on lines 59 to 66
@Override
@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
public <R> HoodieData<HoodieRecord<R>> tagLocation(
HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
HoodieTable hoodieTable) throws HoodieIndexException {
- return HoodieList.of(tagLocation(
- HoodieList.getList(records.map(record -> (HoodieRecord<T>) record)), context, hoodieTable));
+ List<HoodieRecord<T>> hoodieRecords = tagLocation(HoodieList.getList(records.map(record -> (HoodieRecord<T>) record)), context, hoodieTable);
+ return HoodieList.of(hoodieRecords.stream().map(r -> (HoodieRecord<R>) r).collect(Collectors.toList()));
}
Member

@yihua I can't seem to find a better way than this double casting.

Contributor Author

I think this is fine since this is more like an intermediate fix. Once RFC-46 is fully done, the logic here should be much cleaner without generics.
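For readers unfamiliar with the pattern, here is a self-contained sketch of the double cast using stand-in types (`Rec` and `LegacyIndex` are hypothetical, not Hudi classes). A legacy method is bound to the class-level type parameter `T`, while the new method is generic in `R`, so records are cast from `R` to `T` on the way in and back from `T` to `R` on the way out. Both casts are unchecked and only work because generics are erased at runtime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical minimal record type.
class Rec<T> {
  final String key;
  final T data;
  boolean tagged;

  Rec(String key, T data) {
    this.key = key;
    this.data = data;
  }
}

// Hypothetical index with a legacy method tied to the class-level T.
class LegacyIndex<T> {
  // Legacy API: only accepts records parameterized by T.
  List<Rec<T>> tagLocation(List<Rec<T>> records) {
    for (Rec<T> record : records) {
      record.tagged = true; // the data field is never inspected
    }
    return records;
  }

  // New API: generic in R, delegating to the legacy method via two casts.
  @SuppressWarnings("unchecked")
  <R> List<Rec<R>> tagLocationGeneric(List<Rec<R>> records) {
    List<Rec<T>> asT = records.stream()
        .map(r -> (Rec<T>) r)   // first cast: R -> T
        .collect(Collectors.toList());
    List<Rec<T>> taggedRecords = tagLocation(asT);
    return taggedRecords.stream()
        .map(r -> (Rec<R>) r)   // second cast: T -> R
        .collect(Collectors.toList());
  }
}
```

The casts are safe here only because the legacy method never reads `data` as a `T`; if it did, a `ClassCastException` could surface far from the cast site.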

@yihua yihua self-assigned this Jan 24, 2022
*/
public static HoodieRecord getTaggedRecord(HoodieRecord inputRecord, Option<HoodieRecordLocation> location) {
- HoodieRecord record = inputRecord;
+ HoodieRecord<?> record = inputRecord;
Contributor

Thanks for fixing this!


* A Single Record managed by Hoodie.
*/
- public class HoodieRecord<T extends HoodieRecordPayload> implements Serializable {
+ public abstract class HoodieRecord<T> implements Serializable {
Contributor

@yihua @xushiyan let's chat more on this to make sure we're aligned on the approach going forward:

I was thinking of keeping this component file-format agnostic and instead making it engine-specific, while refactoring the MOR table read path for efficient querying.

Can you elaborate what's the goal you're striving for w/ HoodieAvroRecord?

P.S. Putting this context in here for somebody who might not be aware of previous conversations

Member

@alexeykudinkin The main goal here is to make HoodieRecord independent of HoodieRecordPayload, which will allow HoodieRowRecord to be used by the row-writer code path. HoodieAvroRecord is also meant to keep compatibility with the existing codebase and to ease the decoupling. Yes, we should discuss HoodieAvroRecord vs SparkHoodieRecord further.

Contributor

Discussed offline: this will be an intermediate state until RFC-46 is fully implemented

@xushiyan xushiyan force-pushed the HUDI-2656-generalize-hoodie-index branch 2 times, most recently from 455714a to de5b0b9 Compare January 31, 2022 17:35
@PublicAPIMethod(maturity = ApiMaturityLevel.DEPRECATED)
public I tagLocation(I records, HoodieEngineContext context,
- HoodieTable<T, I, K, O> hoodieTable) throws HoodieIndexException {
+ HoodieTable hoodieTable) throws HoodieIndexException {
Contributor

As a general rule of thumb, I think we should aim to fix all raw-type occurrences in the codebase.

Can you please fix new instances so that we at least don't add to our debt?
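To illustrate the concern with raw types, here is a small sketch with a hypothetical `SketchTable` class (not the real HoodieTable). With a raw parameter the compiler stops checking type arguments, so a value of the wrong type can slip in; a wildcard-parameterized parameter accepts any table just as well but keeps the checks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a table type with three type parameters.
class SketchTable<I, K, O> {
  private final List<K> keys = new ArrayList<>();

  void addKey(K key) { keys.add(key); }
  List<K> getKeys() { return keys; }
}

class RawTypeDemo {
  // Raw parameter: type arguments are ignored by the compiler,
  // so any object can be added as a key (heap pollution).
  @SuppressWarnings({"rawtypes", "unchecked"})
  static void addViaRaw(SketchTable table, Object key) {
    table.addKey(key);
  }

  // Wildcard parameter: still accepts every SketchTable, but a call such as
  // table.addKey("x") would be rejected at compile time inside this method.
  static int countKeys(SketchTable<?, ?, ?> table) {
    return table.getKeys().size();
  }
}
```

Calling `addViaRaw` on a `SketchTable<String, String, String>` with an `Integer` compiles and runs, leaving a non-`String` key in the table; the failure would only show up later when the key is read as a `String`.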

Member

@alexeykudinkin Yes, I plan to make a follow-up PR to remove T from HoodieTable; then the interface can take HoodieTable<I, K, O>.

Contributor Author

@xushiyan @alexeykudinkin Since this is anyway backward incompatible, should we just remove the deprecated public API methods and get rid of I and O as well? The reason to keep these methods and the generics is to adapt for users extending these APIs. If you want to change the generics, I'd prefer that all such generics changes in relation to all public APIs should get in 0.11.0 release in one shot.

Contributor Author

The deprecated APIs are going to be removed in a separate PR to keep the scope of this PR limited.

@PublicAPIMethod(maturity = ApiMaturityLevel.DEPRECATED)
public O updateLocation(O writeStatuses, HoodieEngineContext context,
- HoodieTable<T, I, K, O> hoodieTable) throws HoodieIndexException {
+ HoodieTable hoodieTable) throws HoodieIndexException {
Contributor

Here as well

Contributor Author

@yihua yihua left a comment

Overall LGTM. Good job on revamping this! I put up one major call we need to make and a few nits.

&& !recordLocationHoodieKeyPair.get().getRight().getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
// Create an empty record to delete the record in the old partition
- HoodieRecord<T> deleteRecord = new HoodieRecord(recordLocationHoodieKeyPair.get().getRight(),
+ HoodieRecord<R> deleteRecord = new HoodieAvroRecord(recordLocationHoodieKeyPair.get().getRight(),
Contributor Author

This needs to be revisited to make it Avro agnostic later on.

if (config.getGlobalSimpleIndexUpdatePartitionPath() && !(inputRecord.getPartitionPath().equals(partitionPath))) {
// Create an empty record to delete the record in the old partition
- HoodieRecord<T> deleteRecord = new HoodieRecord(new HoodieKey(inputRecord.getRecordKey(), partitionPath), new EmptyHoodieRecordPayload());
+ HoodieRecord<R> deleteRecord = new HoodieAvroRecord(new HoodieKey(inputRecord.getRecordKey(), partitionPath), new EmptyHoodieRecordPayload());
Contributor Author

Similar here for revisiting later on.

Comment on lines 47 to +50
@PublicAPIMethod(maturity = ApiMaturityLevel.DEPRECATED)
public abstract List<WriteStatus> updateLocation(List<WriteStatus> writeStatuses,
HoodieEngineContext context,
- HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> hoodieTable) throws HoodieIndexException;
+ HoodieTable hoodieTable) throws HoodieIndexException;
Contributor Author

Similar for engine-specific HoodieIndex classes to remove deprecated API methods altogether (in a separate PR).


@yihua yihua force-pushed the HUDI-2656-generalize-hoodie-index branch from d7a82da to e07767a Compare February 4, 2022 01:44
@apache apache deleted a comment from hudi-bot Feb 4, 2022
@yihua yihua merged commit b8601a9 into apache:master Feb 4, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022

5 participants