Skip to content

Conversation

@szehon-ho
Copy link
Member

Fixup for #7389

Ref: #7389 (review)

fileGroupsByPartition.put(partition, groups);
}
});
StructLikeMap<List<PositionDeletesScanTask>> filesPerPartition =
Copy link
Contributor

@aokolnychyi aokolnychyi May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: filesPerPartition -> filesByPartition to match other variables?
I also wonder whether we should call it tasksByPartition but that is up to you.

Copy link
Member Author

@szehon-ho szehon-ho May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly its a bit confusing. The API's refer to files(), FileGroups, etc but the class is FileTask. At some point we will need to change from task to file terminology. I end up calling them fileTasks, not sure its the best way.

StructLikeMap<List<List<PositionDeletesScanTask>>> fileGroupsByPartition =
StructLikeMap.create(partitionType);

filesByPartition.forEach(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: It seems we would benefit from an easy way to transform StructLikeMap values. It is used here and below when we construct the context. Would adding something like this to StructLikeMap be an overkill?

public <U> StructLikeMap<U> transformValues(Function<T, U> func) {
  StructLikeMap<U> map = create(type);

  for (Entry<StructLikeWrapper, T> entry : wrapperMap.entrySet()) {
    U newValue = func.apply(entry.getValue());
    map.put(entry.getKey().get(), newValue);
  }

  return map;
}

Then the entire method fileGroupsByPartition can be expressed in one line:

tasksByPartition.transformValues(this::planFileGroups);

...

private List<List<PositionDeletesScanTask>> planFileGroups(List<PositionDeletesScanTask> tasks) {
  return ImmutableList.copyOf(rewriter.planFileGroups(tasks));
}

And RewriteExuctionContext would call:

this.numGroupsByPartition = groupsByPartition.transformValues(List::size);

We would need to filter out empty file group lists but we can do that in toGroupStream. I am not sure it is worth it so I leave it up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea, implemented it. I was wondering initially as its not the pattern of the static method like CloseableIterable.transform(), but I think putting on instance level makes the code shorter and even more functional and composable.

}
}

public <U> StructLikeMap<U> transformValues(Function<T, U> func) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your idea of making this a static method also makes sense to me. It would be more in line with Guava and other similar libs. If you decide to make it static, keep in mind the ordering of methods. Static methods should be right after the constructor.

public static <T, U> StructLikeMap<U> transformValues(StructLikeMap<T> map, Function<T, U> func) {
  ...
}

}
}

private Map<StructLike, List<List<PositionDeletesScanTask>>> planFileGroups() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: A note to the above snippet. What about getting rid of commitManager var and just calling the method directly?

if (partialProgressEnabled) {
  return doExecuteWithPartialProgress(ctx, groupStream, commitManager());
} else {
  return doExecute(ctx, groupStream, commitManager());
}


return CloseableIterable.transform(
deletesTable.newBatchScan().ignoreResiduals().planFiles(),
t -> (PositionDeletesScanTask) t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: t -> task?

t -> (PositionDeletesScanTask) t);
}

private StructLikeMap<List<PositionDeletesScanTask>> filesByPartition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: What about calling it groupByPartition?

return scanGroups.stream().map(tasks -> newRewriteGroup(ctx, partition, tasks));
});

return rewriteFileGroupStream.sorted(RewritePositionDeletesGroup.comparator(rewriteJobOrder));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: What about getting rid of the temp var completely?

return groupsByPartition.entrySet().stream()
    .filter(...)
    .flatMap(...)
    .sorted(RewritePositionDeletesGroup.comparator(rewriteJobOrder));

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I like your idea of having transformValues as static but it is up to you. Please, merge whenever you are ready.

Could you also mirror these changes to the action for rewriting data files, @szehon-ho?

@aokolnychyi aokolnychyi added this to the Iceberg 1.3.0 milestone May 16, 2023
@szehon-ho
Copy link
Member Author

Made #7625 to mirror changes in RewriteDataFilesSparkAction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants