Core : Repair manifests #2608

szehon-ho · 2021-05-18T21:06:28Z

Took Russell's suggestion; it's actually a lot easier to add an option to RewriteManifestsAction (which is already well distributed). Proposal: a RepairMode option. If it is REPAIR_ENTRIES, the action goes to the FileSystem to read metadata about each data file and updates the manifest entry before the rewrite.

A list is added to the RewriteManifestsAction result. Each item includes the manifest file in which entries were repaired, plus a list of the field names that were repaired, as a summary.

Next steps:

- Repair split offsets (not implemented yet; I just wanted to get a first version out)
- More modes: RepairMode.REMOVE_DELETED_FILES, RepairMode.ADD_NEW_DATA_FILES
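
A minimal sketch of the REPAIR_ENTRIES idea described above, assuming a local file system and hypothetical stand-in types (`EntrySketch` and `RepairEntriesSketch` are illustrative names, not classes from this PR or from Iceberg). It only covers the file size; the PR also recomputes metrics.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a manifest entry; not an Iceberg class.
final class EntrySketch {
  final Path path;
  final long recordedSizeInBytes;

  EntrySketch(Path path, long recordedSizeInBytes) {
    this.path = path;
    this.recordedSizeInBytes = recordedSizeInBytes;
  }
}

final class RepairEntriesSketch {
  private RepairEntriesSketch() {
  }

  // Returns the same entry when the recorded size matches the file system,
  // otherwise a new entry that carries the size actually found on disk.
  static EntrySketch repair(EntrySketch entry) throws IOException {
    long actualSize = Files.size(entry.path);
    if (actualSize == entry.recordedSizeInBytes) {
      return entry;
    }
    return new EntrySketch(entry.path, actualSize);
  }
}
```

Returning the original instance when nothing changed makes it cheap for a caller to detect whether a manifest needs rewriting at all.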

szehon-ho · 2021-05-18T21:07:18Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+  /**
+   * Represents a repaired DataFile
+   */
+  public static class RepairedDataFile {


Needs to be public for Spark serialization

Maybe put this comment into class's comment as well, "Needs to be public for Spark serialization"?

szehon-ho · 2021-05-18T22:35:27Z

FYI @aokolnychyi @RussellSpitzer @flyrain

(The javadoc failure doesn't look related: Could not GET 'https://plugins.gradle.org/m2/org/jetbrains/kotlin/kotlin-stdlib-jdk8/1.3.50/kotlin-stdlib-jdk8-1.3.50.jar'.)

flyrain

Looks good to me overall.
Besides, we are trying to implement a snapshot integrity checker, which also needs data file statistics checking as an option. In that case, we may NOT need to repair manifests, only report any mismatch, so getting a human-readable diff report matters more. It'd be nice if the diff report were more human-readable, or at least more extensible, so we can add more formats afterwards.

flyrain · 2021-05-18T23:59:38Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

  private PartitionSpec spec = null;
  private Predicate<ManifestFile> predicate = manifest -> true;
  private String stagingLocation = null;
+  private RepairMode mode = RepairMode.NONE;


"mode" -> "repairMode", make it more descriptive?

Do we need a list of modes here? IIUC, we are trying to add more modes, and users can select one or many of them.

flyrain · 2021-05-19T06:31:48Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+     * @param conf Hadoop configuration
+     * @param metricsSpec metrics configuration
+     * @param mapping name mapping
+     * @return metrics


The param comments order isn't correct. Do we even need them? The names have told the story.

Pretty sure our checkstyle will be mad if they are missing

Removed the comment

flyrain · 2021-05-19T07:12:14Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+  static Set<String> diff(DataFile first, DataFile second) {
+    Set<String> result = new HashSet<>();
+    if (first.fileSizeInBytes() != second.fileSizeInBytes()) {
+      result.add("file_size_in_bytes");


"file_size_in_bytes" -> FILE_SIZE.name() ?
This applies to the following strings as well.
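
A small sketch of this suggestion: take the field names from shared constants instead of repeating string literals. `FieldSketch` and `FileStatsSketch` are hypothetical stand-ins so the example is self-contained; in the PR the constant would presumably be the existing `FILE_SIZE` field referenced above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a schema field that knows its own name.
final class FieldSketch {
  static final FieldSketch RECORD_COUNT = new FieldSketch("record_count");
  static final FieldSketch FILE_SIZE = new FieldSketch("file_size_in_bytes");

  private final String name;

  private FieldSketch(String name) {
    this.name = name;
  }

  String name() {
    return name;
  }
}

// Hypothetical stand-in for the two statistics compared below.
final class FileStatsSketch {
  final long recordCount;
  final long fileSizeInBytes;

  FileStatsSketch(long recordCount, long fileSizeInBytes) {
    this.recordCount = recordCount;
    this.fileSizeInBytes = fileSizeInBytes;
  }
}

final class DiffSketch {
  private DiffSketch() {
  }

  // Returns the names of the fields that differ, taken from the constants above.
  static Set<String> diff(FileStatsSketch first, FileStatsSketch second) {
    Set<String> result = new LinkedHashSet<>();
    if (first.recordCount != second.recordCount) {
      result.add(FieldSketch.RECORD_COUNT.name());
    }
    if (first.fileSizeInBytes != second.fileSizeInBytes) {
      result.add(FieldSketch.FILE_SIZE.name());
    }
    return result;
  }
}
```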

flyrain · 2021-05-19T07:18:43Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

+          Optional<RepairManifestHelper.RepairedDataFile> repaired =
+                  RepairManifestHelper.repairDataFile(dataFile, broadcastTable.value(), spec, conf.value().value());
+          if (repaired.isPresent()) {
+            repairedColumns.addAll(repaired.get().repairedFields());


Is it readable if we put the fields of multiple files in one set? Are we going to distinguish the diffs for each data file?

flyrain · 2021-05-19T17:32:55Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+          MetricsConfig.fromProperties(table.properties()), nameMapping));
+
+      DataFile newFile = newDfBuilder.build();
+      Set<String> diff = RepairManifestHelper.diff(file, newFile);


A nit: RepairManifestHelper.diff -> diff

RussellSpitzer · 2021-05-19T18:02:50Z

spark/src/main/java/org/apache/iceberg/actions/RewriteManifestsAction.java

+   * @param mode repair mode
+   * @return this for method chaining
+   */
+  public RewriteManifestsAction repair(org.apache.iceberg.actions.RewriteManifests.RepairMode mode) {


I don't think you need to update this version since it's deprecated

RussellSpitzer · 2021-05-19T18:02:58Z

spark/src/main/java/org/apache/iceberg/actions/RewriteManifestsActionResult.java

 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;

 @Deprecated
 public class RewriteManifestsActionResult {


Also deprecated

RussellSpitzer · 2021-05-19T18:05:52Z

spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java

    return this;
  }

+  public SparkDataFile withSpecId(int newSpecId) {


Did you file the proposal to add specId to the DataFile? I know I've had a few temporary PRs where I thought about doing this. We could change the reader itself so that this is always populated, and I think there is an active PR to do this somewhere as well.

RussellSpitzer · 2021-05-19T18:09:35Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

+              manifestEntryDF, targetNumManifests, targetNumManifestEntries);
    }

+    List<ManifestFile> newManifests = repairedManifests.stream().map(


nit:

.stream()
.map(RepairManifestHelper.RepairedManifestFile::manifestFile)
.collect
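
For reference, a self-contained sketch of the chained form with a method reference. `RepairedManifestSketch` is a hypothetical stand-in for `RepairManifestHelper.RepairedManifestFile`, and a plain `String` stands in for `ManifestFile`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for RepairManifestHelper.RepairedManifestFile.
final class RepairedManifestSketch {
  private final String manifestFile;

  RepairedManifestSketch(String manifestFile) {
    this.manifestFile = manifestFile;
  }

  String manifestFile() {
    return manifestFile;
  }
}

final class StreamNitSketch {
  private StreamNitSketch() {
  }

  // One call per line, with a method reference instead of a lambda.
  static List<String> manifestFiles(List<RepairedManifestSketch> repaired) {
    return repaired.stream()
        .map(RepairedManifestSketch::manifestFile)
        .collect(Collectors.toList());
  }
}
```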

szehon-ho · 2021-05-19T18:10:04Z

@flyrain thanks for the review, so to understand, you would prefer a result of Map of individual manifest-entry changes instead of a summary of manifest-files changed? I was thinking that but was fearing it would be too big of a result.

But if we want this way (and extend this to a diff tool), we could change to return a List of something like:

RepairedManifestEntry(ManifestFile parentFile, DataFile entry, List repairedFields)

Is that in line with your thoughts?
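
To make the trade-off concrete, here is a sketch of how per-entry results of that shape would collapse into the per-manifest summary the PR returns today. All types are hypothetical stand-ins (strings for `ManifestFile` and `DataFile`); the per-entry list keeps which data file was touched, while the summary keeps only field names.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical per-entry result, modeled on the RepairedManifestEntry idea above.
final class RepairedEntrySketch {
  final String manifestPath;
  final String dataFilePath;
  final List<String> repairedFields;

  RepairedEntrySketch(String manifestPath, String dataFilePath, List<String> repairedFields) {
    this.manifestPath = manifestPath;
    this.dataFilePath = dataFilePath;
    this.repairedFields = repairedFields;
  }
}

final class RepairSummarySketch {
  private RepairSummarySketch() {
  }

  // Collapses per-entry results into one set of repaired field names per manifest.
  static Map<String, Set<String>> summarize(List<RepairedEntrySketch> entries) {
    Map<String, Set<String>> summary = new TreeMap<>();
    for (RepairedEntrySketch entry : entries) {
      summary.computeIfAbsent(entry.manifestPath, key -> new TreeSet<>())
          .addAll(entry.repairedFields);
    }
    return summary;
  }
}
```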

flyrain · 2021-05-19T19:42:25Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+
+  /**
+   * Diffs two DataFile for potential for repair
+   * @return a set of fields in human-readable format that differ between these DataFiles


@RussellSpitzer, checkstyle didn't report any issue here but param comments are missing.

Although for one-liners I believe we write "Returns ..." without the tag, so that the summary field gets populated with the full description.

flyrain · 2021-05-19T20:00:09Z

@flyrain thanks for the review, so to understand, you would prefer a result of Map of individual manifest-entry changes instead of a summary of manifest-files changed? I was thinking that but was fearing it would be too big of a result.

Yes, a Map works here. Your concern is valid, though; the size varies dramatically. For a table with 1 TB of data and an average file size of 256 MB, we get 1,000,000 / 256 ≈ 4,000 data files. At roughly 100 bytes per data file, that is only about 400 KB, but it scales linearly with table size: a 1 PB table would have about 4 million data files and about 400 MB of results, which sounds like too much to me as well.

In that sense, I'm OK with the current implementation, we can think about the different way to handle the future requirement.
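
A back-of-envelope check of the result-size estimate, assuming decimal units (1 MB = 10^6 bytes) and the figure of roughly 100 bytes per data file quoted in this thread. `ResultSizeEstimate` is a hypothetical helper written only for this calculation.

```java
final class ResultSizeEstimate {
  private ResultSizeEstimate() {
  }

  // Estimated size in bytes of a result that holds one record per data file.
  static long estimateBytes(long tableBytes, long avgFileBytes, long bytesPerFileResult) {
    long fileCount = (tableBytes + avgFileBytes - 1) / avgFileBytes; // ceiling division
    return fileCount * bytesPerFileResult;
  }
}
```

With 256 MB files and 100 bytes per file, `estimateBytes(1_000_000_000_000L, 256_000_000L, 100L)` gives 390,700 bytes (about 0.4 MB) for a 1 TB table, and `estimateBytes(1_000_000_000_000_000L, 256_000_000L, 100L)` gives 390,625,000 bytes (about 390 MB) for a 1 PB table.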

rdblue · 2021-05-19T23:33:05Z

core/src/main/java/org/apache/iceberg/actions/BaseRewriteManifestsActionResult.java

  public BaseRewriteManifestsActionResult(Iterable<ManifestFile> rewrittenManifests,
-                                          Iterable<ManifestFile> addedManifests) {
+                                          Iterable<ManifestFile> addedManifests,
+                                          Iterable<BaseRepairedManifestFile> repairedManifests) {


You probably didn't intend to use the implementation class in this API, right?

rdblue · 2021-05-19T23:33:26Z

core/src/main/java/org/apache/iceberg/actions/BaseRewriteManifestsActionResult.java

+    return (Iterable<RepairedManifest>) repairedManifests;
+  }
+
+  public static class BaseRepairedManifestFile implements RewriteManifests.Result.RepairedManifest {


The implementation class shouldn't be public.

rdblue · 2021-05-19T23:36:29Z

api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java

+   * @param mode repair mode
+   * @return this for method chaining
+   */
+  RewriteManifests repair(RepairMode mode);


I generally prefer to use configuration methods rather than enums in public APIs. That would also allow us to avoid the NONE option and have the actual repair be a bit more specific, like repairFileLengths(). What do you guys think, @szehon-ho, @RussellSpitzer?
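
A sketch of the configuration-method style being proposed, as opposed to `repair(RepairMode mode)`. Only `repairFileLengths()` is named in the comment above; `repairMetrics()` and the class `RepairOptionsSketch` are hypothetical, added to show how a second repair would compose.

```java
// Hypothetical options holder; not the API of this PR.
final class RepairOptionsSketch {
  private boolean repairFileLengths = false;
  private boolean repairMetrics = false;

  // Each repair is opted into by its own method, so no NONE value is needed.
  RepairOptionsSketch repairFileLengths() {
    this.repairFileLengths = true;
    return this;
  }

  // Hypothetical second repair, to show that options combine by chaining.
  RepairOptionsSketch repairMetrics() {
    this.repairMetrics = true;
    return this;
  }

  boolean shouldRepairFileLengths() {
    return repairFileLengths;
  }

  boolean shouldRepairMetrics() {
    return repairMetrics;
  }
}
```

A caller would write `new RepairOptionsSketch().repairFileLengths()` and could later chain further repairs without any change to an enum.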

rdblue · 2021-05-19T23:41:35Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

  public BaseRewriteManifestsSparkAction(SparkSession spark, Table table) {
    super(spark);
-    this.manifestEncoder = Encoders.javaSerialization(ManifestFile.class);
+    this.manifestEncoder = Encoders.javaSerialization(RepairManifestHelper.RepairedManifestFile.class);


Are these changes actually needed? The RepairedManifestFile interface sends back the fields that were repaired. Is that specific to a file, and empty if, for example, the lengths for all manifest entries were already correct?

I don't see much benefit to doing it that way. If we were to have more specific repair operations, then we wouldn't need that interface at all because we'd already know which manifest fields are being fixed.

OK, yes, I was trying to think of something useful to return to the user. But maybe it's not very useful since it's not specific, as per the discussion with @flyrain earlier in the review, and returning all patched manifest entries is a bit overkill. I'm OK with just returning the list of repaired ManifestFiles.

rdblue · 2021-05-19T23:44:08Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

+    SparkDataFile wrapper = new SparkDataFile(dataFileType, sparkType).withSpecId(spec.specId());

    ManifestWriter<DataFile> writer = ManifestFiles.write(format, spec, outputFile, null);
+    Set<String> repairedColumns = new HashSet<String>();


Minor: we use factory methods, like Sets.newHashSet().

rdblue · 2021-05-19T23:44:21Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java

+        SparkDataFile dataFile = wrapper.wrap(file);
+        if (mode == RepairMode.REPAIR_ENTRIES) {
+          Optional<RepairManifestHelper.RepairedDataFile> repaired =
+                  RepairManifestHelper.repairDataFile(dataFile, broadcastTable.value(), spec, conf.value().value());


Nit: indentation is off.

rdblue · 2021-05-19T23:45:47Z

spark/src/main/java/org/apache/iceberg/spark/actions/RepairManifestHelper.java

+     * @param mapping name mapping
+     * @return metrics
+     */
+  private static Metrics getMetrics(FileFormat format, FileStatus status, Configuration conf,


It is rare for an Iceberg method to use get because it almost never aids understanding of what the method does. Unless the method is a getter method on a java bean and the class needs to conform to that convention, we should find a more descriptive verb.

rdblue · 2021-05-19T23:47:45Z

@szehon-ho, @RussellSpitzer, I'm debating whether I agree with the choice to add repair operations to the rewrite manifests action. I think it's very different and significantly more expensive to touch each data file. I think it would make sense for repairs to be done by a repair action. We can share a lot of the implementation, but I think it makes more sense to have an action dedicated to this.

szehon-ho · 2021-05-20T11:40:07Z

Great to see you back @rdblue :)

Yes, after this first implementation, I see some advantages to having a dedicated RepairManifestAction. RewriteManifestAction is compaction-oriented, so by design it cannot run across two separate partition specs, whereas RepairManifests should be able to, as it would not combine manifest files.

And yes in general, I see the two can be conceptually different like you said. I can spend some time to look at making this separate action, and refactor common code to the base class.

szehon-ho · 2021-06-21T20:18:02Z

Following the discussion that this deserves to be its own action rather than part of RewriteManifests, I completely rewrote RepairManifests as a separate Spark action (BaseRepairManifestsSparkAction) and moved the logic shared between it and BaseRewriteManifestsSparkAction into a base class: BaseManifestSparkAction.

In summary, it distributes the repair: it first groups all entries by ManifestFile, calculates what needs to be repaired for each entry by reading various aspects of the data file the entry points to, and writes all of a manifest's entries back out if any of them needed repair (the manifest file still retains the same number of entries).
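
The three steps in this summary can be sketched on plain collections. This is a single-process illustration under stated assumptions, not the Spark implementation: strings stand in for manifest entries, and `EntryRowSketch`, `GroupAndRepairSketch`, and the `repairEntry` function are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical row: which manifest an entry came from, plus the entry itself.
final class EntryRowSketch {
  final String manifestPath;
  final String dataFile;

  EntryRowSketch(String manifestPath, String dataFile) {
    this.manifestPath = manifestPath;
    this.dataFile = dataFile;
  }
}

final class GroupAndRepairSketch {
  private GroupAndRepairSketch() {
  }

  // Returns only the manifests that need rewriting, each with all of its entries.
  static Map<String, List<String>> repair(List<EntryRowSketch> rows, UnaryOperator<String> repairEntry) {
    // Step 1: group entries by the manifest they belong to.
    Map<String, List<String>> byManifest = new LinkedHashMap<>();
    for (EntryRowSketch row : rows) {
      byManifest.computeIfAbsent(row.manifestPath, key -> new ArrayList<>()).add(row.dataFile);
    }

    Map<String, List<String>> rewritten = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> manifest : byManifest.entrySet()) {
      // Step 2: compute the repaired form of every entry.
      List<String> repaired = new ArrayList<>();
      boolean anyRepaired = false;
      for (String entry : manifest.getValue()) {
        String fixed = repairEntry.apply(entry);
        anyRepaired |= !fixed.equals(entry);
        repaired.add(fixed);
      }
      // Step 3: rewrite the whole manifest only if at least one entry changed.
      if (anyRepaired) {
        rewritten.put(manifest.getKey(), repaired);
      }
    }
    return rewritten;
  }
}
```

Note that a rewritten manifest keeps the same number of entries as the original, since unchanged entries are written back alongside repaired ones.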

Not all logic can be shared. In the repair path, the specId is read from the original manifest file and kept around to write the repaired manifest file (rather than being passed in).

There is also a problem I noticed: the ManifestFiles returned by the RewriteManifests action are wrong if "snapshotIdInheritanceEnabled" is false (as this path rewrites the manifest file to a final location). So I fixed the method while extracting it from BaseRewriteManifestsSparkAction to the new base class. (A subsequent change can fix this issue and add a test in RewriteManifests.)

@rdblue @aokolnychyi @flyrain @RussellSpitzer if you guys have time for another look

RussellSpitzer · 2021-06-21T22:15:42Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRepairManifestsSparkAction.java

+
+  @Override
+  public Result execute() {
+    String desc = String.format("Rewriting manifests (staging location=%s) of %s", stagingLocation, getTable().name());


Repairing manifests ?

RussellSpitzer · 2021-06-21T22:15:59Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRepairManifestsSparkAction.java

+  @Override
+  public Result execute() {
+    String desc = String.format("Rewriting manifests (staging location=%s) of %s", stagingLocation, getTable().name());
+    JobGroupInfo info = newJobGroupInfo("REWRITE-MANIFESTS", desc);


REPAIR-MANIFESTS

RussellSpitzer · 2021-06-21T22:20:09Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRepairManifestsSparkAction.java

+        .filter("status < 2") // select only live entries
+        .selectExpr("input_file_name() as manifest_path", "snapshot_id", "sequence_number", "data_file");
+
+    return manifestEntryDF.as("manifest_entry")


I still think we should fix this in the actual ManifestEntry Row and have the spec generated at read time. It feels a little odd to me to be joining to reconnect it when we knew the right ID when we read it the first time. I'm fine with this for now, but we really need to get SpecID into that metadata table.

I see, I will file a ticket for that if none exists. What is the policy for adding to metadata table schemas, i.e. do you know if we have a backward-compatibility policy?

RussellSpitzer · 2021-06-21T22:23:31Z

spark/src/main/java/org/apache/iceberg/spark/actions/BaseRepairManifestsSparkAction.java

+
+      String manifestName = "repaired-m-" + UUID.randomUUID();
+      Path newManifestPath = new Path(location, manifestName);
+      OutputFile outputFile = io.value().newOutputFile(FileFormat.AVRO.addExtension(newManifestPath.toString()));


Would it make sense to parallelize this as well, using Tasks.foreach?

Do you mean parallelizing this call: writer.existing(e.dataFile, e.snapshotId, e.sequenceNumber)?

That does not seem thread-safe. Looking through the writers in the writer stack (ManifestWriter, AvroFileAppender, DataFileWriter, GenericAvroWriter, the Avro writers...), many of them hold state such as counts that isn't safe to update from multiple threads.
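
One way to get parallelism without sharing a writer, sketched with a plain `ExecutorService` rather than `Tasks.foreach`: run one task per manifest and confine each writer to its own task. This is an assumption about a possible design, not what the PR does. `CountingWriterSketch` is a hypothetical writer with unsynchronized state, standing in for the writer stack mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical writer with an unsynchronized counter; unsafe to share across threads.
final class CountingWriterSketch {
  private int count = 0;

  void add(String entry) {
    count += 1;
  }

  int count() {
    return count;
  }
}

final class PerManifestParallelismSketch {
  private PerManifestParallelismSketch() {
  }

  // Writes each manifest in its own task; returns the entry count per manifest, in input order.
  static List<Integer> writeAll(List<List<String>> manifests, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (List<String> entries : manifests) {
        Callable<Integer> task = () -> {
          // The writer is created and used inside one task, so it is never shared.
          CountingWriterSketch writer = new CountingWriterSketch();
          for (String entry : entries) {
            writer.add(entry);
          }
          return writer.count();
        };
        futures.add(pool.submit(task));
      }
      List<Integer> counts = new ArrayList<>();
      for (Future<Integer> future : futures) {
        counts.add(future.get());
      }
      return counts;
    } finally {
      pool.shutdown();
    }
  }
}
```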

github-actions · 2024-07-17T00:13:20Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the Iceberg dev mailing list. Thank you for your contributions.

github-actions · 2024-07-24T00:14:00Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

szehon-ho commented May 18, 2021

szehon-ho changed the title ~~Repair manifests~~ Core : Repair manifests May 18, 2021

github-actions bot added API core spark labels May 18, 2021

flyrain reviewed May 19, 2021

RussellSpitzer reviewed May 19, 2021

flyrain reviewed May 19, 2021

rdblue reviewed May 19, 2021

Szehon Ho added 2 commits June 18, 2021 17:51

Repair manifests

3f5d990

Make repair manifest action

45d0a8e

szehon-ho force-pushed the repair_manifests_apache branch from a85fc18 to 45d0a8e Compare June 21, 2021 19:38

Rename method and remove comments as per previous review.

685581c

Add test for case when no manifests need repair.

059193a

RussellSpitzer reviewed Jun 21, 2021

Fix copy-paste errors

4c7fa1b

szehon-ho mentioned this pull request Aug 24, 2021

Add data_file.spec_id to metadata tables #3015

Merged

szehon-ho mentioned this pull request Dec 1, 2021

Regenerate the metadata of an Iceberg table #3639

Closed

szehon-ho mentioned this pull request Jun 4, 2024

Repair manifest action #10445

Closed

github-actions bot added the stale label Jul 17, 2024

github-actions bot closed this Jul 24, 2024
