[Hotfix][GOBBLIN-1949] add option to detect malformed orc during commit #3818
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -61,6 +61,7 @@ public abstract class GobblinBaseOrcWriter<S, D> extends FsDataWriter<D> { | |
| protected int batchSize; | ||
| protected final S inputSchema; | ||
|
|
||
| private final boolean validateORCDuringCommit; | ||
| private final boolean selfTuningWriter; | ||
| private int selfTuneRowsBetweenCheck; | ||
| private double rowBatchMemoryUsageFactor; | ||
|
|
@@ -94,6 +95,7 @@ public GobblinBaseOrcWriter(FsDataWriterBuilder<S, D> builder, State properties) | |
| this.inputSchema = builder.getSchema(); | ||
| this.typeDescription = getOrcSchema(); | ||
| this.selfTuningWriter = properties.getPropAsBoolean(GobblinOrcWriterConfigs.ORC_WRITER_AUTO_SELFTUNE_ENABLED, false); | ||
| this.validateORCDuringCommit = properties.getPropAsBoolean(GobblinOrcWriterConfigs.ORC_WRITER_VALIDATE_FILE_DURING_COMMIT, false); | ||
| this.maxOrcBatchSize = properties.getPropAsInt(GobblinOrcWriterConfigs.ORC_WRITER_AUTO_SELFTUNE_MAX_BATCH_SIZE, | ||
| GobblinOrcWriterConfigs.DEFAULT_MAX_ORC_WRITER_BATCH_SIZE); | ||
| this.batchSize = this.selfTuningWriter ? | ||
|
|
@@ -259,6 +261,15 @@ public void commit() | |
| throws IOException { | ||
| closeInternal(); | ||
| super.commit(); | ||
|
Contributor
This line calls I also wonder why this is part of the commit step and not part of the close step. close does not call this method, but it does do the flush. If we close and the flushed file turns out not to be valid, we will miss the validation here.
Contributor
Author
I was thinking something might go wrong during the move. But since the issue is malformed files, the problem should already be present once the writer is closed, so I moved the logic to right after closeInternal() is called.
||
| if(this.validateORCDuringCommit) { | ||
|
Contributor
This issue is still present. We want to move this to the close function, not just commit, because we flush the buffer there too, and if the file is malformed we want to catch it.
Contributor
+1. I think the current issue is caused during the close sequence, so we need to destroy the file.
Contributor
I'm curious what the validation is and how it works. Does it validate the header or something else?
Contributor
It does all sorts of validations. I attached one example below. You can dig around the class for a bunch of others.
Contributor
To be clear, the current issues we see come from writing a bad ORC file and then moving it to the taskoutput directory, where the file is effectively committed. We do NOT want to modify the behavior of the base data publisher, because it's such a widely used class with very wide implications. The current behavior of the base data publisher is to read all the files in the output dir and use runners to move them all in parallel. It has nothing to do with who originally wrote the file; it blindly moves all of them at that point. The base data publisher is not a good place to do validation either, because it does not care about the data being moved; it is agnostic to data formats.
Contributor
In case you would like to read more about how it's done at a directory level
||
| try { | ||
| OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf)); | ||
|
Contributor
For future readers, I think the observation below is nuanced and worth spelling out. It may even be worth a code comment. This will work for files of size [1,3] bytes. It will not catch empty files, which I think is a very subtle point, and I think that's okay as long as users are using native ORC readers. Since it seems to be part of the standard, compute engines like Trino support it: https://trino.io/blog/2019/05/29/improved-hive-bucketing.html#whats-the-problem. For some time Presto did not support these empty files, but it now does, following the Hive convention.
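The size-based behavior described in this comment can be sketched without any ORC dependency. The class below is a hypothetical illustration: `OrcLengthCheck`, `Verdict`, and `classify` are names invented here, not part of ORC or Gobblin. It assumes, as the comment states, that a zero-length file is accepted as an empty ORC file while a file of 1 to 3 bytes is rejected.

```java
/** Hypothetical helper for illustration only; not part of ORC or Gobblin. */
final class OrcLengthCheck {
    /** ORC files begin with the 3-byte magic string "ORC". */
    static final int MAGIC_LENGTH = "ORC".length();

    enum Verdict { ACCEPTED_AS_EMPTY, REJECTED_TOO_SHORT, NEEDS_FULL_PARSE }

    /**
     * Classifies a file purely by its length, mirroring the behavior the
     * review comment describes (an assumption, not the reader's actual code).
     */
    static Verdict classify(long fileLength) {
        if (fileLength < 0) {
            throw new IllegalArgumentException("negative length: " + fileLength);
        }
        if (fileLength == 0) {
            // Empty files slip through: they are treated as valid, empty ORC files.
            return Verdict.ACCEPTED_AS_EMPTY;
        }
        if (fileLength <= MAGIC_LENGTH) {
            // Too short to hold anything beyond the magic header.
            return Verdict.REJECTED_TOO_SHORT;
        }
        // Anything longer has to be parsed to decide whether it is well formed.
        return Verdict.NEEDS_FULL_PARSE;
    }
}
```

If empty files ever need to be rejected as well, an explicit length check before opening the reader would be one way to do it.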
||
| } catch (IOException ioException) { | ||
|
Contributor
From reading the ReaderImpl, it can throw 2 checked exceptions,
Both of which extend IOException. My question to you is what if there's any other runtime exception? Should we still delete? I lean toward yes. But maybe I am missing some edge case here.
Contributor
Author
Yes, we should also delete on any other runtime (unchecked) exception, for the sake of robustness. Changed to catch the generic Exception.
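A minimal sketch of the pattern agreed on here: validate, and on any exception (checked or unchecked) clean up before rethrowing. `CommitValidation`, `Validator`, and `validateOrCleanUp` are hypothetical names used only for illustration. In the PR itself the validation step is the `OrcFile.createReader` call and the cleanup is the file delete.

```java
/** Hypothetical sketch; names are illustrative and not taken from Gobblin. */
final class CommitValidation {
    /** Stand-in for whatever opens and checks the written file. */
    @FunctionalInterface
    interface Validator {
        void validate() throws Exception;
    }

    /**
     * Runs the validator; if it fails for any reason, runs the cleanup
     * and then rethrows the original failure so the caller does not commit.
     */
    static void validateOrCleanUp(Validator validator, Runnable cleanup) throws Exception {
        try {
            validator.validate();
        } catch (Exception e) {
            // Catching Exception rather than IOException covers unchecked failures too.
            try {
                cleanup.run();
            } catch (RuntimeException cleanupFailure) {
                // Keep the validation failure as the primary error.
                e.addSuppressed(cleanupFailure);
            }
            throw e;
        }
    }
}
```

Rethrowing the original exception keeps the property the reviewers care about: a failed validation still stops the commit, whether or not the cleanup succeeded.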
||
| log.error("Found error when validating ORC file {} during commit phase", this.outputFile, ioException); | ||
| log.error("Delete the malformed ORC file is successful: {}", this.fs.delete(this.outputFile, false)); | ||
|
Contributor
Given the severity of failing to delete this ORC file, do you think we should retry this operation? Check for references to retryer in the code base for an easy out-of-the-box implementation.
Contributor
And is there a world where this operation fails because of
Contributor
I also notice that the parent class
Contributor
Author
Using HadoopUtils.deletePath to delete the file now.
Contributor
Still not sure about the retries. If the fs delete fails, we won't delete the file but also won't retry. This works when we call commit because we throw the IO exception to prevent the file from being moved. But we do not do this when we close the file in the close function, which calls If we flush the buffer, we should check after that the file is valid
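The retry concern raised in this thread could be addressed along the following lines. This is a plain-Java stand-in, not the retryer the reviewer refers to. `RetryingDelete`, `DeleteAction`, and `deleteWithRetries` are hypothetical names, and the attempt count and backoff are arbitrary parameters chosen for the sketch.

```java
import java.io.IOException;

/** Hypothetical sketch of a bounded retry around a delete call. */
final class RetryingDelete {
    /** Stand-in for a filesystem delete that may fail or return false. */
    @FunctionalInterface
    interface DeleteAction {
        boolean delete() throws IOException;
    }

    /**
     * Tries the delete up to maxAttempts times, sleeping backoffMillis
     * between attempts. Returns true as soon as one attempt succeeds.
     */
    static boolean deleteWithRetries(DeleteAction action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (action.delete()) {
                    return true;
                }
            } catch (IOException e) {
                // Treated as a transient failure; fall through and retry.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt and give up early.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Even with retries, the caller would still need to fail the commit or close when this returns `false`, so that a malformed file that could not be deleted is never moved.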
||
| throw ioException; | ||
| } | ||
| } | ||
| if (this.selfTuningWriter) { | ||
| properties.setProp(GobblinOrcWriterConfigs.RuntimeStateConfigs.ORC_WRITER_ESTIMATED_RECORD_SIZE, String.valueOf(getEstimatedRecordSizeBytes())); | ||
| properties.setProp(GobblinOrcWriterConfigs.RuntimeStateConfigs.ORC_WRITER_ESTIMATED_BYTES_ALLOCATED_CONVERTER_MEMORY, | ||
|
|
||
Any concern about not defaulting to true?
I feel the validation should be on by default, unless I'm missing something obvious.
There are concerns about an extra HDFS call and what that would do to HDFS load. Internally we will enable it everywhere, but we wouldn't want anyone to accidentally start seeing increased load, so we usually keep things disabled by default for backward compatibility.
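The opt-in default discussed here amounts to reading a boolean property that falls back to `false`. A minimal sketch using `java.util.Properties` instead of Gobblin's `State` follows. The key string is a made-up placeholder, since the actual value behind `GobblinOrcWriterConfigs.ORC_WRITER_VALIDATE_FILE_DURING_COMMIT` is not shown in this diff.

```java
import java.util.Properties;

/** Hypothetical sketch of an opt-in flag that is disabled unless set. */
final class ValidationFlag {
    // Placeholder key for illustration; not the real Gobblin config key.
    static final String VALIDATE_KEY = "example.orc.writer.validate.during.commit";

    /** Returns true only when the property is explicitly set to "true". */
    static boolean isValidationEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(VALIDATE_KEY, "false"));
    }
}
```

With this shape, existing jobs that never set the property keep their current behavior and incur no extra filesystem call.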