-
Notifications
You must be signed in to change notification settings - Fork 749
[Hotfix][GOBBLIN-1949] add option to detect malformed orc during commit #3818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
7ac1f4c
a0ba3b3
0ebce60
b4dee97
c0b4518
0adc622
7be8b80
662ffbb
24c58bc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,6 +24,7 @@ | |
| import java.util.Queue; | ||
| import java.util.concurrent.atomic.AtomicInteger; | ||
|
|
||
| import org.apache.gobblin.util.HadoopUtils; | ||
| import org.apache.hadoop.conf.Configuration; | ||
| import org.apache.hadoop.hive.serde2.SerDeException; | ||
| import org.apache.orc.OrcConf; | ||
|
|
@@ -61,6 +62,7 @@ public abstract class GobblinBaseOrcWriter<S, D> extends FsDataWriter<D> { | |
| protected int batchSize; | ||
| protected final S inputSchema; | ||
|
|
||
| private final boolean validateORCDuringCommit; | ||
| private final boolean selfTuningWriter; | ||
| private int selfTuneRowsBetweenCheck; | ||
| private double rowBatchMemoryUsageFactor; | ||
|
|
@@ -94,6 +96,7 @@ public GobblinBaseOrcWriter(FsDataWriterBuilder<S, D> builder, State properties) | |
| this.inputSchema = builder.getSchema(); | ||
| this.typeDescription = getOrcSchema(); | ||
| this.selfTuningWriter = properties.getPropAsBoolean(GobblinOrcWriterConfigs.ORC_WRITER_AUTO_SELFTUNE_ENABLED, false); | ||
| this.validateORCDuringCommit = properties.getPropAsBoolean(GobblinOrcWriterConfigs.ORC_WRITER_VALIDATE_FILE_DURING_COMMIT, false); | ||
| this.maxOrcBatchSize = properties.getPropAsInt(GobblinOrcWriterConfigs.ORC_WRITER_AUTO_SELFTUNE_MAX_BATCH_SIZE, | ||
| GobblinOrcWriterConfigs.DEFAULT_MAX_ORC_WRITER_BATCH_SIZE); | ||
| this.batchSize = this.selfTuningWriter ? | ||
|
|
@@ -258,7 +261,18 @@ public void close() | |
| public void commit() | ||
| throws IOException { | ||
| closeInternal(); | ||
| if(this.validateORCDuringCommit) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This issue is still present. We want to move this to close function and not just commit because we flush the buffer there too and if it's malformed then we want to catch it.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 I think the current issue is caused during the close sequence, so we need to destroy the file
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Im curious whats the validation and how it works. Does it validate on the header or something else.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It does all sorts of validations. I attached below 1 example. You can dig around the class for a bunch of others
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, the current issues we see are from writing a bad orc file and then moving it to the taskoutput directory where the file is effectively committed. We do NOT want to modify the behavior of the base data publisher because its such a widely used class with very wide implications. But the current behavior of the base data publisher is to read all the files in the output dir and use runners to move them all in parallel. It has nothing to do with who originally wrote the file, it will blindly move all of them at that point. The base data publisher is not a good place to do validation either because it does not care about the data being moved, it's agnostic to data formats.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In case you would like to read more about how it's done at a directory level |
||
| try { | ||
| OrcFile.createReader(this.stagingFile, new OrcFile.ReaderOptions(conf)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment that this is increasing HDFS load because we open again.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also add reader to the try with statement so that it is always closed |
||
| } catch (Exception e) { | ||
| log.error("Found error when validating ORC file during commit phase", e); | ||
| HadoopUtils.deletePath(this.fs, this.stagingFile, false); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to do this validation here or if its being published at a folder level at a different step, is when we have the reader validate all the files going into publish? My concern is that even if we are validating and deleting files at the writer step, this still assumes that this writer will shut down using a happy path and not suddenly. If it shuts down due to some container error code e.g. 143 and not go through the cleanup process then it can still be moved to another folder later down the line? |
||
| log.error("Delete the malformed ORC file after close the writer: {}", this.stagingFile); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't this an 'info' log? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also make the statement past tense.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is more an error situation because ideally it shouldn't happen, so we need to delete the file and terminate the commit/ingestion There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Line 269 is error message, agreed.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, it's an info as we delete the file. But we can log the message at the ERROR level in case user disabled for info level, as this is critical information we'd like it to speak loud. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if thats the case you can modify the error message to be like at line 269. "Found error when validating ORC file during commit phase. Deleting the malformed ORC file and closing the writer." The tense of the statement is important in logs to understand whats being done and what is already done. |
||
| throw e; | ||
| } | ||
| } | ||
| super.commit(); | ||
|
|
||
| if (this.selfTuningWriter) { | ||
| properties.setProp(GobblinOrcWriterConfigs.RuntimeStateConfigs.ORC_WRITER_ESTIMATED_RECORD_SIZE, String.valueOf(getEstimatedRecordSizeBytes())); | ||
| properties.setProp(GobblinOrcWriterConfigs.RuntimeStateConfigs.ORC_WRITER_ESTIMATED_BYTES_ALLOCATED_CONVERTER_MEMORY, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any concern not defaulting to True ?
I feel the validation should be "default". unless i miss something obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are concerns about an extra HDFS call, and what that would do to HDFS load. Internally we will enable it everywhere but we wouldn't want anyone to accidentally start having increased load, so usually we keep things disabled by default for backward compatibility