feat(plugin-hive): Add support for skip_header_line_count and skip_footer_line_count table properties (#26446)
Conversation
Reviewer's Guide

This PR implements skip_header_line_count and skip_footer_line_count support in the Hive connector by adding metadata definitions and validation, extending the write pipeline to emit header lines for CSV/text formats, and enriching integration tests to cover expected behaviors and error cases.

Sequence diagram for CSV/text file write with header line support:

    sequenceDiagram
        participant "RecordFileWriter"
        participant "HiveWriteUtils"
        participant "TextCSVHeaderWriter"
        participant "OutputStream"
        "RecordFileWriter"->>"HiveWriteUtils": createRecordWriter(..., textHeaderWriter)
        "HiveWriteUtils"->>"TextCSVHeaderWriter": write(compressedOutput, rowSeparatorByte)
        "TextCSVHeaderWriter"->>"OutputStream": write header line
        "HiveWriteUtils"->>"OutputStream": write data rows
Entity relationship diagram for new table properties:

    erDiagram
        HIVE_TABLE_PROPERTIES {
            INTEGER skip_header_line_count
            INTEGER skip_footer_line_count
        }
        HIVE_TABLE_PROPERTIES ||--o| HIVE_METADATA : "used in"
        HIVE_METADATA {
            STRING skip_header_line_count
            STRING skip_footer_line_count
        }
Class diagram for new and updated Hive connector classes:

    classDiagram
        class HiveTableProperties {
            +CSV_SEPARATOR : String
            +CSV_QUOTE : String
            +CSV_ESCAPE : String
            +SKIP_HEADER_LINE_COUNT : String
            +SKIP_FOOTER_LINE_COUNT : String
            +getHeaderSkipCount(Map<String, Object>) : Optional<Integer>
            +getFooterSkipCount(Map<String, Object>) : Optional<Integer>
        }
        class HiveMetadata {
            +SKIP_HEADER_COUNT_KEY : String
            +SKIP_FOOTER_COUNT_KEY : String
            +getTableMetadata(...)
            +getEmptyTableProperties(...)
            +beginCreateTable(...)
            +beginInsertInternal(...)
            +checkFormatForProperty(...)
        }
        class HiveWriteUtils {
            +createRecordWriter(...)
            +createTextCsvFileWriter(...)
        }
        class TextCSVHeaderWriter {
            +serializer : Serializer
            +headerType : Type
            +fileColumnNames : List<String>
            +write(OutputStream, int)
        }
        class RecordFileWriter {
            +recordWriter : RecordWriter
        }
        HiveTableProperties --> HiveMetadata : used by
        HiveWriteUtils --> TextCSVHeaderWriter : uses
        RecordFileWriter --> HiveWriteUtils : uses
        RecordFileWriter --> TextCSVHeaderWriter : uses
Hey there - I've reviewed your changes - here's some feedback:
- There are now two sets of constants for header/footer properties (Serde keys vs table metadata keys) that differ only by a dot vs underscore; consider consolidating these into a single source of truth to avoid naming mismatches.
- The delimiter parsing and header‐writing logic in createTextCsvFileWriter is fairly dense—extracting that into a small, well-tested utility method would improve readability and make edge‐case handling (e.g. multi‐char delimiters) easier to verify.
- The new tests for skip_header_line_count and skip_footer_line_count repeat very similar patterns for different formats; parameterizing or data-driven testing could reduce boilerplate and help ensure consistent coverage.
## Individual Comments
### Comment 1
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:243-177` </location>
<code_context>
+ {
+ String rowSeparatorString = properties.getProperty(serdeConstants.LINE_DELIM, "\n");
+
+ int rowSeparatorByte;
+ try {
+ rowSeparatorByte = Byte.parseByte(rowSeparatorString);
+ }
</code_context>
<issue_to_address>
**issue:** Row separator parsing may not handle multi-byte delimiters correctly.
The logic only uses the first character of the delimiter, so multi-byte values like "\r\n" will not be handled as intended. Please validate or restrict input to single-byte delimiters, or document this limitation.
</issue_to_address>
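A minimal sketch of the single-byte restriction this comment asks for. The helper name is illustrative and the `"line.delim"` literal stands in for `serdeConstants.LINE_DELIM`; this is not the connector's actual code, just one way to reject multi-character delimiters instead of silently truncating them:

```java
import java.util.Properties;

public class RowSeparatorParser
{
    // Accept either a numeric byte value ("10") or a single literal
    // character ("|"). Multi-character delimiters such as "\r\n" are
    // rejected explicitly rather than truncated to their first byte.
    public static byte parseSingleByteSeparator(Properties properties)
    {
        String value = properties.getProperty("line.delim", "\n");
        try {
            return Byte.parseByte(value);
        }
        catch (NumberFormatException ignored) {
            // not a numeric byte value; treat it as a literal character below
        }
        if (value.length() != 1) {
            throw new IllegalArgumentException("Multi-character line delimiter is not supported: " + value);
        }
        return (byte) value.charAt(0);
    }
}
```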
### Comment 2
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:255-257` </location>
<code_context>
+ OutputStream compressedOutput = createCompressedStream(conf, output, compress);
+ TextRecordWriter writer = new TextRecordWriter();
+ writer.initialize(compressedOutput, conf);
+ Optional<String> skipHeaderLine = Optional.ofNullable(properties.getProperty("skip.header.line.count"));
+ if (skipHeaderLine.isPresent()) {
+ if (parseInt(skipHeaderLine.get()) == 1) {
+ textCSVHeaderWriter
+ .orElseThrow(() -> new IllegalArgumentException("TextHeaderWriter must not be empty when skip.header.line.count is set to 1"))
</code_context>
<issue_to_address>
**issue (bug_risk):** Direct use of parseInt on property value may throw NumberFormatException.
Validate that the property value is a valid integer before calling parseInt, or catch NumberFormatException to handle invalid input and provide a clear error message.
</issue_to_address>
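One hedged way to address this comment: validate the serde property up front and wrap the failure in a clear message. The class and method names are illustrative, not the connector's API:

```java
import java.util.OptionalInt;
import java.util.Properties;

public class SkipHeaderProperty
{
    // Parse skip.header.line.count defensively: absent -> empty,
    // non-numeric or negative -> a descriptive exception instead of
    // a bare NumberFormatException from parseInt.
    public static OptionalInt getSkipHeaderLineCount(Properties properties)
    {
        String value = properties.getProperty("skip.header.line.count");
        if (value == null) {
            return OptionalInt.empty();
        }
        int count;
        try {
            count = Integer.parseInt(value.trim());
        }
        catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid value for skip.header.line.count: " + value, e);
        }
        if (count < 0) {
            throw new IllegalArgumentException("skip.header.line.count must be non-negative: " + count);
        }
        return OptionalInt.of(count);
    }
}
```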
### Comment 3
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:284-290` </location>
<code_context>
+ public void close(boolean abort)
+ throws IOException
+ {
+ writer.close();
+ if (!abort) {
+ length = target.getFileSystem(conf).getFileStatus(target).getLen();
+ }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** File length is only updated on non-abort close, which may lead to inaccurate byte count if abort is called.
After an abort, getWrittenBytes() may return a stale value. To prevent this, reset or invalidate the length field when abort is true.
```suggestion
public void close(boolean abort)
throws IOException
{
writer.close();
if (!abort) {
length = target.getFileSystem(conf).getFileStatus(target).getLen();
} else {
length = -1; // Invalidate length on abort to prevent stale value
}
```
</issue_to_address>
### Comment 4
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java:764-765` </location>
<code_context>
}
+ // Textfile and CSV specific properties
+ getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
+ .ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_LINE_COUNT, Integer.valueOf(skipHeaderCount)));
+ getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
+ .ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_LINE_COUNT, Integer.valueOf(skipFooterCount)));
</code_context>
<issue_to_address>
**suggestion:** Inconsistent use of property key types for skip header/footer count.
Standardize the use of property keys for header/footer counts to prevent confusion and potential errors.
Suggested implementation:
```java
getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
.ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_COUNT_KEY, Integer.valueOf(skipHeaderCount)));
getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
.ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_COUNT_KEY, Integer.valueOf(skipFooterCount)));
```
If `SKIP_HEADER_LINE_COUNT` and `SKIP_FOOTER_LINE_COUNT` are used elsewhere in the codebase, you should refactor those usages to use `SKIP_HEADER_COUNT_KEY` and `SKIP_FOOTER_COUNT_KEY` for consistency. Also, ensure that any documentation or comments referring to these keys are updated accordingly.
</issue_to_address>
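A sketch of the "single source of truth" idea from this comment: derive the underscored table property name from the dotted serde key so the two spellings cannot drift apart. Constant names mirror the ones in the diff but the class itself is illustrative:

```java
public final class SkipLineProperties
{
    // Serde keys (dotted) are the canonical form; the table property
    // names (underscored) are derived mechanically from them.
    public static final String SKIP_HEADER_COUNT_KEY = "skip.header.line.count";
    public static final String SKIP_FOOTER_COUNT_KEY = "skip.footer.line.count";
    public static final String SKIP_HEADER_LINE_COUNT = serdeKeyToPropertyName(SKIP_HEADER_COUNT_KEY);
    public static final String SKIP_FOOTER_LINE_COUNT = serdeKeyToPropertyName(SKIP_FOOTER_COUNT_KEY);

    private SkipLineProperties() {}

    static String serdeKeyToPropertyName(String serdeKey)
    {
        return serdeKey.replace('.', '_');
    }
}
```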
### Comment 5
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6862-292` </location>
<code_context>
+ private void testCreateTableWithHeaderAndFooter(String format)
</code_context>
<issue_to_address>
**suggestion (testing):** Missing tests for skip_header_line_count and skip_footer_line_count with value zero and with values greater than supported.
Add tests for skip_header_line_count and skip_footer_line_count set to zero and values above the supported range to confirm correct validation behavior.
Suggested implementation:
```java
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header (col1 varchar) WITH (format = 'CSV', skip_header_line_count = -1)"))
.hasMessageMatching("Invalid value for skip_header_line_count property: -1");
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = -1)"))
.hasMessageMatching("Invalid value for skip_footer_line_count property: -1");
// Test skip_header_line_count = 0 (boundary value)
assertUpdate("CREATE TABLE test_skip_header_zero (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 0)");
// Test skip_footer_line_count = 0 (boundary value)
assertUpdate("CREATE TABLE test_skip_footer_zero (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 0)");
// Test skip_header_line_count above supported range
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header_high (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 1000000)"))
.hasMessageMatching("Invalid value for skip_header_line_count property: 1000000");
// Test skip_footer_line_count above supported range
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer_high (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 1000000)"))
.hasMessageMatching("Invalid value for skip_footer_line_count property: 1000000");
}
```
If the supported range for skip_header_line_count and skip_footer_line_count is different, adjust the value `1000000` to the correct upper bound + 1 for your codebase.
If zero is not a valid value, change `assertUpdate` to `assertThatThrownBy` and check for the correct error message.
</issue_to_address>
### Comment 6
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6916-6924` </location>
<code_context>
+ ") AS SELECT CAST(1 AS VARCHAR) AS col_name1, CAST(2 AS VARCHAR) as col_name2",
+ catalog, schema, name, format);
+
+ assertUpdate(createTableSql, 1);
+ assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);
+ MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
+ assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
</code_context>
<issue_to_address>
**suggestion (testing):** Test for header/footer line skipping does not verify actual data skipping behavior.
Please add tests that confirm header/footer lines are skipped by inserting such lines and verifying that only valid data rows are returned.
```suggestion
assertUpdate(createTableSql, 1);
// Insert a row that simulates a header line (should be skipped)
assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('header1', 'header2')", catalog, schema, format), 1);
// Insert valid data row
assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);
// Query the table and verify that the header line is skipped and only valid data rows are returned
MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
.row("1", "2")
.row("3", "4")
.build()
.getMaterializedRows());
// Additional check: ensure header line is not present in results
        for (MaterializedRow row : materializedRows.getMaterializedRows()) {
            assertNotEquals(row.getFields(), Arrays.asList("header1", "header2"), "Header line should be skipped");
        }
assertUpdate(format("DROP TABLE %s_table_skip_header", format));
```
</issue_to_address>
Force-pushed from c77d2a9 to 4603e95 (compare)
steveburnett left a comment:
Thanks for the doc! A couple of formatting nits, and a question.
Force-pushed from 6cfa52e to b6d6578 (compare)
imjalpreet left a comment:
@auden-woolfson, thank you. I have done a partial review, and will continue later.
    // Textfile and CSV specific properties
    Set<HiveStorageFormat> csvAndTextFile = ImmutableSet.of(HiveStorageFormat.TEXTFILE, HiveStorageFormat.CSV);
nit: static imports for TEXTFILE and CSV
    if (headerSkipCount > 1) {
        throw new PrestoException(NOT_SUPPORTED, format("Creating Hive table with data with value of %s property greater than 0 is not supported", SKIP_HEADER_COUNT_KEY));
    }
    });
The error message says greater than 0, but the if condition checks for greater than 1.
steveburnett left a comment:
Thanks for the doc update! Some formatting nits.
steveburnett left a comment:
LGTM! (docs)
Pull updated branch, new local doc build. Thanks!
Force-pushed from 10d522e to cd625ec (compare)
Force-pushed from cd625ec to c890440 (compare)
steveburnett left a comment:
LGTM! (docs)
Pull updated branch, new local doc build. Thanks!
Force-pushed from c890440 to bb1f3f3 (compare)
Force-pushed from b6576a4 to 5a00503 (compare)
imjalpreet left a comment:
Thanks, @auden-woolfson. I wanted to clarify a couple of things.
    getHeaderSkipCount(tableMetadata.getProperties()).ifPresent(headerSkipCount -> {
        if (headerSkipCount > 1) {
            throw new PrestoException(NOT_SUPPORTED, format("Creating Hive table with data with value of %s property greater than 1 is not supported", SKIP_HEADER_COUNT_KEY));
        }
    });

    getFooterSkipCount(tableMetadata.getProperties()).ifPresent(footerSkipCount -> {
        if (footerSkipCount > 0) {
            throw new PrestoException(NOT_SUPPORTED, format("Creating Hive table with data with value of %s property greater than 0 is not supported", SKIP_FOOTER_COUNT_KEY));
        }
    });
Just to confirm, does this mean that we can create a table (TEXTFILE or CSV) in Presto with the header property but not with the footer property?

I also feel we should update the error message to:

    throw new PrestoException(NOT_SUPPORTED, format("CREATE TABLE AS not supported when the value of %s property is greater than 1", SKIP_HEADER_COUNT_KEY))

Similar change for the footer, too.
    createTableSql = format("" +
            "CREATE TABLE %s.%s.%s_table_skip_footer (\n" +
            "   \"name\" varchar\n" +
            ")\n" +
            "WITH (\n" +
            "   format = '%s',\n" +
            "   skip_footer_line_count = 1\n" +
            ")",
            catalog, schema, name, format);
    assertUpdate(createTableSql);
    actual = computeActual(format("SHOW CREATE TABLE %s_table_skip_footer", format));
    assertEquals(actual.getOnlyValue(), createTableSql);
I also feel there is no point in allowing the creation of a table with the footer property greater than 0 if we cannot insert into it later. We can allow it if it's pointing to already written data, i.e., if external_location is also used along with the footer property.

Ok, the begin create table function verifies that there is no external location set, so should I just throw an error if skip footer is included at all?

Yes, let's throw an error if it's not an external table and the SQL includes the footer property.
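The check agreed on above could be sketched as follows. This is illustrative only: the property lookup, class name, and exception type stand in for the connector's real table-property accessors and PrestoException, though the error message matches the one used later in this thread:

```java
import java.util.Map;
import java.util.Optional;

public class FooterPropertyValidation
{
    // Reject skip_footer_line_count on managed (non-external) tables,
    // since Presto cannot write footer lines on insert. External tables
    // pointing at already-written data are allowed to keep the property.
    public static void validateFooterProperty(Map<String, Object> tableProperties)
    {
        Optional<Integer> footerSkipCount = Optional.ofNullable((Integer) tableProperties.get("skip_footer_line_count"));
        boolean isExternal = tableProperties.containsKey("external_location");
        footerSkipCount.ifPresent(count -> {
            if (count > 0 && !isExternal) {
                throw new IllegalArgumentException("Cannot create non external table with skip.footer.line.count property");
            }
        });
    }
}
```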
    createTableSql = format("" +
            "CREATE TABLE %s.%s.%s_table_skip_header_footer (\n" +
            "   \"name\" varchar\n" +
            ")\n" +
            "WITH (\n" +
            "   format = '%s',\n" +
            "   skip_footer_line_count = 1,\n" +
            "   skip_header_line_count = 1\n" +
            ")",
            catalog, schema, name, format);
    assertUpdate(createTableSql);
    actual = computeActual(format("SHOW CREATE TABLE %s_table_skip_header_footer", format));
    assertEquals(actual.getOnlyValue(), createTableSql);
Force-pushed from 5a00503 to 26e5e82 (compare)
Force-pushed from 26e5e82 to 1690f37 (compare)
imjalpreet left a comment:
Thanks, @auden-woolfson. Mostly LGTM % minor nits.
Also, please update the PR title and remove the ... at the end
    @Language("SQL") String createFooter =
            format("CREATE TABLE %s.%s.csv_table_skip_footer (\n" +
                    "   name VARCHAR\n" +
                    ")\n" +
                    "WITH (\n" +
                    "   format = 'CSV',\n" +
                    "   skip_footer_line_count = 1\n" +
                    ")",
                    catalog, schema);

    assertThatThrownBy(() -> assertUpdate(createFooter))
            .hasMessageContaining("Cannot create non external table with skip.footer.line.count property");

    @Language("SQL") String createHeaderFooter =
            format("CREATE TABLE %s.%s.csv_table_skip_header_footer (\n" +
                    "   name VARCHAR\n" +
                    ")\n" +
                    "WITH (\n" +
                    "   format = 'CSV',\n" +
                    "   skip_footer_line_count = 1,\n" +
                    "   skip_header_line_count = 1\n" +
                    ")",
                    catalog, schema);

    assertThatThrownBy(() -> assertUpdate(createHeaderFooter))
            .hasMessageContaining("Cannot create non external table with skip.footer.line.count property");
These are not required as they are already validated in the previous test testCreateTableWithHeaderAndFooter.
Instead, you can add one success test for insert with skip_header_line_count = 1
imjalpreet left a comment:
Thanks, @auden-woolfson. LGTM.
    assertUpdate(createHeader);

    assertUpdate(format(
Can you add another assertion to validate the data after it's inserted?
Force-pushed from aa03c75 to 33311dc (compare)
imjalpreet left a comment:
Thanks, @auden-woolfson. LGTM.
Force-pushed from 33311dc to ea84b7e (compare)
Force-pushed from ea84b7e to 72076c7 (compare)
steveburnett left a comment:
LGTM! (docs)
Pull updated branch, new local doc build.