feat: Add TEXTFILE custom serde parameters support #27167
aditi-pandit merged 1 commit into prestodb:master
Conversation
Reviewer's Guide

Adds Hive TEXTFILE SerDe table properties for field, escape, collection, and map-key delimiters; wires them through HiveMetadata as SerDe parameters instead of regular table parameters; and adds native and Hive integration tests (read/write) that exercise custom TEXTFILE delimiters, including some refactoring of existing TEXTFILE tests.

Sequence diagram for creating a TEXTFILE table with custom SerDe delimiters:

```mermaid
sequenceDiagram
    actor User
    participant PrestoClient
    participant HiveMetadata
    participant HiveMetastore
    User->>PrestoClient: CREATE TABLE ... WITH (textfile_field_delim = ',')
    PrestoClient->>HiveMetadata: createTable(SchemaTableName, ConnectorTableMetadata)
    HiveMetadata->>HiveMetadata: getEmptyTableProperties(hiveStorageFormat, tableMetadata, tableProperties)
    HiveMetadata->>HiveMetadata: getSingleCharacterProperty(tableProperties, TEXTFILE_FIELD_DELIM)
    HiveMetadata->>HiveMetadata: tableProperties.put(TEXTFILE_FIELD_DELIM_KEY, value)
    HiveMetadata->>HiveMetadata: extractSerdeParameters(additionalTableParameters)
    HiveMetadata->>HiveMetadata: filter out TEXTFILE_SERDE_KEYS from additionalTableParameters
    HiveMetadata->>HiveMetastore: createTable(Table with parameters without TEXTFILE_SERDE_KEYS, serdeParameters containing TEXTFILE_SERDE_KEYS)
    HiveMetastore-->>HiveMetadata: createTable result
    HiveMetadata-->>PrestoClient: table created
    PrestoClient-->>User: success
```
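The write-path split shown above (TEXTFILE serde keys go to the storage descriptor's SerDe parameters, everything else stays in the regular table parameters) can be sketched as follows. This is an illustrative sketch only: the method and key names mirror the PR discussion, and the literal serde key strings (`"field.delim"`, etc.) are assumptions, not copied from the Presto source.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the serde-parameter split from the create-table sequence diagram.
// Key literals and helper names are illustrative assumptions.
class SerdeParamsSketch {
    // Assumed TEXTFILE serde keys; the real constants live in HiveMetadata.
    static final Set<String> TEXTFILE_SERDE_KEYS = Set.of(
            "field.delim", "escape.delim", "collection.delim", "mapkey.delim");

    // Keep only the entries that belong in the SerDe parameters.
    static Map<String, String> extractSerdeParameters(Map<String, String> tableParameters) {
        return tableParameters.entrySet().stream()
                .filter(e -> TEXTFILE_SERDE_KEYS.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Regular table parameters are everything that is not a serde key.
    static Map<String, String> extractTableParameters(Map<String, String> tableParameters) {
        return tableParameters.entrySet().stream()
                .filter(e -> !TEXTFILE_SERDE_KEYS.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> params = Map.of("field.delim", "|", "presto_query_id", "q1");
        System.out.println(extractSerdeParameters(params)); // {field.delim=|}
        System.out.println(extractTableParameters(params)); // {presto_query_id=q1}
    }
}
```

Splitting once and routing both halves keeps a single source of truth, which is also what review Comment 2 below asks for.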
Sequence diagram for reading TEXTFILE SerDe delimiters into table properties:

```mermaid
sequenceDiagram
    participant PrestoClient
    participant HiveMetadata
    participant HiveMetastore
    PrestoClient->>HiveMetadata: getTableMetadata(SchemaTableName)
    HiveMetadata->>HiveMetastore: getTable(SchemaTableName)
    HiveMetastore-->>HiveMetadata: Table with serdeParameters
    HiveMetadata->>HiveMetadata: getSerdeProperty(table, TEXTFILE_FIELD_DELIM_KEY)
    HiveMetadata->>HiveMetadata: properties.put(TEXTFILE_FIELD_DELIM, value)
    HiveMetadata->>HiveMetadata: getSerdeProperty(table, TEXTFILE_ESCAPE_DELIM_KEY)
    HiveMetadata->>HiveMetadata: properties.put(TEXTFILE_ESCAPE_DELIM, value)
    HiveMetadata->>HiveMetadata: getSerdeProperty(table, TEXTFILE_COLLECTION_DELIM_KEY)
    HiveMetadata->>HiveMetadata: properties.put(TEXTFILE_COLLECTION_DELIM, value)
    HiveMetadata->>HiveMetadata: getSerdeProperty(table, TEXTFILE_MAPKEY_DELIM_KEY)
    HiveMetadata->>HiveMetadata: properties.put(TEXTFILE_MAPKEY_DELIM, value)
    HiveMetadata-->>PrestoClient: ConnectorTableMetadata with TEXTFILE properties
```
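The read path above maps each serde key stored in the metastore back to a connector table property name so the values round-trip (this is also the fix requested in review Comment 1 below). A minimal sketch, assuming the serde key literals and the `textfile_*` property names from the PR discussion; helper names are placeholders, not the exact Presto source:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of mapping metastore serde keys back to connector property names.
class TextfileReadSketch {
    // serde key in the metastore -> connector table property name (assumed literals)
    static final Map<String, String> SERDE_KEY_TO_PROPERTY = Map.of(
            "field.delim", "textfile_field_delim",
            "escape.delim", "textfile_escape_delim",
            "collection.delim", "textfile_collection_delim",
            "mapkey.delim", "textfile_mapkey_delim");

    static Optional<String> getSerdeProperty(Map<String, String> serdeParameters, String key) {
        return Optional.ofNullable(serdeParameters.get(key));
    }

    // Only present serde parameters are surfaced, and always under the
    // connector property name, never the raw serde key.
    static Map<String, Object> toTableProperties(Map<String, String> serdeParameters) {
        Map<String, Object> properties = new LinkedHashMap<>();
        SERDE_KEY_TO_PROPERTY.forEach((serdeKey, propertyName) ->
                getSerdeProperty(serdeParameters, serdeKey)
                        .ifPresent(value -> properties.put(propertyName, value)));
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(toTableProperties(Map.of("field.delim", "|")));
        // prints {textfile_field_delim=|}
    }
}
```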
Updated class diagram for HiveMetadata and HiveTableProperties TEXTFILE SerDe support:

```mermaid
classDiagram
    class HiveTableProperties {
        <<static>>
        +String CSV_SEPARATOR
        +String CSV_QUOTE
        +String CSV_ESCAPE
        +String TEXTFILE_FIELD_DELIM
        +String TEXTFILE_MAPKEY_DELIM
        +String TEXTFILE_COLLECTION_DELIM
        +String TEXTFILE_ESCAPE_DELIM
        +HiveTableProperties(TypeManager typeManager, HiveClientConfig config)
        +static Optional~Character~ getSingleCharacterProperty(Map~String,Object~ tableProperties, String key)
    }
    class HiveMetadata {
        <<static>>
        +String CSV_QUOTE_KEY
        +String CSV_ESCAPE_KEY
        +String TEXTFILE_FIELD_DELIM_KEY
        +String TEXTFILE_ESCAPE_DELIM_KEY
        +String TEXTFILE_COLLECTION_DELIM_KEY
        +String TEXTFILE_MAPKEY_DELIM_KEY
        +Set~String~ TEXTFILE_SERDE_KEYS
        +static Map~String,String~ extractSerdeParameters(Map~String,String~ tableParameters)
    }
    HiveMetadata ..> HiveTableProperties : uses TEXTFILE_* constants
    HiveMetadata ..> HiveTableProperties : uses getSingleCharacterProperty
```
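The `getSingleCharacterProperty` helper in the class diagram validates that a delimiter property is exactly one character, since LazySimpleSerDe delimiters are single characters. A sketch of that validation, with an illustrative error message (the real implementation throws a Presto exception type):

```java
import java.util.Map;
import java.util.Optional;

// Sketch of single-character validation for delimiter table properties.
class SingleCharPropertySketch {
    static Optional<Character> getSingleCharacterProperty(Map<String, Object> tableProperties, String key) {
        Object value = tableProperties.get(key);
        if (value == null) {
            return Optional.empty(); // property not set: caller falls back to Hive defaults
        }
        String text = (String) value;
        if (text.length() != 1) {
            // Illustrative error; Presto uses its own exception type and message.
            throw new IllegalArgumentException(key + " must be a single character, got: " + text);
        }
        return Optional.of(text.charAt(0));
    }

    public static void main(String[] args) {
        Map<String, Object> props = Map.of("textfile_field_delim", "|");
        System.out.println(getSingleCharacterProperty(props, "textfile_field_delim"));
        // prints Optional[|]
    }
}
```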
Hey - I've found 4 issues, and left some high level feedback:
- In HiveMetadata.getTableMetadata() the TEXTFILE serde values are put into the properties map using the serde keys (e.g., `field.delim` via TEXTFILE_FIELD_DELIM_KEY) rather than the connector property keys defined in HiveTableProperties (e.g., `textfile_field_delim`), which will prevent these options from round-tripping correctly; consider mapping serde keys back to the HiveTableProperties.TEXTFILE_* names instead.
- In HiveMetadata.buildTableObject() you compute `serdeParameters = extractSerdeParameters(additionalTableParameters)` but then call `extractSerdeParameters` again when setting `.setSerdeParameters(...)`; you can reuse the already computed `serdeParameters` to avoid duplicate work and keep the parameters consistent.
- The `serdeParams` string in `testReadTableWithCustomSerdeTextfile` manually embeds a trailing comma and space which are coupled to the SQL format string in `createTextFileTableSql`; consider having `createTextFileTableSql` handle comma/spacing and passing in only the key-value pairs to make the API less error-prone.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In HiveMetadata.getTableMetadata() the TEXTFILE serde values are put into the properties map using the serde keys (e.g., `field.delim` via TEXTFILE_FIELD_DELIM_KEY) rather than the connector property keys defined in HiveTableProperties (e.g., `textfile_field_delim`), which will prevent these options from round‑tripping correctly; consider mapping serde keys back to the HiveTableProperties.*TEXTFILE_* names instead.
- In HiveMetadata.buildTableObject() you compute `serdeParameters = extractSerdeParameters(additionalTableParameters)` but then call `extractSerdeParameters` again when setting `.setSerdeParameters(...)`; you can reuse the already computed `serdeParameters` to avoid duplicate work and keep the parameters consistent.
- The `serdeParams` string in `testReadTableWithCustomSerdeTextfile` manually embeds a trailing comma and space which are coupled to the SQL format string in `createTextFileTableSql`; consider having `createTextFileTableSql` handle comma/spacing and passing in only the key–value pairs to make the API less error‑prone.
## Individual Comments
### Comment 1
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java:786-794` </location>
<code_context>
getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
.ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_LINE_COUNT, Integer.valueOf(skipFooterCount)));
+ // Textfile specific properties
+ getSerdeProperty(table.get(), TEXTFILE_FIELD_DELIM_KEY)
+ .ifPresent(fieldDelim -> properties.put(TEXTFILE_FIELD_DELIM_KEY, fieldDelim));
+ getSerdeProperty(table.get(), TEXTFILE_ESCAPE_DELIM_KEY)
+ .ifPresent(escapeDelim -> properties.put(TEXTFILE_ESCAPE_DELIM_KEY, escapeDelim));
+ getSerdeProperty(table.get(), TEXTFILE_COLLECTION_DELIM_KEY)
+ .ifPresent(textCollectionDelim -> properties.put(TEXTFILE_COLLECTION_DELIM_KEY, textCollectionDelim));
+ getSerdeProperty(table.get(), TEXTFILE_MAPKEY_DELIM_KEY)
+ .ifPresent(textMapKeyDelim -> properties.put(TEXTFILE_MAPKEY_DELIM_KEY, textMapKeyDelim));
+
// CSV specific property
</code_context>
<issue_to_address>
**issue (bug_risk):** Textfile properties in table metadata use serde keys instead of Presto table property keys.
For CSV and header/footer, `properties` uses the Presto table property keys (e.g., `CSV_SEPARATOR`, `SKIP_HEADER_LINE_COUNT`). The new Textfile logic instead stores values under the serde keys (e.g., `TEXTFILE_FIELD_DELIM_KEY` / `"field.delim"`) rather than the corresponding table property keys (e.g., `TEXTFILE_FIELD_DELIM` / `"textfile_field_delim"`). As a result, `HiveTableProperties.getSingleCharacterProperty` (which looks up `TEXTFILE_FIELD_DELIM`, etc.) will never see these values. Please switch the `properties.put(...)` calls to use the `HiveTableProperties.TEXTFILE_*` table property constants to keep the round‑trip consistent with the CSV handling above.
</issue_to_address>
### Comment 2
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java:1512-1521` </location>
<code_context>
}
}
+ Map<String, String> serdeParameters = extractSerdeParameters(additionalTableParameters);
+
ImmutableMap.Builder<String, String> tableParameters = ImmutableMap.<String, String>builder()
</code_context>
<issue_to_address>
**suggestion:** Avoid recomputing serdeParameters by reusing the local variable in buildTableObject.
`serdeParameters` is already derived from `additionalTableParameters` and used to filter `tableParameters`, but `setSerdeParameters` recomputes it with `extractSerdeParameters(additionalTableParameters)`. To keep the logic single‑sourced and avoid overhead or divergence, pass the existing `serdeParameters` to `.setSerdeParameters(serdeParameters)` instead.
Suggested implementation:
```java
Map<String, String> serdeParameters = extractSerdeParameters(additionalTableParameters);

ImmutableMap.Builder<String, String> tableParameters = ImmutableMap.<String, String>builder()
        .put(PRESTO_VERSION_NAME, prestoVersion)
        .put(PRESTO_QUERY_ID_NAME, queryId)
        .putAll(additionalTableParameters.entrySet().stream()
                .filter(entry -> !serdeParameters.containsKey(entry.getKey()))
                .collect(toImmutableMap(Map.Entry::getKey, Map.Entry::getValue)));
if (tableType.equals(EXTERNAL_TABLE)) {
    tableParameters.put("EXTERNAL", "TRUE");
}
```
```java
.setTableType(tableType)
.setDataColumns(columns)
.setPartitionColumns(partitionColumns.build())
.setParameters(tableParameters.build())
.setSerdeParameters(serdeParameters)
```
If there are any other locations in this file where `extractSerdeParameters(additionalTableParameters)` is called purely for passing into `setSerdeParameters(...)` within the same method where `serdeParameters` is already computed, those should similarly be replaced with `serdeParameters` to keep the logic single-sourced.
</issue_to_address>
### Comment 3
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:7131-7145` </location>
<code_context>
+ ")",
+ catalog, schema, table, new Path(tempDir.toURI().toASCIIString()));
+ try {
+ assertUpdate(createTableSql);
+
+ assertQuery(
+ format(
+ "SELECT\n" +
</code_context>
<issue_to_address>
**suggestion (testing):** Consider asserting that TEXTFILE serde properties are actually persisted as table properties/serde parameters
The existing read/write tests show that custom delimiters are honored at runtime. To fully validate the metadata aspect of this change, add an assertion that the configured TEXTFILE properties appear in table metadata (for example via `SHOW CREATE TABLE` or `information_schema.table_properties`, consistent with the rest of the suite). This confirms the serde parameters persist correctly in Hive metadata, not just during parsing/serialization.
```suggestion
try {
assertUpdate(createTableSql);
// Verify TEXTFILE serde properties are persisted as table properties in Hive metadata
assertQuery(
format(
"SELECT property_name, property_value " +
"FROM %s.information_schema.table_properties " +
"WHERE table_schema = '%s' " +
" AND table_name = '%s' " +
" AND property_name IN (" +
" 'textfile_field_delim', " +
" 'textfile_collection_delim', " +
" 'textfile_mapkey_delim', " +
" 'textfile_escape_delim') " +
"ORDER BY property_name",
catalog, schema, table),
"VALUES " +
"('textfile_collection_delim', ';'), " +
"('textfile_escape_delim', '\u0001'), " +
"('textfile_field_delim', '|'), " +
"('textfile_mapkey_delim', ':')");
assertQuery(
format(
"SELECT\n" +
"c1, c2, c3, c4, c5, \n" +
"element_at(c6, 'size'), element_at(c6, 'color'), \n" +
"c7.s_arr, element_at(c7.s_map, 10), element_at(c7.s_map, 20) FROM %s.%s.%s", catalog, schema, table),
"VALUES(" +
"1001, 'he|llo', true, 88.5, \n" +
"ARRAY['alpha', 'beta', 'gamma'], \n" +
"'large', 'blue', \n" +
"ARRAY[CAST(1.1 AS REAL), CAST(2.2 AS REAL), CAST(3.3 AS REAL)], 'foo', 'bar')");
}
```
</issue_to_address>
### Comment 4
<location> `presto-docs/src/main/sphinx/connector/hive.rst:315-316` </location>
<code_context>
+======================================================== ============================================================================== =============================
+
+.. note::
+Theses properties are mapped to the corresponding properties in Hive ``LazySerDeParameters`` during serialization and
+follow the same behaviors with ``LazySimpleSerDe``.
+If they are not defined, the Hive defaults are used, which are typically ``\001`` for field delimiter, ``\002`` for
</code_context>
<issue_to_address>
**issue (typo):** Fix typo in "Theses properties" to "These properties".
```suggestion
.. note::
These properties are mapped to the corresponding properties in Hive ``LazySerDeParameters`` during serialization and
```
</issue_to_address>
presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java
Force-pushed b6e74fa to 5d044c1
steveburnett
left a comment
LGTM! (docs)
Pull branch, local doc build, looks good. Thanks!
aditi-pandit
left a comment
Thanks @xin-zhang2 for this code.
...ecution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java
@tdcmeehan : We need this for the loading phase of the TPC-DS benchmark. Please can you take a look at the Hive changes.
e110873
@xin-zhang2 : Let's also add the csv textfile usage in https://github.com/prestodb/presto/blob/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests/TestTextReaderWithTpcdsQueriesUsingThrift.java as this is required for the TPC-DS benchmark.
@aditi-pandit
Also, csv here is essentially treated as a special case of textfile with custom delimiters, and we have tested that with
@tdcmeehan @steveburnett
steveburnett
left a comment
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
@xin-zhang2 : I was looking for a way to get an easy test case to reproduce if there were issues in that usage. We do expect the folks running the benchmarks to use it often. Though we can do that as a follow-up. Don't want to block this PR.
Description
Add serde parameters support for the Textfile format for both Presto and Prestissimo. The parameters include:
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
Summary by Sourcery
Add Hive TEXTFILE SerDe table properties and wire them through Hive metadata so custom delimiters can be used for TEXTFILE tables, with coverage in both Hive and native execution tests.
New Features:
Enhancements:
Tests: