-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Allow different synthetic column name length limits in JDBC connectors #18984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
192e5b4
2062448
598ead5
8d093ec
7a76821
9d3fe7e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -560,6 +560,12 @@ public OptionalInt getMaxWriteParallelism(ConnectorSession session) | |
| return delegate.getMaxWriteParallelism(session); | ||
| } | ||
|
|
||
| @Override | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure whether to me it makes it more readable that this change is introduced in a separate commit. According to the change in 'commit splitting strategy' we were proposing with @kokosing the other day I'm trying to ask the question: If any other change from this PR has to be reverted, do I want this one to stay? To me it feels like it should be squashed with the next commit that actually makes use of that change.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it helps make the next commit's diff much cleaner - no need to pay attention to the interface change when you can just focus on the implementation details which are the ones which need to be actually reviewed. As before, there are no "hard rules, only guidelines". In this case practically people can not revert this commit without any impact since it's a no-op so it doesn't matter. |
||
| public OptionalInt getMaxColumnNameLength(ConnectorSession session) | ||
|
hashhar marked this conversation as resolved.
|
||
| { | ||
| return delegate.getMaxColumnNameLength(session); | ||
| } | ||
|
|
||
| public void onDataChanged(SchemaTableName table) | ||
| { | ||
| invalidateAllIf(statisticsCache, key -> key.mayReference(table)); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -107,8 +107,6 @@ | |
| public class DefaultJdbcMetadata | ||
| implements JdbcMetadata | ||
| { | ||
| public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30; | ||
|
|
||
| private static final String SYNTHETIC_COLUMN_NAME_PREFIX = "_pfgnrtd_"; | ||
| private static final String DELETE_ROW_ID = "_trino_artificial_column_handle_for_delete_row_id_"; | ||
| private static final String MERGE_ROW_ID = "$merge_row_id"; | ||
|
|
@@ -391,19 +389,13 @@ public Optional<AggregationApplicationResult<ConnectorTableHandle>> applyAggrega | |
| return Optional.empty(); | ||
|
hashhar marked this conversation as resolved.
|
||
| } | ||
|
|
||
| String columnName = SYNTHETIC_COLUMN_NAME_PREFIX + nextSyntheticColumnId; | ||
| JdbcColumnHandle newColumn = createSyntheticAggregationColumn(aggregate, expression.get().getJdbcTypeHandle(), nextSyntheticColumnId); | ||
| nextSyntheticColumnId++; | ||
| JdbcColumnHandle newColumn = JdbcColumnHandle.builder() | ||
| .setColumnName(columnName) | ||
| .setJdbcTypeHandle(expression.get().getJdbcTypeHandle()) | ||
| .setColumnType(aggregate.getOutputType()) | ||
| .setComment(Optional.of("synthetic")) | ||
| .build(); | ||
|
|
||
| newColumns.add(newColumn); | ||
| projections.add(new Variable(newColumn.getColumnName(), aggregate.getOutputType())); | ||
| resultAssignments.add(new Assignment(newColumn.getColumnName(), newColumn, aggregate.getOutputType())); | ||
| expressions.put(columnName, new ParameterizedExpression(expression.get().getExpression(), expression.get().getParameters())); | ||
| expressions.put(newColumn.getColumnName(), new ParameterizedExpression(expression.get().getExpression(), expression.get().getParameters())); | ||
| } | ||
|
|
||
| List<JdbcColumnHandle> newColumnsList = newColumns.build(); | ||
|
|
@@ -431,6 +423,19 @@ public Optional<AggregationApplicationResult<ConnectorTableHandle>> applyAggrega | |
| return Optional.of(new AggregationApplicationResult<>(handle, projections.build(), resultAssignments.build(), ImmutableMap.of(), precalculateStatisticsForPushdown)); | ||
| } | ||
|
|
||
| @VisibleForTesting | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So the other day I had a discussion with @kokosing about the change I wanted to do to get rid of the problem that if we introduce a dependent object to this hierarchy, we wind up with the 'hell' in plugins and backporting that we don't want to come back to. @kokosing was arguing that I should not be making that change upfront claiming the YAGNI principle. Now, to be on the 'good advisory' side, I have to ask this question, probably not to you @hashhar , but to @kokosing. Does this instance of introducing the 'smell' of '@VisibleForTesting' yet again justify making the modification we were discussing last time (replace inheritance with composition).
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see #18984 (comment) for base-jdbc we've settled on inheritance since it makes implementing new connectors based on it super duper easy.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also if i remove the "unit test" and convert to an integration test there won't be this Since the integration test that I came up with (a query with too many aggregation functions so that new aliases get generated) is very "wordy" and also prone to breaking due to planner changes I opted for the unit test + VisibleForTesting. |
||
| static JdbcColumnHandle createSyntheticAggregationColumn(AggregateFunction aggregate, JdbcTypeHandle typeHandle, int nextSyntheticColumnId) | ||
|
hashhar marked this conversation as resolved.
|
||
| { | ||
| // the new column can be max len(SYNTHETIC_COLUMN_NAME_PREFIX) + len(Integer.MAX_VALUE) = 9 + 10 = 19 characters which is small enough to be supported by all databases | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we need that comment if a variation of it is already put into the test code?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think it's useful since when reading the source people don't have the test class open side by side. |
||
| String columnName = SYNTHETIC_COLUMN_NAME_PREFIX + nextSyntheticColumnId; | ||
| return JdbcColumnHandle.builder() | ||
| .setColumnName(columnName) | ||
| .setJdbcTypeHandle(typeHandle) | ||
| .setColumnType(aggregate.getOutputType()) | ||
| .setComment(Optional.of("synthetic")) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was wondering why and 'for what' do we use the comment field. Does it help in any way that we'll put 'synthetic' word in the comment?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for synthetic columns the comment field would only appear in logs and explain plans. so it's useful to identify those in a plan. no other significance. |
||
| .build(); | ||
| } | ||
|
|
||
| @Override | ||
| public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin( | ||
| ConnectorSession session, | ||
|
|
@@ -459,15 +464,16 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin( | |
| int nextSyntheticColumnId = max(leftHandle.getNextSyntheticColumnId(), rightHandle.getNextSyntheticColumnId()); | ||
|
hashhar marked this conversation as resolved.
|
||
|
|
||
| ImmutableMap.Builder<JdbcColumnHandle, JdbcColumnHandle> newLeftColumnsBuilder = ImmutableMap.builder(); | ||
| OptionalInt maxColumnNameLength = jdbcClient.getMaxColumnNameLength(session); | ||
|
hashhar marked this conversation as resolved.
|
||
| for (JdbcColumnHandle column : jdbcClient.getColumns(session, leftHandle)) { | ||
| newLeftColumnsBuilder.put(column, createSyntheticColumn(column, nextSyntheticColumnId)); | ||
| newLeftColumnsBuilder.put(column, createSyntheticJoinProjectionColumn(column, nextSyntheticColumnId, maxColumnNameLength)); | ||
| nextSyntheticColumnId++; | ||
| } | ||
| Map<JdbcColumnHandle, JdbcColumnHandle> newLeftColumns = newLeftColumnsBuilder.buildOrThrow(); | ||
|
|
||
| ImmutableMap.Builder<JdbcColumnHandle, JdbcColumnHandle> newRightColumnsBuilder = ImmutableMap.builder(); | ||
| for (JdbcColumnHandle column : jdbcClient.getColumns(session, rightHandle)) { | ||
| newRightColumnsBuilder.put(column, createSyntheticColumn(column, nextSyntheticColumnId)); | ||
| newRightColumnsBuilder.put(column, createSyntheticJoinProjectionColumn(column, nextSyntheticColumnId, maxColumnNameLength)); | ||
| nextSyntheticColumnId++; | ||
| } | ||
| Map<JdbcColumnHandle, JdbcColumnHandle> newRightColumns = newRightColumnsBuilder.buildOrThrow(); | ||
|
|
@@ -525,20 +531,40 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin( | |
| } | ||
|
|
||
| @VisibleForTesting | ||
| static JdbcColumnHandle createSyntheticColumn(JdbcColumnHandle column, int nextSyntheticColumnId) | ||
| static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId, OptionalInt optionalMaxColumnNameLength) | ||
| { | ||
| verify(nextSyntheticColumnId >= 0, "nextSyntheticColumnId rolled over and is not monotonically increasing any more"); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was wondering, looking at the code you introduced, whether you had a chance in the past to take a look at this presentation by Stuart Marks, who is the person who 'invented' Optionals? The specific fragment of his presentation I'm referring to starts here: https://youtu.be/fBYhtvY19xA?feature=shared&t=611 Perhaps, following his advise there is a way to use
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There is |
||
|
|
||
| int sequentialNumberLength = String.valueOf(nextSyntheticColumnId).length(); | ||
| int originalColumnNameLength = DEFAULT_COLUMN_ALIAS_LENGTH - sequentialNumberLength - "_".length(); | ||
| final String separator = "_"; | ||
| if (optionalMaxColumnNameLength.isEmpty()) { | ||
| return JdbcColumnHandle.builderFrom(column) | ||
| .setColumnName(column.getColumnName() + separator + nextSyntheticColumnId) | ||
| .build(); | ||
| } | ||
|
|
||
| int maxColumnNameLength = optionalMaxColumnNameLength.getAsInt(); | ||
| int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length(); | ||
| verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we need to go that far to checking that condition? The previous bug report actually proved that for instance for Oracle, it was the DB engine complaining if the alias was too long. So the user experience was that there was a message from a DB engine. Does it improve that user experience in any way that we will send this 'crypthic' message to the user? Would he/she have a better understanding of the situation if we had that code instead of not having that code at all and relying on the DB engine error reporting?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This does not work in practice. For example for Postgres and Redshift the identifier is silently truncated with no useful error shown to the user and just a generic query failure about "ambiguous columns". Also Trino is meant to be a uniform interface over your data-sources so where we can we try to provide that uniformity. e.g. |
||
|
|
||
| if (nextSyntheticColumnIdLength == maxColumnNameLength) { | ||
| return JdbcColumnHandle.builderFrom(column) | ||
| .setColumnName(String.valueOf(nextSyntheticColumnId)) | ||
| .build(); | ||
| } | ||
|
|
||
| String columnNameTruncated = fixedLength(originalColumnNameLength) | ||
| if (nextSyntheticColumnIdLength + separator.length() == maxColumnNameLength) { | ||
| return JdbcColumnHandle.builderFrom(column) | ||
| .setColumnName(separator + nextSyntheticColumnId) | ||
| .build(); | ||
| } | ||
|
|
||
| // fixedLength only accepts values > 0, so the cases where the value would be <= 0 are handled above explicitly | ||
| String truncatedColumnName = fixedLength(maxColumnNameLength - separator.length() - nextSyntheticColumnIdLength) | ||
| .split(column.getColumnName()) | ||
| .iterator() | ||
| .next(); | ||
| String columnName = columnNameTruncated + "_" + nextSyntheticColumnId; | ||
| return JdbcColumnHandle.builderFrom(column) | ||
| .setColumnName(columnName) | ||
| .setColumnName(truncatedColumnName + separator + nextSyntheticColumnId) | ||
| .build(); | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -239,4 +239,9 @@ default Optional<TableScanRedirectApplicationResult> getTableScanRedirection(Con | |
| OptionalLong update(ConnectorSession session, JdbcTableHandle handle); | ||
|
|
||
| OptionalInt getMaxWriteParallelism(ConnectorSession session); | ||
|
|
||
| default OptionalInt getMaxColumnNameLength(ConnectorSession session) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm just thinking out loud whether this is a good place to introduce this method. Currently JdbcClient has methods like 'list tables', 'list schemas' (which is metadata in some way), but also methods like 'create', 'delete' some 'object'. It really feels like this method is not about 'another metadata', but it's a property of the database engine rather than the data itself. I started asking myself questions like, if it turns out that the 'get' method to query the database for the limit does not return a proper value and we want to kind of hard code it in a 'we know better' way, then how would we do that say for Oracle 11g that has 30 chars limit and Oracle 12 which has 128 limit. Would we be putting a buch of IF statements into this implementation? I'm not sure whether it's worth introducing a different 'type' of object to handle database restrictions, but it really feels like we are mixing domains here by introducing this change.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, we would. And note that limits aren't the only case where this happens. e.g. rename column syntax differs across MySql versions so
|
||
| { | ||
| return OptionalInt.empty(); | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for 'late notice' and sending this after all the other comments. It only occurred to me to ask this question after I detached myself from the keyboard.
I presume you are not enabling this for all the databases due to 'security reasons', in the sense you want to limit the impact of this change in case something goes wrong?
Is the target state then to have the same call to this method in every Client?
It really feels like what we should do here is detach the code deployment from the rollout (for instance via a feature toggle).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the process I followed:
at this point we know (via tests) that the ones without the override do actually support identifiers longer than 65k characters.
we also know that by enabling this path by default for all connectors would incur I/O even in connectors where there's no limit. i.e. we'll ask the DB what the limit is - it'll say no limit. We already know this to be true statically so why incur the I/O and slow down query planning and analysis.