Allow different synthetic column name length limits in JDBC connectors#18984
Conversation
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/CachingJdbcClient.java
The last commit showcases a test which fails for Postgres/SQL Server etc., which silently truncate long identifiers, so the query doesn't fail unless the truncated identifiers lead to ambiguity. i.e. basically we need a situation where
Force-pushed cb49b59 to 9d9e30b
plugin/trino-oracle/src/main/java/io/trino/plugin/oracle/OracleClient.java
Force-pushed 405ca00 to b2342e0
plugin/trino-postgresql/src/main/java/io/trino/plugin/postgresql/PostgreSqlClient.java
That's quite a significant change in terms of the number of lines that have changed. I don't think I fully understand from the context the need for that change. Can you provide a business justification in the description of this PR so that I can understand better what the acceptance criteria are?

The commit message for each commit describes why that specific commit is needed. Please take the time to read those. For the overall justification for this PR see the proposed release note - without this change queries fail for at least the Postgres, Redshift and SQL Server connectors. Also note that the first 3 commits are all refactors to make the 4th commit easier to implement.
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/JdbcModule.java
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/SyntheticColumnHandleBuilder.java
plugin/trino-oracle/src/main/java/io/trino/plugin/oracle/OracleClient.java
    @Override
    public OptionalInt getMaxColumnNameLength(ConnectorSession session)
    {
        try (Connection connection = connectionFactory.openConnection(session)) {
Looks like this is copied/pasted in a few places?
Yes. I can probably move this into trino-base-jdbc as a default impl. to be shared for drivers which have properly implemented getMaxColumnNameLength.
I'm not a big fan of sharing code via inheritance. To me it makes more sense to enclose that common code in a separate class and inject the instance via DI (possibly swapping it for something else via modules or an annotation mechanism). For me the inheritance way never worked long term. Not sure what patterns/conclusions WRT inheritance/no-inheritance exist in this project.
I'm not a big fan of sharing code via inheritance.
@dominikzalewski Me too, but (un)fortunately that was a design decision made when the framework for JDBC connectors was created: the only thing one should be extending is JdbcClient (using BaseJdbcClient for convenience), and that should be enough to implement a connector. All SPI and Trino concerns should be covered by the trino-base-jdbc framework.
Extracted as a helper to BaseJdbcClient; each connector still needs to opt in to this.
See TopNFunction and limitFunction in BaseJdbcClient as the pattern which was followed.
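Sketched roughly, that opt-in-via-override pattern might look like the following. This is an illustrative sketch only — the class names are hypothetical stand-ins, not the actual BaseJdbcClient/OracleClient code, and the driver's answer is faked rather than fetched over a real Connection:

```java
import java.util.OptionalInt;

// Illustrative stand-ins for BaseJdbcClient and a concrete connector client.
class BaseClientSketch
{
    // Shared helper: in the real code this opens a Connection and asks
    // DatabaseMetaData#getMaxColumnNameLength; here we fake the driver's answer.
    protected OptionalInt getMaxColumnNameLengthFromDatabaseMetaData()
    {
        int reported = 30; // pretend the driver reported a 30-character limit
        // per the DatabaseMetaData javadoc, 0 means the limit is unknown
        return reported == 0 ? OptionalInt.empty() : OptionalInt.of(reported);
    }

    // Default: no limit and, importantly, no I/O for connectors that don't opt in
    public OptionalInt getMaxColumnNameLength()
    {
        return OptionalInt.empty();
    }
}

class OracleLikeClientSketch
        extends BaseClientSketch
{
    @Override
    public OptionalInt getMaxColumnNameLength()
    {
        // each connector opts in explicitly by delegating to the shared helper
        return getMaxColumnNameLengthFromDatabaseMetaData();
    }
}
```

The base class stays inert by default, which is what keeps connectors without a limit from paying any I/O cost.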
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
plugin/trino-sqlserver/src/main/java/io/trino/plugin/sqlserver/SqlServerClient.java
plugin/trino-oracle/src/main/java/io/trino/plugin/oracle/OracleClient.java
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
Force-pushed b2342e0 to f3ce795
@kokosing @dominikzalewski @wendigo @nineinchnick this is ready for another round, I've addressed all comments/replied to them.
dominikzalewski
left a comment
The code looks good. The comments I'm leaving are 'take it or leave it' type of comments.
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
    @VisibleForTesting
    static JdbcColumnHandle createSyntheticAggregationColumn(AggregateFunction aggregate, JdbcTypeHandle typeHandle, int nextSyntheticColumnId)
    {
        // the new column can be max len(SYNTHETIC_COLUMN_NAME_PREFIX) + len(Integer.MAX_VALUE) = 9 + 10 = 19 characters which is small enough to be supported by all databases
do we need that comment if a variation of it is already put into the test code?
I think it's useful since when reading the source people don't have the test class open side by side.
        return Optional.of(new AggregationApplicationResult<>(handle, projections.build(), resultAssignments.build(), ImmutableMap.of(), precalculateStatisticsForPushdown));
    }

    @VisibleForTesting
So the other day I had a discussion with @kokosing about the change I wanted to make to get rid of the problem that if we introduce a dependent object into this hierarchy, we wind up with the 'hell' in plugins and backporting that we don't want to come back to. @kokosing argued that I should not make that change upfront, citing the YAGNI principle. Now, to be on the 'good advisory' side, I have to ask this question, probably not to you @hashhar but to @kokosing: does this instance of introducing the '@VisibleForTesting' smell yet again justify making the modification we discussed last time (replacing inheritance with composition)?
see #18984 (comment)
for base-jdbc we've settled on inheritance since it makes implementing new connectors based on it super duper easy.
Also, if I remove the "unit test" and convert it to an integration test, this VisibleForTesting won't be needed.
Since the integration test that I came up with (a query with so many aggregation functions that new aliases get generated) is very "wordy" and also prone to breaking due to planner changes, I opted for the unit test + VisibleForTesting.
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
                .setColumnName(columnName)
                .setJdbcTypeHandle(typeHandle)
                .setColumnType(aggregate.getOutputType())
                .setComment(Optional.of("synthetic"))
I was wondering why and 'for what' we use the comment field. Does it help in any way that we put the word 'synthetic' in the comment?
For synthetic columns the comment field only appears in logs and explain plans, so it's useful to identify those in a plan. No other significance.
    OptionalInt getMaxWriteParallelism(ConnectorSession session);

    default OptionalInt getMaxColumnNameLength(ConnectorSession session)
I'm just thinking out loud whether this is a good place to introduce this method. Currently JdbcClient has methods like 'list tables', 'list schemas' (which is metadata in some way), but also methods like 'create', 'delete' some 'object'. It really feels like this method is not about 'another metadata', but it's a property of the database engine rather than the data itself.
I started asking myself questions like: if it turns out that the 'get' method that queries the database for the limit does not return a proper value and we want to kind of hard-code it in a 'we know better' way, then how would we do that, say, for Oracle 11g which has a 30-character limit versus Oracle 12 which has a 128-character limit? Would we be putting a bunch of IF statements into this implementation?
I'm not sure whether it's worth introducing a different 'type' of object to handle database restrictions, but it really feels like we are mixing domains here by introducing this change.
Would we be putting a bunch of IF statements into this implementation?
Yes, we would. And note that limits aren't the only case where this happens. e.g. rename column syntax differs across MySql versions so MySqlClient#renameColumn has such an impl.
JdbcClient is the "API" for the database. Anything we want to do to remote database is defined here.
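As a concrete illustration of that answer, a version-gated "we know better" override might look roughly like this. This is a hypothetical sketch — the class and method names are invented for illustration, and the 30 vs. 128 character limits are the Oracle values mentioned in this thread:

```java
import java.util.OptionalInt;

// Hypothetical sketch of hard-coding per-version limits with plain IF statements,
// instead of (or as a fallback to) querying the driver at runtime.
class OracleLimitSketch
{
    static OptionalInt maxColumnNameLength(int oracleMajorVersion)
    {
        if (oracleMajorVersion <= 11) {
            // Oracle 11g and older: 30-character identifier limit
            return OptionalInt.of(30);
        }
        // newer versions discussed above: 128-character limit
        return OptionalInt.of(128);
    }
}
```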
    public OptionalInt getMaxColumnNameLength(ConnectorSession session)
    {
        if (maxColumnNameLength != null) {
            // According to JavaDoc of DatabaseMetaData#getMaxColumnNameLength a value of 0 signifies that the limit is unknown
What's the advantage of modelling it with Java's OptionalInt rather than using the technique that is already used in the quoted javadoc (i.e. leaving a plain int with 0 or negative meaning there's no limit)?
Very easy to mistake since it requires you to read the javadoc and know that 0 is special.
I went with plain int in the beginning myself but after the first few test failures I changed to the OptionalInt to clarify that empty had a special meaning.
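The difference between the two encodings can be made concrete with a small sketch. The class and method names below are illustrative, not Trino code; the conversion is the pattern described above, handling the javadoc's magic 0 once at the boundary:

```java
import java.util.OptionalInt;

class MaxLengthSketch
{
    // With a raw int, every caller must remember that 0 means "unknown".
    // Wrapping once into OptionalInt makes the special case impossible to miss.
    static OptionalInt fromDriverValue(int reportedByDriver)
    {
        // DatabaseMetaData#getMaxColumnNameLength documents 0 as "limit unknown"
        return reportedByDriver == 0 ? OptionalInt.empty() : OptionalInt.of(reportedByDriver);
    }
}
```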
            return OptionalInt.of(maxColumnNameLength);
        }

        try (Connection connection = connectionFactory.openConnection(session)) {
The 'get' method to my gut feeling would feel like a 'regular' simple getter. It would be a surprising discovery for me to learn that it actually does I/O.
plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java
        int maxColumnNameLength = optionalMaxColumnNameLength.getAsInt();
        int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length();
        verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength);
do we need to go that far in checking that condition? The previous bug report actually proved that, for instance for Oracle, it was the DB engine complaining if the alias was too long. So the user experience was that there was a message from a DB engine.
Does it improve that user experience in any way that we will send this 'cryptic' message to the user? Would he/she have a better understanding of the situation if we had that code instead of not having that code at all and relying on the DB engine error reporting?
So the user experience was that there was a message from a DB engine.
This does not work in practice. For example for Postgres and Redshift the identifier is silently truncated with no useful error shown to the user and just a generic query failure about "ambiguous columns".
Also Trino is meant to be a uniform interface over your data-sources so where we can we try to provide that uniformity. e.g. CAST('abc' AS varchar(1)) behaves differently in MySQL vs Postgres but Trino will always return same result regardless of what the underlying system is. Same goes for other user-visible aspects.
        return OptionalInt.of(getWriteParallelism(session));
    }

    protected OptionalInt getMaxColumnNameLengthFromDatabaseMetaData(ConnectorSession session)
Sorry for 'late notice' and sending this after all the other comments. It only occurred to me to ask this question after I detached myself from the keyboard.
I presume you are not enabling this for all the databases due to 'security reasons', in the sense you want to limit the impact of this change in case something goes wrong?
Is the target state then to have the same call to this method in every Client?
It really feels like what we should do here is detach the code deployment from the rollout (for instance via a feature toggle).
here's the process I followed:
- add a test which exposes the problem
- add a mechanism to customise the lengths for each database since the tests showed there's no single value for all of them
- re-run the test with the default length being unlimited
- override this method with implementation only in databases where there was a failure
at this point we know (via tests) that the ones without the override do actually support identifiers longer than 65k characters.
We also know that enabling this path by default for all connectors would incur I/O even in connectors where there's no limit, i.e. we'd ask the DB what the limit is and it would say there is no limit. We already know this to be true statically, so why incur the I/O and slow down query planning and analysis?
DefaultJdbcMetadata#createSyntheticColumn is meant to only be used to build the projections for join pushdown and not in other places. This commit renames the method to createSyntheticJoinProjectionColumn to clarify this since in the next commit a similar method is being introduced for aggregation pushdown and the existing name causes confusion.
In 5e520fc we introduced changes which prevent a long synthetic column name to be pushed down into remote database during join pushdown leading to query failures because for example Oracle 12 and older have a small limit of 30 characters. The only other pushdown operation which adds synthetic columns in the pushed down part is aggregation pushdown. However it's not practically affected since the synthetic aggregation columns are always of the form `_pfgnrtd_ + nextSyntheticColumnId` which means it's limited to 9 + 10 characters which is small enough to be supported by all databases.
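The 9 + 10 = 19 bound stated in that commit message can be checked mechanically. The prefix value below is the `_pfgnrtd_` literal quoted in the commit message; the class name is an illustrative stand-in, not the actual Trino constant holder:

```java
class SyntheticNameBoundSketch
{
    // prefix as quoted in the commit message above
    static final String SYNTHETIC_COLUMN_NAME_PREFIX = "_pfgnrtd_";

    // worst case: prefix plus the decimal digits of Integer.MAX_VALUE,
    // since nextSyntheticColumnId is a non-negative int
    static int maxSyntheticAggregationColumnNameLength()
    {
        return SYNTHETIC_COLUMN_NAME_PREFIX.length() + String.valueOf(Integer.MAX_VALUE).length();
    }
}
```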
This will be used to customise behaviour of DefaultJdbcMetadata#createSyntheticJoinProjectionColumn according to specific connector being used in the next commit.
Before this change we hard-coded the column name length limits to that of Oracle 11g. This change allows each connector to provide their own length limits so that instead of always eagerly truncating synthetic columns we only do so when required.
In the previous implementation some conditions didn't need to be handled because the max length limit was known statically which made some conditions impossible.
This change also applies the generated column name truncation for join pushdown to Postgres, Redshift and SQL Server. Postgres and Redshift silently truncate long identifiers unlike Oracle and SQL Server which fail on long identifiers. However note that there was no correctness issue because even after silently truncating an identifier Postgres and Redshift can identify that ambiguous identifiers exist and fail the query.
Force-pushed f3ce795 to 9d3fe7e
hashhar
left a comment
rebased to resolve conflicts
no code changes
unaddressed comments answered
        return delegate.getMaxWriteParallelism(session);
    }

    @Override
it helps make the next commit's diff much cleaner - no need to pay attention to the interface change when you can just focus on the implementation details which are the ones which need to be actually reviewed.
As before, there are no "hard rules, only guidelines". In this case practically people can not revert this commit without any impact since it's a no-op so it doesn't matter.
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java
-    static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId)
+    static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId, OptionalInt optionalMaxColumnNameLength)
     {
         verify(nextSyntheticColumnId >= 0, "nextSyntheticColumnId rolled over and is not monotonically increasing any more");
orElse / orElseGet can only return an object of type T for an Optional<T>, which is not the case here.
There is ifPresent / ifPresentOrElse, but it leads to no more readable code because of a very large lambda plus an AtomicReference that has to be used, since variables captured by lambdas must be effectively final:
    @VisibleForTesting
    static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId, OptionalInt optionalMaxColumnNameLength)
    {
        verify(nextSyntheticColumnId >= 0, "nextSyntheticColumnId rolled over and is not monotonically increasing any more");
        final String separator = "_";
        AtomicReference<JdbcColumnHandle> ret = new AtomicReference<>();
        optionalMaxColumnNameLength.ifPresentOrElse(
                maxColumnNameLength -> {
                    int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length();
                    verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength);
                    if (nextSyntheticColumnIdLength == maxColumnNameLength) {
                        ret.set(JdbcColumnHandle.builderFrom(column)
                                .setColumnName(String.valueOf(nextSyntheticColumnId))
                                .build());
                        return;
                    }
                    if (nextSyntheticColumnIdLength + separator.length() == maxColumnNameLength) {
                        ret.set(JdbcColumnHandle.builderFrom(column)
                                .setColumnName(separator + nextSyntheticColumnId)
                                .build());
                        return;
                    }
                    // fixedLength only accepts values > 0, so the cases where the value would be <= 0 are handled above explicitly
                    String truncatedColumnName = fixedLength(maxColumnNameLength - separator.length() - nextSyntheticColumnIdLength)
                            .split(column.getColumnName())
                            .iterator()
                            .next();
                    ret.set(JdbcColumnHandle.builderFrom(column)
                            .setColumnName(truncatedColumnName + separator + nextSyntheticColumnId)
                            .build());
                },
                () -> ret.set(JdbcColumnHandle.builderFrom(column)
                        .setColumnName(column.getColumnName() + separator + nextSyntheticColumnId)
                        .build()));
        return ret.get();
    }
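For comparison, the same logic written with a plain isEmpty() check and early returns. This is a self-contained sketch, not the actual Trino implementation: it operates on bare column-name strings (JdbcColumnHandle elided), and Guava's Splitter.fixedLength is replaced by an equivalent substring:

```java
import java.util.OptionalInt;

class SyntheticJoinProjectionNameSketch
{
    static String syntheticName(String columnName, int nextSyntheticColumnId, OptionalInt maxColumnNameLength)
    {
        if (nextSyntheticColumnId < 0) {
            throw new IllegalStateException("nextSyntheticColumnId rolled over and is not monotonically increasing any more");
        }
        String separator = "_";
        if (maxColumnNameLength.isEmpty()) {
            // no limit: no truncation needed
            return columnName + separator + nextSyntheticColumnId;
        }
        int max = maxColumnNameLength.getAsInt();
        String id = String.valueOf(nextSyntheticColumnId);
        if (max < id.length()) {
            throw new IllegalStateException("Maximum allowed column name length is " + max + " but next synthetic id has length " + id.length());
        }
        if (id.length() == max) {
            return id;
        }
        if (id.length() + separator.length() == max) {
            return separator + id;
        }
        // truncate the original name to leave room for "_" + id
        int keep = Math.min(columnName.length(), max - separator.length() - id.length());
        return columnName.substring(0, keep) + separator + id;
    }
}
```

No AtomicReference, no lambda, and the early returns make the edge cases read in the same order as the verify conditions.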
plugin/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/TestDefaultJdbcMetadata.java
merging since there are no remaining code changes requested. I'll be happy to address answers or post-merge reviews in a follow-up.
Description
Follow-up to 5e520fc.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: