Skip to content

Allow different synthetic column name length limits in JDBC connectors#18984

Merged
hashhar merged 6 commits intomasterfrom
hashhar/more-column-alias
Nov 28, 2023
Merged

Allow different synthetic column name length limits in JDBC connectors#18984
hashhar merged 6 commits intomasterfrom
hashhar/more-column-alias

Conversation

@hashhar
Copy link
Copy Markdown
Member

@hashhar hashhar commented Sep 10, 2023

Description

Follow-up to 5e520fc.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Postgres, Redshift, SQL Server
* Fix possible query failures when join is pushed down.

@cla-bot cla-bot bot added the cla-signed label Sep 10, 2023
@hashhar hashhar marked this pull request as draft September 10, 2023 19:25
@hashhar
Copy link
Copy Markdown
Member Author

hashhar commented Sep 11, 2023

The last commit showcases a test which fails for Postgres/SqlServer etc which silently truncate long identifiers so query doesn't fail unless the truncated identifiers lead to ambiguity.

i.e. basically we need a situation where trunc(abcd) = abc and trunc(abcde) = abc to test both cases - where long identifiers are disallowed (e.g. Oracle) and where identifiers are silently truncated (e.g. Postgres).

@hashhar hashhar force-pushed the hashhar/more-column-alias branch from cb49b59 to 9d9e30b Compare September 12, 2023 05:46
@hashhar hashhar marked this pull request as ready for review September 12, 2023 05:46
@hashhar
Copy link
Copy Markdown
Member Author

hashhar commented Sep 12, 2023

@hashhar hashhar force-pushed the hashhar/more-column-alias branch 2 times, most recently from 405ca00 to b2342e0 Compare September 12, 2023 06:21
@dominikzalewski
Copy link
Copy Markdown
Member

That's quite significant change in terms of the number of lines that have changed. I don't think I fully understand from the context the need for that change. Can you provide a business justification in the description of this PR so that I can understand better what the acceptance criteria is?

@hashhar
Copy link
Copy Markdown
Member Author

hashhar commented Sep 13, 2023

That's quite significant change in terms of the number of lines that have changed. I don't think I fully understand from the context the need for that change. Can you provide a business justification in the description of this PR so that I can understand better what the acceptance criteria is?

The commit message for each commit describes why that specific commit is needed. Please take the time to read those.

For the overall justification for this PR see the proposed release note - without this change queries fail for Postgres, Redshift and SQL Server connector at the very least.

Also note that the first 3 commits are all refactors to make the 4th commit easier to implement.
The 4th commit is the functional change.

@Override
public OptionalInt getMaxColumnNameLength(ConnectorSession session)
{
try (Connection connection = connectionFactory.openConnection(session)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is copied/pasted in a few places?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I can probably move this into trino-base-jdbc as a default impl. to be shared for drivers which have properly implemented getMaxColumnNameLength.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of sharing code via inheritance. To me it makes more sense to close that common code in a separate class and inject the instance via DI (possibly swapping for something else via modules or annotation mechanism). For me the inheritance-way never worked long term. Not sure what patterns/conclusions WRT inheritance/no-inheritance exist in this project.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of sharing code via inheritance.

@dominikzalewski Me too, but (un)fortunately that was a design decision once framework for JDBC connector was created. That the only thing one should should be extending JdbcClient (and use BaseJdbcClient for convenience) and that should be enough to implement. All SPI and Trino things should be covered by trino-base-jdbc framework.

https://trino.io/docs/current/develop/example-jdbc.html

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted as a helper to BaseJdbClient, each connector still needs to opt-in to this.

See TopNFunction and limitFunction in BaseJdbcClient as the pattern which was followed.

Copy link
Copy Markdown
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimmed

@Override
public OptionalInt getMaxColumnNameLength(ConnectorSession session)
{
try (Connection connection = connectionFactory.openConnection(session)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of sharing code via inheritance.

@dominikzalewski Me too, but (un)fortunately that was a design decision once framework for JDBC connector was created. That the only thing one should should be extending JdbcClient (and use BaseJdbcClient for convenience) and that should be enough to implement. All SPI and Trino things should be covered by trino-base-jdbc framework.

https://trino.io/docs/current/develop/example-jdbc.html

@hashhar
Copy link
Copy Markdown
Member Author

hashhar commented Oct 16, 2023

@kokosing @dominikzalewski @wendigo @nineinchnick this is ready for another round, I've addressed all comments/replied to them.

Copy link
Copy Markdown
Member

@dominikzalewski dominikzalewski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good. The comments I'm leaving are 'take it or leave it' type of comments.

@VisibleForTesting
static JdbcColumnHandle createSyntheticAggregationColumn(AggregateFunction aggregate, JdbcTypeHandle typeHandle, int nextSyntheticColumnId)
{
// the new column can be max len(SYNTHETIC_COLUMN_NAME_PREFIX) + len(Integer.MAX_VALUE) = 9 + 10 = 19 characters which is small enough to be supported by all databases
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need that comment if a variation of it is already put into the test code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's useful since when reading the source people don't have the test class open side by side.

return Optional.of(new AggregationApplicationResult<>(handle, projections.build(), resultAssignments.build(), ImmutableMap.of(), precalculateStatisticsForPushdown));
}

@VisibleForTesting
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the other day I had a discussion with @kokosing about the change I wanted to do to get rid of the problem that if we introduce a dependent object to this hierarchy, we wind up with the 'hell' in plugins and backporting that we don't want to come back to. @kokosing was arguing that I should not be making that change upfront claiming the YAGNI principle. Now, to be on the 'good advisory' side, I have to ask this question, probably not to you @hashhar , but to @kokosing. Does this instance of introducing the 'smell' of '@VisibleForTesting' yet again justify making the modification we were discussing last time (replace inheritance with composition).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #18984 (comment)

for base-jdbc we've settled on inheritance since it makes implementing new connectors based on it super duper easy.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if i remove the "unit test" and convert to an integration test there won't be this VisibleForTesting needed.

Since the integration test that I came up with (a query with too many aggregation functions so that new aliases get generated) is very "wordy" and also prone to breaking due to planner changes I opted for the unit test + VisibleForTesting.

.setColumnName(columnName)
.setJdbcTypeHandle(typeHandle)
.setColumnType(aggregate.getOutputType())
.setComment(Optional.of("synthetic"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why and 'for what' do we use the comment field. Does it help in any way that we'll put 'synthetic' word in the comment?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for synthetic columns the comment field would only appear in logs and explain plans. so it's useful to identify those in a plan. no other significance.


OptionalInt getMaxWriteParallelism(ConnectorSession session);

default OptionalInt getMaxColumnNameLength(ConnectorSession session)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking out loud whether this is a good place to introduce this method. Currently JdbcClient has methods like 'list tables', 'list schemas' (which is metadata in some way), but also methods like 'create', 'delete' some 'object'. It really feels like this method is not about 'another metadata', but it's a property of the database engine rather than the data itself.

I started asking myself questions like, if it turns out that the 'get' method to query the database for the limit does not return a proper value and we want to kind of hard code it in a 'we know better' way, then how would we do that say for Oracle 11g that has 30 chars limit and Oracle 12 which has 128 limit. Would we be putting a buch of IF statements into this implementation?

I'm not sure whether it's worth introducing a different 'type' of object to handle database restrictions, but it really feels like we are mixing domains here by introducing this change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we be putting a buch of IF statements into this implementation?

Yes, we would. And note that limits aren't the only case where this happens. e.g. rename column syntax differs across MySql versions so MySqlClient#renameColumn has such an impl.

JdbcClient is the "API" for the database. Anything we want to do to remote database is defined here.

public OptionalInt getMaxColumnNameLength(ConnectorSession session)
{
if (maxColumnNameLength != null) {
// According to JavaDoc of DatabaseMetaData#getMaxColumnNameLength a value of 0 signifies that the limit is unknown
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the advantage of modelling it with Java's optional rather than using the technique that is already used in the quoted javadoc (ie. leave a plain int with 0 or negative meaning there's no limit)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very easy to mistake since it requires you to read the javadoc and know that 0 is special.
I went with plain int in the beginning myself but after the first few test failures I changed to the OptionalInt to clarify that empty had a special meaning.

return OptionalInt.of(maxColumnNameLength);
}

try (Connection connection = connectionFactory.openConnection(session)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'get' method to my gut feeling would feel like a 'regular' simple getter. It would be a surprising discovery for me to learn that it actually does I/O.


int maxColumnNameLength = optionalMaxColumnNameLength.getAsInt();
int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length();
verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to go that far to checking that condition? The previous bug report actually proved that for instance for Oracle, it was the DB engine complaining if the alias was too long. So the user experience was that there was a message from a DB engine.

Does it improve that user experience in any way that we will send this 'crypthic' message to the user? Would he/she have a better understanding of the situation if we had that code instead of not having that code at all and relying on the DB engine error reporting?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the user experience was that there was a message from a DB engine.

This does not work in practice. For example for Postgres and Redshift the identifier is silently truncated with no useful error shown to the user and just a generic query failure about "ambiguous columns".

Also Trino is meant to be a uniform interface over your data-sources so where we can we try to provide that uniformity. e.g. CAST('abc' AS varchar(1)) behaves differently in MySQL vs Postgres but Trino will always return same result regardless of what the underlying system is. Same goes for other user-visible aspects.

return OptionalInt.of(getWriteParallelism(session));
}

protected OptionalInt getMaxColumnNameLengthFromDatabaseMetaData(ConnectorSession session)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for 'late notice' and sending this after all the other comments. It only occurred to me to ask this question after I detached myself from the keyboard.

I presume you are not enabling this for all the databases due to 'security reasons', in the sense you want to limit the impact of this change in case something goes wrong?

Is the target state then to have the same call to this method in every Client?

It really feels like what we should do here is detach the code deployment from the rollout (for instance via a feature toggle).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the process I followed:

  • add a test which exposes the problem
  • add a mechanism to customise the lengths for each database since the tests showed there's no single value for all of them
  • re-run the test with the default lenght being unlimited
  • override this method with implementation only in databases where there was a failure

at this point we know (via tests) that the ones without the override do actually support identifiers longer than 65k characters.
we also know that by enabling this path by default for all connectors would incur I/O even in connectors where there's no limit. i.e. we'll ask the DB what the limit is - it'll say no limit. We already know this to be true statically so why incur the I/O and slow down query planning and analysis.

DefaultJdbcMetadata#createSyntheticColumn is meant to only be used to
build the projections for join pushdown and not in other places.

This commit renames the method to createSyntheticJoinProjectionColumn to
clarify this since in the next commit a similar method is being
introduced for aggregation pushdown and the existing name causes
confusion.
In 5e520fc we introduced changes which
prevent a long synthetic column name to be pushed down into remote
database during join pushdown leading to query failures because for
example Oracle 12 and older have a small limit of 30 characters.

The only other pushdown operation which adds synthetic columns in the
pushed down part is aggregation pushdown. However it's not practically
affected since the synthetic aggregation columns are always of the form
`_pfgnrtd_ + nextSyntheticColumnId` which means it's limited to 9 + 10
characters which is small enough to be supported by all databases.
This will be used to customise behaviour of
DefaultJdbcMetadata#createSyntheticJoinProjectionColumn according to specific
connector being used in the next commit.
Before this change we hard-coded the column name length limits to that
of Oracle 11g.

This change allows each connector to provide their own length limits so
that instead of always eagerly truncating synthetic columns we only do
so when required.
In the previous implementation some conditions didn't need to be handled
because the max length limit was known statically which made some
conditions impossible.
This change also applies the generated column name truncation for join
pushdown to Postgres, Redshift and SQL Server.

Postgres and Redshift silently truncate long identifiers unlike Oracle
and SQL Server which fail on long identifiers. However note that there
was no correctness issue because even after silently truncating an
identifier Postgres and Redshift can identify that ambiguous identifiers
exist and fail the query.
@hashhar hashhar force-pushed the hashhar/more-column-alias branch from f3ce795 to 9d3fe7e Compare November 27, 2023 13:47
Copy link
Copy Markdown
Member Author

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebased to resolve conflicts

no code changes

unaddressed comments answered

return OptionalInt.of(getWriteParallelism(session));
}

protected OptionalInt getMaxColumnNameLengthFromDatabaseMetaData(ConnectorSession session)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the process I followed:

  • add a test which exposes the problem
  • add a mechanism to customise the lengths for each database since the tests showed there's no single value for all of them
  • re-run the test with the default lenght being unlimited
  • override this method with implementation only in databases where there was a failure

at this point we know (via tests) that the ones without the override do actually support identifiers longer than 65k characters.
we also know that by enabling this path by default for all connectors would incur I/O even in connectors where there's no limit. i.e. we'll ask the DB what the limit is - it'll say no limit. We already know this to be true statically so why incur the I/O and slow down query planning and analysis.

return delegate.getMaxWriteParallelism(session);
}

@Override
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it helps make the next commit's diff much cleaner - no need to pay attention to the interface change when you can just focus on the implementation details which are the ones which need to be actually reviewed.

As before, there are no "hard rules, only guidelines". In this case practically people can not revert this commit without any impact since it's a no-op so it doesn't matter.

return Optional.of(new AggregationApplicationResult<>(handle, projections.build(), resultAssignments.build(), ImmutableMap.of(), precalculateStatisticsForPushdown));
}

@VisibleForTesting
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #18984 (comment)

for base-jdbc we've settled on inheritance since it makes implementing new connectors based on it super duper easy.

return Optional.of(new AggregationApplicationResult<>(handle, projections.build(), resultAssignments.build(), ImmutableMap.of(), precalculateStatisticsForPushdown));
}

@VisibleForTesting
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if i remove the "unit test" and convert to an integration test there won't be this VisibleForTesting needed.

Since the integration test that I came up with (a query with too many aggregation functions so that new aliases get generated) is very "wordy" and also prone to breaking due to planner changes I opted for the unit test + VisibleForTesting.

.setColumnName(columnName)
.setJdbcTypeHandle(typeHandle)
.setColumnType(aggregate.getOutputType())
.setComment(Optional.of("synthetic"))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for synthetic columns the comment field would only appear in logs and explain plans. so it's useful to identify those in a plan. no other significance.

static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId)
static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId, OptionalInt optionalMaxColumnNameLength)
{
verify(nextSyntheticColumnId >= 0, "nextSyntheticColumnId rolled over and is not monotonically increasing any more");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orElse / orElseGet can only return an object of type T for Optional<T> which is not this case.

There is ifPresent / ifPresentOrElse which leads to not more readable code because of a very large lambda + atomicreference needing to be used since lambdas need assignments to be effectively final:

    @VisibleForTesting
    static JdbcColumnHandle createSyntheticJoinProjectionColumn(JdbcColumnHandle column, int nextSyntheticColumnId, OptionalInt optionalMaxColumnNameLength)
    {
        verify(nextSyntheticColumnId >= 0, "nextSyntheticColumnId rolled over and is not monotonically increasing any more");

        final String separator = "_";
        AtomicReference<JdbcColumnHandle> ret = new AtomicReference<>();
        optionalMaxColumnNameLength.ifPresentOrElse(maxColumnNameLength -> {
                    int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length();
                    verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength);

                    if (nextSyntheticColumnIdLength == maxColumnNameLength) {
                        ret.set(JdbcColumnHandle.builderFrom(column)
                                .setColumnName(String.valueOf(nextSyntheticColumnId))
                                .build());
                    }

                    if (nextSyntheticColumnIdLength + separator.length() == maxColumnNameLength) {
                        ret.set(JdbcColumnHandle.builderFrom(column)
                                .setColumnName(separator + nextSyntheticColumnId)
                                .build());
                    }

                    // fixedLength only accepts values > 0, so the cases where the value would be <= 0 are handled above explicitly
                    String truncatedColumnName = fixedLength(maxColumnNameLength - separator.length() - nextSyntheticColumnIdLength)
                            .split(column.getColumnName())
                            .iterator()
                            .next();
                    ret.set(JdbcColumnHandle.builderFrom(column)
                            .setColumnName(truncatedColumnName + separator + nextSyntheticColumnId)
                            .build());
                },
                () -> JdbcColumnHandle.builderFrom(column)
                        .setColumnName(column.getColumnName() + separator + nextSyntheticColumnId)
                        .build());
        return ret.get();
    }


int maxColumnNameLength = optionalMaxColumnNameLength.getAsInt();
int nextSyntheticColumnIdLength = String.valueOf(nextSyntheticColumnId).length();
verify(maxColumnNameLength >= nextSyntheticColumnIdLength, "Maximum allowed column name length is %s but next synthetic id has length %s", maxColumnNameLength, nextSyntheticColumnIdLength);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the user experience was that there was a message from a DB engine.

This does not work in practice. For example for Postgres and Redshift the identifier is silently truncated with no useful error shown to the user and just a generic query failure about "ambiguous columns".

Also Trino is meant to be a uniform interface over your data-sources so where we can we try to provide that uniformity. e.g. CAST('abc' AS varchar(1)) behaves differently in MySQL vs Postgres but Trino will always return same result regardless of what the underlying system is. Same goes for other user-visible aspects.


OptionalInt getMaxWriteParallelism(ConnectorSession session);

default OptionalInt getMaxColumnNameLength(ConnectorSession session)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we be putting a buch of IF statements into this implementation?

Yes, we would. And note that limits aren't the only case where this happens. e.g. rename column syntax differs across MySql versions so MySqlClient#renameColumn has such an impl.

JdbcClient is the "API" for the database. Anything we want to do to remote database is defined here.

@hashhar
Copy link
Copy Markdown
Member Author

hashhar commented Nov 28, 2023

merging since there are no remaining code changes requested.

I'll be happy to address answers or post-merge reviews in a follow-up.

@hashhar hashhar merged commit 21c6dcb into master Nov 28, 2023
@hashhar hashhar deleted the hashhar/more-column-alias branch November 28, 2023 12:21
@github-actions github-actions bot added this to the 434 milestone Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants