Skip to content

Support verifying SELECT queries with DATE or UNKNOWN columns#14266

Merged
caithagoras merged 3 commits intoprestodb:masterfrom
caithagoras:r3
Mar 20, 2020
Merged

Support verifying SELECT queries with DATE or UNKNOWN columns#14266
caithagoras merged 3 commits intoprestodb:masterfrom
caithagoras:r3

Conversation

@caithagoras
Copy link
Contributor

@caithagoras caithagoras commented Mar 19, 2020

== RELEASE NOTES ==

Verifier Changes
* Add support for verifying ``SELECT`` queries that produce ``DATE`` or ``UNKNOWN`` (null) columns.

When executing a query with no result, an artificial QueryResult
object is created but only its QueryStats are needed. Refactor
JdbcPrestoAction to avoid this.
@caithagoras caithagoras changed the title Support verifying SELECT queries with DATE or UNKNOWN columns Support verifying SELECT queries with DATE or UNKNOWN columns Mar 19, 2020
@caithagoras caithagoras force-pushed the r3 branch 5 times, most recently from 1c90f30 to 136f4d0 Compare March 19, 2020 06:57
@mbasmanova
Copy link
Contributor

@caithagoras What is UNKNOWN column and what are the queries that return no results?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@caithagoras Improve JdbcPrestoAction commit looks good, but would you given an example of query with no result?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos in the commit message; also, commit message explains how date columns are supported, but doesn't explain how other types (unknown) are supported. Also, what is "unknown" column type?

Support verifying SELECT queries with date columns

SELECT query is rewritten into CreateTableAsSelect so that results of control 
and test queries can be saved into temporary tables before comparing the 
checksums. However, certain columns types, such as date and unknown as well
as complex types (lists, maps and structs) containing those types cannot be written
into a Hive table.

Previously, SELECT queries that generate date columns were skipped 
(the control query would fail). This change supports verifying those queries by 
casting each date column into a timestamp column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is Hive-specific. Does Verifier support only Hive connector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the assumption so far. Verifier are unaware about what connector is associated with a given catalog, so it won't be able make decisions around that as well.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@caithagoras I believe same problem applies to Geometry type. CC: @jagill

@caithagoras
Copy link
Contributor Author

caithagoras commented Mar 19, 2020

@mbasmanova A Presto unknown type corresponds to a Hive void type, whose value can only be null. For example, SELECT null produces a column of unknown type.

Although being able to verify unknown column is trivial (they always matches), a query may have a mixture of unknown columns and normal columns, and it's better if we verify those queries than skipping them. There are a number of reasons why people want so select multiple null columns: query might be generated, results may be consumed by a script, and so on.

In additional to supporting DATE and UNKNOWN, we'll also be able to support structured type with DATE and UNKNOWN, and I'll do that in a separate PR.

@mbasmanova
Copy link
Contributor

@caithagoras Thanks for explaining. I understand now what is top-level UNKNOWN column, but I'm still puzzled at how list, map, struct may include a field of type UNKNOWN. Would you clarify?

@caithagoras
Copy link
Contributor Author

caithagoras commented Mar 19, 2020

@mbasmanova This is similar with structured type. If array elements are statically null, the element type is unknown, and the array type is array(unknown), which cannot be written into Hive.

explain select array[null, null]
- Output[_col0] => [expr:array(unknown)]

explain select map(array[1], array[null])
- Output[_col0] => [map:map(integer, unknown)]

explain select row(null)
- Output[_col0] => [expr:row(unknown)]

@mbasmanova
Copy link
Contributor

@caithagoras Thank you so much for clarifying.

@caithagoras
Copy link
Contributor Author

caithagoras commented Mar 19, 2020

@mbasmanova Regd Improve JdbcPrestoAction:

When we run a Presto query using jdbc, INSERT INTO query does not have a ResultSet, that is, PrestoStatement.getResultSet returns null, and getUpdateCount returns the number of rows inserted. I haven't checked other query types but I think CREATE TABLE AS should be similar.

Maybe "no result" is not a good wording, as "ignore result" is closer to what is done there.
JdbcPrestoAction provides 2 methods, one that takes ResultSetConverter and returns a QueryResult, and the other does not takes a converter and returns a QueryStats. If caller does not provide a converter, the query will be executed, but the result will be ignored even if the query is a result-producing query.

We do not need results from setup, main, and teardown queries, and we do need results from rewrite, desc, and checksum queries, etc.

SELECT query is rewritten into CreateTableAsSelect so that results of
control and test queries can be saved into temporary tables before
comparing the checksums. However, certain columns types, such as DATE
and UNKNOWN as well as complex types (lists, maps and rows) containing
those types cannot be written into a Hive table.

Previously, SELECT queries that generate DATE columns were skipped
(the control query would fail). This change supports verifying those
queries by casting each DATE column into a TIMESTAMP column.
Previously, SELECT queries that generate UNKNOWN columns were skipped
(the control query would fail). This change supports verifying those
queries by casting each UNKNOWN column into a BIGINT column.
@caithagoras caithagoras merged commit 5078fd6 into prestodb:master Mar 20, 2020
@caithagoras caithagoras deleted the r3 branch March 20, 2020 19:47
@caithagoras caithagoras mentioned this pull request Mar 29, 2020
9 tasks
@caithagoras caithagoras mentioned this pull request May 4, 2020
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants