
Add variable precision time and timestamp support #229

Closed
lpoulain wants to merge 4 commits into trinodb:master from lpoulain:support-variable-precision-time-timestamp

Conversation

@lpoulain (Member):

Add variable precision time and timestamp support for the experimental_python_types flag by passing the X-Trino-Client-Capabilities: PARAMETRIC_DATETIME header in the query request, just like the Trino JDBC and ODBC drivers do.

Updates the tests/integration/test_types_integration.py accordingly.
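The mechanism the PR describes can be sketched roughly as follows. This is not the actual trino-python-client code; the function and constant names are illustrative, and only the header name and value come from the PR description.

```python
# Hypothetical sketch: attach the capability header only when
# experimental_python_types is enabled, as the PR description explains.
HEADER_CLIENT_CAPABILITIES = "X-Trino-Client-Capabilities"  # assumed constant name

def build_headers(experimental_python_types: bool, additional_http_headers=None):
    headers = dict(additional_http_headers or {})
    if experimental_python_types:
        # Tell the server this client handles parametric date/time types,
        # so results come back with full precision in the type signature.
        headers.setdefault(HEADER_CLIENT_CAPABILITIES, "PARAMETRIC_DATETIME")
    return headers
```

Without the flag no extra header is sent, so existing behavior is unchanged.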

.add_field(sql="TIME '01:23:45.123 %s'" % (timezone_str), python=time_3) \
.add_field(sql="CAST('01:23:45.123 %s' AS TIME(3) WITH TIME ZONE)" % (timezone_str), python=time_3) \
.add_field(sql="CAST('01:23:45.123456 %s' AS TIME(6) WITH TIME ZONE)" % (timezone_str), python=time_6) \
.add_field(sql="CAST('01:23:45.123456789 %s' AS TIME(9) WITH TIME ZONE)" % (timezone_str), python=time_6) \
Member:

That may be a naive question, but is there a way to read results with high-precision data types without them getting truncated? Would it require modifying the query to cast them to some other data type, or can the driver be configured to return them in some raw form, like strings?

Member:

I don't think we should automatically switch between datetime.time and str depending on the precision.
In JDBC, the user needs to explicitly ask for a String to get one (this works for all precisions).
Is it possible with the Python cursor API?

@nineinchnick (Member), Aug 24, 2022:

I didn't suggest switching based on precision. There must be an option for users that work with high-precision timestamps, but I guess that can be follow-up work.

Member:

@findepi Note that until recently everything returned by the Python client was untyped and str.

So it's still possible to return str when experimental_python_types is not enabled on the cursor.

But yes, for strongly typed time objects Python's stdlib is very limiting and there's nothing like getObject(idx, clazz) in DB-API.

Member:

What do other database clients do? (for DBs with more than millisecond precision)

Member:

I know of only Oracle as another database that supports higher than millis precision and they don't expose that to the Python client since stdlib is limited to milliseconds.

Member:

PostgreSQL and MySQL support microsecond timestamps.
SQL Server and SAP HANA support timestamps with p=7.

@hashhar (Member), Aug 26, 2022:

Sorry, I meant micros above. Micros are supported in the Python stdlib so no issues there. I'll check SQL Server and HANA. Thanks for the tip.

EDIT: Yeah, SQL Server seems to truncate extra precision. HANA's driver isn't open source so I can't tell what they do.

Member:

@lpoulain If we're actually performing rounding then 01:23:45.123456789 should've been read back as 01:23:45.123457 instead.


response = self._request.post(self._sql, additional_http_headers)
if self._experimental_python_types:
http_headers = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
Member:

Another naive question - how come this doesn't require any other changes in how the client parses results from the server?

Member:

What's the current behavior, without setting this header?

Member:

The server just starts sending back the precision in the type signatures and the values have additional precision to match.

Member:

The client capabilities don't seem to be documented anywhere, but I found this comment in the server: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/server/protocol/QueryResultRows.java#L355
So it looks like the value was rounded in the server before, and now it'll get truncated in the client, but still with higher precision (6 vs 3 digits). So it's an improvement.

Member Author:

All the other Trino clients I've seen (JDBC, ODBC, Go) pass X-Trino-Client-Capabilities: PARAMETRIC_DATETIME to Trino when executing a query. The Python library is the only one that doesn't. Unfortunately, Python datetimes only support up to 6 digits of sub-second precision.

Member Author:

@nineinchnick As for why no other changes are required in how the client parses the results from the server: in my last PR I made sure that the code worked with either type of result.

Member:

trino-go-client doesn't pass that header either: trinodb/trino-go-client#13

@hashhar (Member) left a comment:

cc: @findepi since he's more familiar with the parametric timestamp changes.

@findepi (Member) left a comment:

lgtm

@hashhar (Member) commented Aug 24, 2022:

one question - why can this new behaviour not be exposed for cursors without experimental_python_types=true?

SqlTest(trino_connection) \
.add_field(sql="CAST(null AS DATE)", python=None) \
.add_field(sql="DATE '2001-08-22'", python=datetime.date(2001, 8, 22)) \
.add_field(sql="DATE '02001-08-22'", python=datetime.date(2001, 8, 22)) \
Member:

This should be considered a bug. Our RowMapper should raise errors for such values which we can't handle properly.

Member:

🤯 It seems Trino allows this for some reason. @findepi Do you know why?

Member:

DATE literals (and varchar cast to date) allow zero padding, surrounding whitespace and year sign (+ or -).
It's however unrelated to the client.
Since DATE '2001-08-22' and DATE '02001-08-22' are just two ways of representing the same value, py client tests don't (and shouldn't) test both.

Member:

ping @lpoulain - let's not test engine level functionality in this test. i.e. the conversion from 02001 to 2001 happens inside Trino so the client shouldn't care at all how or whether it works or not.

It might be more useful to test dates before 1582 (when the Julian to Gregorian calendar switch happened).

  • 0001-01-01
  • 1582-10-04 (before the switch)
  • 1582-10-05 (beginning of switch)
  • 1582-10-14 (end of switch)

Member Author:

Done

Member:

reminder to remove DATE '02001-08-22' from test.

Member Author:

Done

@hashhar (Member) commented Aug 25, 2022:

@lpoulain Can you please update behavior to match the JDBC driver (rounding instead of truncation)? Other than that (and possibly allowing using this with experimental_python_types disabled) this seems good to me.

@lpoulain (Member Author):

@hashhar done.

@aalbu requested a review from hashhar on September 13, 2022 22:03
.add_field(sql="CAST('01:23:45.123' AS TIME(3))", python=time_3) \
.add_field(sql="CAST('01:23:45.123456' AS TIME(6))", python=time_6) \
.add_field(sql="CAST('01:23:45.123456789' AS TIME(9))", python=time_6) \
.add_field(sql="CAST('01:23:45.123456789123' AS TIME(12))", python=time_6) \
Member:

If this test is passing then it seems we aren't performing the rounding.


.add_field(sql="CAST('2001-08-22 01:23:45.123' AS TIMESTAMP(3))", python=timestamp_3) \
.add_field(sql="CAST('2001-08-22 01:23:45.123456' AS TIMESTAMP(6))", python=timestamp_6) \
.add_field(sql="CAST('2001-08-22 01:23:45.123456789' AS TIMESTAMP(9))", python=timestamp_6) \
.add_field(sql="CAST('2001-08-22 01:23:45.123456789123' AS TIMESTAMP(12))", python=timestamp_6) \
Member:

If we're actually rounding this should have failed since value read back would be .123457 instead.

trino/client.py Outdated
raise exceptions.TrinoUserError("Query has been cancelled", self.query_id)

if self._experimental_python_types:
new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
Member:

nit: extract constant for header value. Do we do this for other well-named header values as well?

trino/client.py Outdated

if self._experimental_python_types:
new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
additional_http_headers = new_header | additional_http_headers if additional_http_headers else new_header
Member:

avoid using dict unions since they overwrite any existing values which can lead to subtle bugs being introduced.

Ideally we'd want to complain if a header whose value was already set somewhere is later overwritten (either because of a bug in our code or because the user is reaching into the client's internals).
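The stricter merge suggested here could look something like the following sketch (the helper name is hypothetical, not part of the client):

```python
def merge_headers(base: dict, extra: dict) -> dict:
    """Merge extra into base, refusing to silently overwrite a header that
    was already set to a different value (hypothetical helper)."""
    conflicts = {k for k in extra if k in base and base[k] != extra[k]}
    if conflicts:
        raise ValueError(f"refusing to overwrite headers: {sorted(conflicts)}")
    return {**base, **extra}
```

Unlike a plain dict union, this surfaces accidental double-assignment instead of letting the last writer win.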

Member:

I see you changed logic in later commits.

@@ -280,12 +280,12 @@ def test_timestamp(trino_connection):


def test_timestamp_with_timezone(trino_connection):
Member:

Squash "Refactor some test function names for consistency" since it changes something introduced in the first commit itself.

trino/client.py Outdated
pattern += ".%f"

dt_size = datetime_default_size + ms_size - ms_to_trim
dt_size = datetime_default_size
Member:

replace size with length?

dt_size-> timestamp_length

Member Author:

Done

trino/client.py Outdated
return lambda val: \
[datetime.strptime(val[:dt_size] + val[dt_tz_offset:], pattern + ' %z')
[datetime.strptime(val[:dt_size]
+ str(round(int(val[dt_size:dt_tz_offset]) / ms_div))
Member:

note: round returns floats so it'll lose precision and cause incorrect rounding in some cases.

Member Author:

round returns a float but the code only cares about the integer part. It takes all the digits after the decimal point (e.g. 123456789), divides by ms_div to keep only 6 digits (123456.789), rounds that (123457) and converts it to a string. So to lose precision you would need a timestamp with far more than 12 digits of precision, since round(0.499999999999999999) returns 0 and round(0.5000000000000001) returns 1.
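The arithmetic described here can be sketched directly; the digit string and divisor below are illustrative, and the integer-only variant at the end is one way to sidestep the float concerns raised in this thread (it is an assumption, not code from the PR):

```python
frac = "123456789"               # digits after the decimal point (precision 9)
ms_div = 10 ** (len(frac) - 6)   # divisor that leaves 6 digits -> 1000

# Float-based rounding as described above: 123456.789 -> 123457
assert round(int(frac) / ms_div) == 123457

# Integer-only half-up rounding avoids float representation issues entirely
assert (int(frac) + ms_div // 2) // ms_div == 123457
```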

Member:

That's not the issue I'm thinking of. Try round(2.675, 2). It can happen for any other value since floats cannot represent all values exactly.
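The round(2.675, 2) example is a classic demonstration of binary float representation; a short sketch of the pitfall and the exact-decimal alternative:

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 has no exact binary representation; the nearest double is slightly
# below 2.675, so round() legitimately rounds down:
assert round(2.675, 2) == 2.67

# Decimal stores the value exactly, so half-up rounding behaves as expected:
assert Decimal("2.675").quantize(Decimal("0.01"),
                                 rounding=ROUND_HALF_UP) == Decimal("2.68")
```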

time_0 = datetime.time(1, 23, 45)
time_3 = datetime.time(1, 23, 45, 123000)
time_6 = datetime.time(1, 23, 45, 123456)
time_trunc = datetime.time(1, 23, 45, 123457)
Member:

Also squash "Round milliseconds instead of truncating when needed" into others.

@lpoulain force-pushed the support-variable-precision-time-timestamp branch from 2d2d647 to 79a72f5 (September 14, 2022 12:06)
@hashhar (Member) commented Sep 14, 2022:

@lpoulain Please try using fixup commits when addressing review comments otherwise it's impossible to see what changed between the force push and before that.

@mdesmet (Contributor) commented Sep 27, 2022:

@lpoulain : can you rebase your changes on latest master?


Add variable precision time and timestamp support for the experimental_python_types flag
@lpoulain force-pushed the support-variable-precision-time-timestamp branch from 05a35a8 to adf5394 (October 11, 2022 14:39)
trino/client.py Outdated

def _get_time_with_timezome(self, value, time_size, pattern):
matches = re.match(r'^(.*)([\+\-])(\d{2}):(\d{2})$', value)
def _get_time_with_timezone_round_ms(self, value, time_size, ms_div, pattern):
Contributor:

These arguments sound quite cryptic. Could we name them better?

Member Author:

Fixed

@@ -1,5 +1,8 @@
import math
import datetime
from datetime import timedelta
Member:

Full datetime module already imported above

Member Author:

Fixed

trino/client.py Outdated
ms_size, ms_to_trim = self._get_number_of_digits(column)
if ms_size > 0:
pattern += ".%f"
class TimeRowMapperFactory(AbstractTemporalRowMapperFactory):
Contributor:

This is actually a value mapper and not a row mapper. Please adjust the name

Member Author:

Fixed

@hashhar (Member) left a comment:

skimmed once again.

To be honest the rounding code can use a lot of simplification. I'll propose a PR. In current shape I'm not confident at all in it without a lot of test coverage specific to rounding.

SqlTest(trino_connection) \
.add_field(sql="CAST(null AS DATE)", python=None) \
.add_field(sql="DATE '2001-08-22'", python=date(2001, 8, 22)) \
.add_field(sql="DATE '0001-01-01'", python=date(1, 1, 1)) \
Member:

it's a good idea to have the dates sorted and to include a min value and max value so it's obvious what the boundaries are.

Also add a test to verify we get useful errors on querying years < 1 or years > 9999.

time_0 = time(1, 23, 45)
time_3 = time(1, 23, 45, 123000)
time_6 = time(1, 23, 45, 123456)
time_round = time(1, 23, 45, 123457)
@hashhar (Member), Nov 8, 2022:

we only test upwards rounding.

Add cases for:

  • TIME '00:00:00.0000004' -> TIME '00:00:00.000000' (minimal value with downwards rounding)
  • TIME '23:59:59.9999995' -> TIME '00:00:00.000000' (min value with upwards rounding)
  • TIME '23:59:59.9999999' -> TIME '00:00:00.000000' (max value with upwards rounding)
  • TIME '23:59:59.99999949999' -> TIME '23:59:59.999999' (round down)

Zero value with precision preserved

  • TIME '00:00:00' -> TIME '00:00:00'
  • TIME '00:00:00.000000' -> TIME '00:00:00.000000'
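The wrap-around behaviour these cases describe can be exercised against a small reference implementation. This is a sketch for illustration only (the function name is hypothetical, and unlike the last two cases above it always formats with 6 fractional digits rather than preserving the input precision):

```python
from datetime import datetime, timedelta
from decimal import Decimal, ROUND_HALF_UP

def round_time_to_micros(text: str) -> str:
    """Round an 'HH:MM:SS[.fraction]' string to microseconds with half-up
    rounding, wrapping at midnight (hypothetical reference helper)."""
    base, _, frac = text.partition(".")
    h, m, s = (int(p) for p in base.split(":"))
    # Decimal keeps the arbitrary-precision fraction exact while rounding.
    frac_seconds = Decimal("0." + frac) if frac else Decimal(0)
    micros = int((frac_seconds * 1_000_000).quantize(Decimal(1),
                                                     rounding=ROUND_HALF_UP))
    # timedelta normalizes overflow (e.g. 1_000_000 micros -> +1 second),
    # and the modulo wraps 24:00:00 back to 00:00:00.
    td = timedelta(hours=h, minutes=m, seconds=s,
                   microseconds=micros) % timedelta(days=1)
    return (datetime(1970, 1, 1) + td).strftime("%H:%M:%S.%f")
```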

Member:

Same for test_time_with_timezone

Member Author:

Actually the rounding code does not work, and the only examples I could find involve third-party packages such as numpy. So I'm open to suggestions for a better approach.

Member Author:

Is 23:59:59.999999949999 really supposed to be rounded down when converted to microseconds? Shouldn't 999999.949999 microseconds be rounded up?

@hashhar (Member), Nov 10, 2022:

Sorry I added one more 9 in every case. Edited to correct.

timestamp_0 = datetime(2001, 8, 22, 1, 23, 45, 0)
timestamp_3 = datetime(2001, 8, 22, 1, 23, 45, 123000)
timestamp_6 = datetime(2001, 8, 22, 1, 23, 45, 123456)
timestamp_round = datetime(2001, 8, 22, 1, 23, 45, 123457)
Member:

TIMESTAMP '1970-01-01 00:00:00.12345671' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round down)
TIMESTAMP '1970-01-01 00:00:00.1234567499' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round up trailing but overall value rounds down)
TIMESTAMP '1970-01-01 00:00:00.99999995' -> TIMESTAMP '1970-01-01 00:00:01.0000000' (round up to next second)
TIMESTAMP '1970-01-01 23:59:59.99999995' -> TIMESTAMP '1970-01-02 00:00:00.0000000' (round up to next day)
TIMESTAMP '1969-12-31 23:59:59.99999995' -> TIMESTAMP '1970-01-01 00:00:00.0000000' (round up from before epoch to epoch)

Member:

Also sort according to value + add min and max supported value + verify what error we get for max.

Comment on lines +327 to +330
timestamp_0 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 0)) - delta
timestamp_3 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123000)) - delta
timestamp_6 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123456)) - delta
timestamp_round = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123457)) - delta
Member:

this is not how timezones work. Depending on exact date the shift value can be different.

i.e. offset = tz.utcoffset(datetime.now()) will give different offsets based on different time of year or dates.

Is there no way to "apply" a timezone to a datetime?
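As a side note on the question above: with the stdlib zoneinfo module (Python 3.9+), a timezone can be "applied" by simply attaching it via tzinfo=, and the UTC offset is then derived from the date, which is exactly the date-dependent behaviour this comment is pointing out. A minimal sketch (the America/New_York choice is illustrative):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+; pytz's tz.localize() is the older pattern

tz = ZoneInfo("America/New_York")

# The same wall-clock time has different UTC offsets depending on the date:
winter = datetime(2022, 1, 15, 11, 23, 45, tzinfo=tz)
summer = datetime(2022, 7, 15, 11, 23, 45, tzinfo=tz)
assert winter.utcoffset() == timedelta(hours=-5)  # EST
assert summer.utcoffset() == timedelta(hours=-4)  # EDT
```

A fixed delta computed once can therefore be wrong for half the year.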

dt_size = datetime_default_size + ms_size - ms_to_trim
dt_tz_offset = datetime_default_size + ms_size
if 'with time zone' in col_type:
class AbstractTemporalValueMapperFactory:
Member:

can use abc.ABC
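A minimal sketch of this suggestion; only the abstract class name comes from the diff, and the create method name and subclass are illustrative:

```python
import abc

class AbstractTemporalValueMapperFactory(abc.ABC):
    """abc.ABC makes instantiation fail until all abstract methods exist."""
    @abc.abstractmethod
    def create(self, column):
        ...

class TimeValueMapperFactory(AbstractTemporalValueMapperFactory):
    def create(self, column):
        return "value mapper for %s" % column
```

Attempting AbstractTemporalValueMapperFactory() raises TypeError, which documents the contract better than an ad-hoc base class.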

args = column['arguments']
if len(args) == 0:
return 3
ms_size = column['arguments'][0]['value']
Member:

rename ms_size to precision

dt_tz_offset = datetime_default_size + ms_size
if 'with time zone' in col_type:
class AbstractTemporalValueMapperFactory:
def _get_number_of_millis_digits(self, column: Dict[str, Any]) -> int:
Member:

get_precision

return 3
ms_size = column['arguments'][0]['value']
if ms_size == 0:
return -1
Member:

why -1 here. Can it be simpler to use datetime_default_size=8 (instead of 9) and adjust get_number_of_millis_digits_to_trim instead?

It's clever but not readable.

datetime_default_size = 9 # size of 'HH:MM:SS.'
time_format = "%H:%M:%S"
millis_length = self._get_number_of_millis_digits(column)
millis_to_trim_div = self._get_number_of_millis_digits_to_trim(column, millis_length)
Member:

what does _div suffix mean? remove or rename.

Member Author:

It represents the number to divide by to remove the desired number of digits

else:
return self._map_time(time_size, time_format)

def _map_time_timezone_trim_millis_digits(self, datetime_default_size: int, millis_to_trim_div: int,
Member:

name says trim (as in truncate) but actual operation is rounding.

@lpoulain (Member Author):

Closing PR as too many things have changed since. Will file a new PR

@lpoulain lpoulain closed this Dec 13, 2022


10 participants