Add variable precision time and timestamp support #229
lpoulain wants to merge 4 commits into trinodb:master
Conversation
| .add_field(sql="TIME '01:23:45.123 %s'" % (timezone_str), python=time_3) \
| .add_field(sql="CAST('01:23:45.123 %s' AS TIME(3) WITH TIME ZONE)" % (timezone_str), python=time_3) \
| .add_field(sql="CAST('01:23:45.123456 %s' AS TIME(6) WITH TIME ZONE)" % (timezone_str), python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789 %s' AS TIME(9) WITH TIME ZONE)" % (timezone_str), python=time_6) \
That may be a naive question, but is there a way of reading results with data types with high precision without them getting truncated? Would it require modifying the query to cast them to some other data type or can the driver be configured to return them in some raw form, like strings?
I don't think we should automatically switch between datetime.time and str depending on the precision.
In JDBC, the user needs to explicitly ask for a String to get one (this works for all precisions).
Is it possible with Python cursor API?
I didn't suggest switching based on precision. There must be an option for users that work with high-precision timestamps, but I guess that can be follow-up work.
@findepi Note that until recently everything returned by the Python client was untyped and str.
So it's still possible to return str when experimental_python_types is not enabled on the cursor.
But yes, for strongly typed time objects Python's stdlib is very limiting and there's nothing like getObject(idx, clazz) in DB-API.
What do other database clients do? (for DBs with more than millisecond precision)
I know of only Oracle as another database that supports higher than millis precision and they don't expose that to the Python client since stdlib is limited to milliseconds.
PostgreSQL, MySQL support microsecond timestamps
SQL Server, SAP HANA support timestamp with p=7
Sorry, I meant micros above. Micros is supported in the Python stdlib so there are no issues with that. I'll check SQL Server and HANA. Thanks for the tip.
EDIT: Yeah, SQL Server seems to truncate extra precision. HANA's driver isn't open source so I can't tell what they do.
@lpoulain If we're actually performing rounding then 01:23:45.123456789 should've been read back as 01:23:45.123457 instead.
| response = self._request.post(self._sql, additional_http_headers)
| if self._experimental_python_types:
| http_headers = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
Another naive question - how come this doesn't require any other changes in how the client parses results from the server?
What's the current behavior, without setting this header?
The server just starts sending back the precision in the type signatures and the values have additional precision to match.
The client capabilities header doesn't seem to be documented anywhere, but I found this comment in the server: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/server/protocol/QueryResultRows.java#L355
So it looks like the value was rounded in the server before, and now it'll get truncated in the client, but still with higher precision (6 vs 3). So it's an improvement.
All the other Trino clients I've seen (JDBC, ODBC, Go) pass X-Trino-Client-Capabilities: PARAMETRIC_DATETIME to Trino when executing a query. The Python library is the only one that doesn't. Unfortunately, Python datetimes only support a 6-digit precision max.
@nineinchnick as for why no other change is required in how the client parses results from the server: in my last PR I made sure the code works with either type of result.
trino-go-client doesn't pass that header either: trinodb/trino-go-client#13
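For illustration, a minimal sketch of the capability mechanism being discussed. The header name is the one the other Trino clients send; the helper function itself is hypothetical, not the client's actual API:

```python
# Hypothetical helper: build the extra HTTP headers for a query request.
# "X-Trino-Client-Capabilities" is the header the JDBC/ODBC clients send;
# PARAMETRIC_DATETIME asks the server to include precision in datetime
# type signatures instead of rounding values to millisecond precision.
HEADER_CLIENT_CAPABILITIES = "X-Trino-Client-Capabilities"

def capability_headers(experimental_python_types: bool) -> dict:
    if experimental_python_types:
        return {HEADER_CLIENT_CAPABILITIES: "PARAMETRIC_DATETIME"}
    return {}
```

With the flag disabled the dict is empty, matching the pre-existing behaviour where the server rounds temporal values to millisecond precision.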
One question - why can this new behaviour not be exposed for cursors without experimental_python_types enabled?
| SqlTest(trino_connection) \
| .add_field(sql="CAST(null AS DATE)", python=None) \
| .add_field(sql="DATE '2001-08-22'", python=datetime.date(2001, 8, 22)) \
| .add_field(sql="DATE '02001-08-22'", python=datetime.date(2001, 8, 22)) \
This should be considered a bug. Our RowMapper should raise errors for such values which we can't handle properly.
🤯 It seems Trino allows this for some reason. @findepi Do you know why?
DATE literals (and varchar cast to date) allow zero padding, surrounding whitespace and year sign (+ or -).
It's however unrelated to the client.
Since DATE '2001-08-22' and DATE '02001-08-22' are just two ways of representing the same value, the py client tests don't (and shouldn't) test both.
ping @lpoulain - let's not test engine-level functionality in this test, i.e. the conversion from 02001 to 2001 happens inside Trino so the client shouldn't care at all how or whether it works.
It might be more useful to test dates around 1582 (when the Julian to Gregorian calendar switch happened):
- 0001-01-01
- 1582-10-04 (before the switch)
- 1582-10-05 (beginning of the switch)
- 1582-10-14 (end of the switch)
reminder to remove DATE '02001-08-22' from test.
@lpoulain Can you please update behavior to match the JDBC driver (rounding instead of truncation)? Other than that (and possibly allowing using this with experimental_python_types disabled) this seems good to me.

@hashhar done.
| .add_field(sql="CAST('01:23:45.123' AS TIME(3))", python=time_3) \
| .add_field(sql="CAST('01:23:45.123456' AS TIME(6))", python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789' AS TIME(9))", python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789123' AS TIME(12))", python=time_6) \
If this test is passing then it seems we aren't performing the rounding.
| .add_field(sql="CAST('2001-08-22 01:23:45.123' AS TIMESTAMP(3))", python=timestamp_3) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456' AS TIMESTAMP(6))", python=timestamp_6) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456789' AS TIMESTAMP(9))", python=timestamp_6) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456789123' AS TIMESTAMP(12))", python=timestamp_6) \
If we're actually rounding this should have failed since value read back would be .123457 instead.
trino/client.py
Outdated
| raise exceptions.TrinoUserError("Query has been cancelled", self.query_id)
| if self._experimental_python_types:
| new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
nit: extract constant for header value. Do we do this for other well-named header values as well?
trino/client.py
Outdated
| if self._experimental_python_types:
| new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
| additional_http_headers = new_header | additional_http_headers if additional_http_headers else new_header
Avoid using dict unions since they overwrite any existing values, which can lead to subtle bugs.
Ideally we'd want to complain if a header whose value was already set somewhere is then overwritten (either because of a bug in our code or because a user is reaching into the internals of the client).
I see you changed logic in later commits.
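A sketch of the conflict-checking merge suggested above, assuming plain dicts for headers (the function name is illustrative, not the client's code):

```python
def merge_headers(base: dict, extra: dict) -> dict:
    # Refuse to silently overwrite a header that is already set,
    # instead of using a dict union which keeps only one of the values.
    conflicts = base.keys() & extra.keys()
    if conflicts:
        raise ValueError("header(s) already set: %s" % sorted(conflicts))
    return {**base, **extra}
```

A collision then surfaces as an explicit error at the call site rather than as a silently dropped header value.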
| @@ -280,12 +280,12 @@ def test_timestamp(trino_connection):
| def test_timestamp_with_timezone(trino_connection):
Squash "Refactor some test function names for consistency" since it changes something introduced in the first commit itself.
trino/client.py
Outdated
| pattern += ".%f"
| dt_size = datetime_default_size + ms_size - ms_to_trim
| dt_size = datetime_default_size
replace size with length? dt_size -> timestamp_length
trino/client.py
Outdated
| return lambda val: \
| [datetime.strptime(val[:dt_size] + val[dt_tz_offset:], pattern + ' %z')
| [datetime.strptime(val[:dt_size]
| + str(round(int(val[dt_size:dt_tz_offset]) / ms_div))
note: round returns floats so it'll lose precision and cause incorrect rounding in some cases.
The round returns a float but the code only cares about the integer part. It takes the digits after the decimal point (e.g. 123456789), divides by ms_div to keep only 6 digits (e.g. 123456.789), rounds this (123457) and converts it to a string. So to lose precision you would need a timestamp far more precise than 12 digits, as round(0.499999999999999999) returns 0 and round(0.5000000000000001) returns 1.
That's not the issue I'm thinking of. Try round(2.675, 2). It can happen for any other value since floats cannot represent all values exactly.
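The pitfall can be demonstrated directly, and rounding the textual digits via Decimal sidesteps binary floats altogether (a sketch under that assumption, not the client's actual code):

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 cannot be represented exactly as a binary float (it is stored as
# 2.67499999...), so round() goes the "wrong" way:
assert round(2.675, 2) == 2.67

# Rounding the fractional-second digits as a Decimal string avoids that:
def round_fraction_to_micros(digits: str) -> int:
    # e.g. "123456789" (nanosecond digits) -> 123457 microseconds
    scaled = Decimal(digits).scaleb(6 - len(digits))
    return int(scaled.to_integral_value(rounding=ROUND_HALF_UP))

assert round_fraction_to_micros("123456789") == 123457
```

Decimal represents the digit string exactly, so the half-up rounding is applied to the true value rather than to its nearest binary approximation.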
| time_0 = datetime.time(1, 23, 45)
| time_3 = datetime.time(1, 23, 45, 123000)
| time_6 = datetime.time(1, 23, 45, 123456)
| time_trunc = datetime.time(1, 23, 45, 123457)
Also squash "Round milliseconds instead of truncating when needed" into others.
2d2d647 to 79a72f5
@lpoulain Please try using fixup commits when addressing review comments, otherwise it's impossible to see what changed between the force-push and the state before it.

@lpoulain: can you rebase your changes on latest master?
Add variable precision time and timestamp support for the experimental_python_types flag
05a35a8 to adf5394
trino/client.py
Outdated
| def _get_time_with_timezome(self, value, time_size, pattern):
| matches = re.match(r'^(.*)([\+\-])(\d{2}):(\d{2})$', value)
| def _get_time_with_timezone_round_ms(self, value, time_size, ms_div, pattern):
The arguments sound quite cryptic. Could we name them better?
| @@ -1,5 +1,8 @@
| import math
| import datetime
| from datetime import timedelta
The full datetime module is already imported above.
trino/client.py
Outdated
| ms_size, ms_to_trim = self._get_number_of_digits(column)
| if ms_size > 0:
| pattern += ".%f"
| class TimeRowMapperFactory(AbstractTemporalRowMapperFactory):
This is actually a value mapper and not a row mapper. Please adjust the name
hashhar
left a comment
skimmed once again.
To be honest the rounding code can use a lot of simplification. I'll propose a PR. In current shape I'm not confident at all in it without a lot of test coverage specific to rounding.
| SqlTest(trino_connection) \
| .add_field(sql="CAST(null AS DATE)", python=None) \
| .add_field(sql="DATE '2001-08-22'", python=date(2001, 8, 22)) \
| .add_field(sql="DATE '0001-01-01'", python=date(1, 1, 1)) \
It's a good idea to have the dates sorted and to include a min value and a max value so the boundaries are obvious.
Also add a test to verify we get useful errors when querying years < 1 or years > 9999.
| time_0 = time(1, 23, 45)
| time_3 = time(1, 23, 45, 123000)
| time_6 = time(1, 23, 45, 123456)
| time_round = time(1, 23, 45, 123457)
we only test upwards rounding.
Add cases for:
- TIME '00:00:00.0000004' -> TIME '00:00:00.000000' (minimal value with downwards rounding)
- TIME '23:59:59.9999995' -> TIME '00:00:00.000000' (min value with upwards rounding)
- TIME '23:59:59.9999999' -> TIME '00:00:00.000000' (max value with upwards rounding)
- TIME '23:59:59.99999949999' -> TIME '23:59:59.999999' (round down)
Zero value with precision preserved:
- TIME '00:00:00' -> TIME '00:00:00'
- TIME '00:00:00.000000' -> TIME '00:00:00.000000'
Same for test_time_with_timezone
Actually the rounding code does not work, and the only examples I could find involve third-party packages such as numpy. So I'm open to suggestions for a better approach.
Is 23:59:59.999999949999 really supposed to be rounded down when converted to microseconds? Shouldn't 999999.949999 microseconds be rounded up?
Sorry I added one more 9 in every case. Edited to correct.
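The cases above can be sketched with a small parser that rounds half-up at microsecond precision and wraps past midnight. This is illustrative only, not the client's value mapper:

```python
from datetime import datetime, time, timedelta
from decimal import Decimal, ROUND_HALF_UP

def round_time_to_micros(text: str) -> time:
    """Parse 'HH:MM:SS[.fraction]' with arbitrary fractional precision,
    rounding the fraction half-up to Python's microsecond limit."""
    base, _, frac = text.partition(".")
    t = datetime.strptime(base, "%H:%M:%S")
    if frac:
        micros = int((Decimal("0." + frac) * 1_000_000)
                     .to_integral_value(rounding=ROUND_HALF_UP))
        # Adding a timedelta lets 23:59:59.9999995 wrap around to 00:00:00
        # via the (discarded) date component.
        t += timedelta(microseconds=micros)
    return t.time()
```

For example, '23:59:59.9999995' rounds up and wraps to 00:00:00, while '23:59:59.99999949999' rounds down to 23:59:59.999999, matching the cases listed above.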
| timestamp_0 = datetime(2001, 8, 22, 1, 23, 45, 0)
| timestamp_3 = datetime(2001, 8, 22, 1, 23, 45, 123000)
| timestamp_6 = datetime(2001, 8, 22, 1, 23, 45, 123456)
| timestamp_round = datetime(2001, 8, 22, 1, 23, 45, 123457)
TIMESTAMP '1970-01-01 00:00:00.12345671' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round down)
TIMESTAMP '1970-01-01 00:00:00.1234567499' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round up trailing but overall value rounds down)
TIMESTAMP '1970-01-01 00:00:00.99999995' -> TIMESTAMP '1970-01-01 00:00:01.0000000' (round up to next second)
TIMESTAMP '1970-01-01 23:59:59.99999995' -> TIMESTAMP '1970-01-02 00:00:00.0000000' (round up to next day)
TIMESTAMP '1969-12-31 23:59:59.99999995' -> TIMESTAMP '1970-01-01 00:00:00.0000000' (round up from before epoch to epoch)
Also sort according to value + add min and max supported value + verify what error we get for max.
| timestamp_0 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 0)) - delta
| timestamp_3 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123000)) - delta
| timestamp_6 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123456)) - delta
| timestamp_round = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123457)) - delta
This is not how timezones work. Depending on the exact date the shift value can be different,
i.e. offset = tz.utcoffset(datetime.now()) will give different offsets at different times of the year.
Is there no way to "apply" a timezone to a datetime?
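Yes: attaching the zone to the specific datetime (rather than subtracting one fixed offset computed from datetime.now()) resolves DST per date. A sketch with the stdlib zoneinfo module; the tests' tz.localize(dt) (pytz) plays the same role:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

tz = ZoneInfo("Europe/Warsaw")
winter = datetime(2001, 1, 22, 11, 23, 45, tzinfo=tz)
summer = datetime(2001, 8, 22, 11, 23, 45, tzinfo=tz)

# The same wall-clock time carries a different UTC offset per date:
assert winter.utcoffset() == timedelta(hours=1)  # CET
assert summer.utcoffset() == timedelta(hours=2)  # CEST (DST)
```

A single offset taken from datetime.now() would put one of the two timestamps an hour off, which is the bug the review comment points at.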
| dt_size = datetime_default_size + ms_size - ms_to_trim
| dt_tz_offset = datetime_default_size + ms_size
| if 'with time zone' in col_type:
| class AbstractTemporalValueMapperFactory:
| def _get_number_of_millis_digits(self, column: Dict[str, Any]) -> int:
| args = column['arguments']
| if len(args) == 0:
| return 3
| ms_size = column['arguments'][0]['value']
| if ms_size == 0:
| return -1
Why -1 here? Could it be simpler to use datetime_default_size = 8 (instead of 9) and adjust _get_number_of_millis_digits_to_trim instead?
It's clever but not readable.
| datetime_default_size = 9 # size of 'HH:MM:SS.'
| time_format = "%H:%M:%S"
| millis_length = self._get_number_of_millis_digits(column)
| millis_to_trim_div = self._get_number_of_millis_digits_to_trim(column, millis_length)
what does _div suffix mean? remove or rename.
It represents the number to divide by to remove the desired number of digits.
| else:
| return self._map_time(time_size, time_format)
|
| def _map_time_timezone_trim_millis_digits(self, datetime_default_size: int, millis_to_trim_div: int,
name says trim (as in truncate) but actual operation is rounding.
Closing PR as too many things have changed since. Will file a new PR.
Add variable precision time and timestamp support for the experimental_python_types flag by passing the X-Trino-Client-Capabilities: PARAMETRIC_DATETIME header in the query request, just like the Trino JDBC and ODBC drivers do. Updates tests/integration/test_types_integration.py accordingly.