Add variable precision time and timestamp support #229
lpoulain wants to merge 4 commits into trinodb:master
Conversation
| .add_field(sql="TIME '01:23:45.123 %s'" % (timezone_str), python=time_3) \
| .add_field(sql="CAST('01:23:45.123 %s' AS TIME(3) WITH TIME ZONE)" % (timezone_str), python=time_3) \
| .add_field(sql="CAST('01:23:45.123456 %s' AS TIME(6) WITH TIME ZONE)" % (timezone_str), python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789 %s' AS TIME(9) WITH TIME ZONE)" % (timezone_str), python=time_6) \
That may be a naive question, but is there a way of reading results with data types with high precision without them getting truncated? Would it require modifying the query to cast them to some other data type or can the driver be configured to return them in some raw form, like strings?
I don't think we should automatically switch between datetime.time and str depending on the precision.
In JDBC, the user needs to explicitly ask for a String to get one (this works for all precisions).
Is it possible with Python cursor API?
I didn't suggest switching based on precision. There must be an option for users that work with high-precision timestamps, but I guess that can be follow-up work.
@findepi Note that until recently everything returned by the Python client was untyped and str.
So it's still possible to return str when experimental_python_types is not enabled on the cursor.
But yes, for strongly typed time objects Python's stdlib is very limiting and there's nothing like getObject(idx, clazz) in DB-API.
What do other database clients do? (for DBs with more than millisecond precision)
I know of only Oracle as another database that supports higher than millis precision and they don't expose that to the Python client since stdlib is limited to milliseconds.
PostgreSQL, MySQL support microsecond timestamps
SQL Server, SAP HANA support timestamp with p=7
Sorry, I meant micros above. Micros is supported in the Python stdlib so there are no issues with that. I'll check SQL Server and HANA. Thanks for the tip.
EDIT: Yeah, SQL Server seems to truncate extra precision. HANA's driver isn't open source so I can't tell what they do.
@lpoulain If we're actually performing rounding then 01:23:45.123456789 should've been read back as 01:23:45.123457 instead.
| response = self._request.post(self._sql, additional_http_headers)
| if self._experimental_python_types:
| http_headers = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
Another naive question - how come this doesn't require any other changes in how the client parses results from the server?
What's the current behavior, without setting this header?
The server just starts sending back the precision in the type signatures and the values have additional precision to match.
The client capabilities header doesn't seem to be documented anywhere, but I found this comment in the server: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/server/protocol/QueryResultRows.java#L355
So it looks like the value was rounded in the server before, and now it'll get truncated in the client, but still with higher precision (6 vs 3). So it's an improvement.
All the other Trino clients I've seen (JDBC, ODBC, Go) pass X-Trino-Client-Capabilities: PARAMETRIC_DATETIME to Trino when executing a query. The Python library is the only one that doesn't. Unfortunately, Python datetimes only support a 6-digit precision max.
@nineinchnick as for why no other change is required in how the client parses results from the server: in my last PR I made sure the code works with either type of result.
trino-go-client doesn't pass that header either: trinodb/trino-go-client#13
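For illustration, a minimal sketch of the capability mechanism being discussed. The header name is the one the other Trino clients send; the helper function itself is hypothetical, not the client's actual API:

```python
# Hypothetical helper: build the extra HTTP headers for a query request.
# "X-Trino-Client-Capabilities" is the header the JDBC/ODBC clients send;
# PARAMETRIC_DATETIME asks the server to include precision in datetime
# type signatures instead of rounding values to millisecond precision.
HEADER_CLIENT_CAPABILITIES = "X-Trino-Client-Capabilities"

def capability_headers(experimental_python_types: bool) -> dict:
    if experimental_python_types:
        return {HEADER_CLIENT_CAPABILITIES: "PARAMETRIC_DATETIME"}
    return {}
```

With the flag disabled the dict is empty, matching the pre-existing behaviour where the server rounds temporal values to millisecond precision.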
One question - why can this new behaviour not be exposed for cursors without experimental_python_types enabled?
| SqlTest(trino_connection) \
| .add_field(sql="CAST(null AS DATE)", python=None) \
| .add_field(sql="DATE '2001-08-22'", python=datetime.date(2001, 8, 22)) \
| .add_field(sql="DATE '02001-08-22'", python=datetime.date(2001, 8, 22)) \
This should be considered a bug. Our RowMapper should raise errors for such values which we can't handle properly.
🤯 It seems Trino allows this for some reason. @findepi Do you know why?
DATE literals (and varchar cast to date) allow zero padding, surrounding whitespace and year sign (+ or -).
It's however unrelated to the client.
Since DATE '2001-08-22' and DATE '02001-08-22' are just two ways of representing the same value, the py client tests don't (and shouldn't) test both.
ping @lpoulain - let's not test engine-level functionality in this test, i.e. the conversion from 02001 to 2001 happens inside Trino so the client shouldn't care at all how or whether it works.
It might be more useful to test dates around 1582 (when the Julian to Gregorian calendar switch happened):
- 0001-01-01
- 1582-10-04 (before the switch)
- 1582-10-05 (beginning of the switch)
- 1582-10-14 (end of the switch)
reminder to remove DATE '02001-08-22' from test.
@lpoulain Can you please update behavior to match the JDBC driver (rounding instead of truncation)? Other than that (and possibly allowing using this with experimental_python_types disabled) this seems good to me.

@hashhar done.
| .add_field(sql="CAST('01:23:45.123' AS TIME(3))", python=time_3) \
| .add_field(sql="CAST('01:23:45.123456' AS TIME(6))", python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789' AS TIME(9))", python=time_6) \
| .add_field(sql="CAST('01:23:45.123456789123' AS TIME(12))", python=time_6) \
If this test is passing then it seems we aren't performing the rounding.
| .add_field(sql="CAST('2001-08-22 01:23:45.123' AS TIMESTAMP(3))", python=timestamp_3) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456' AS TIMESTAMP(6))", python=timestamp_6) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456789' AS TIMESTAMP(9))", python=timestamp_6) \
| .add_field(sql="CAST('2001-08-22 01:23:45.123456789123' AS TIMESTAMP(12))", python=timestamp_6) \
If we're actually rounding this should have failed since value read back would be .123457 instead.
trino/client.py
Outdated
| raise exceptions.TrinoUserError("Query has been cancelled", self.query_id)
| if self._experimental_python_types:
| new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
nit: extract constant for header value. Do we do this for other well-named header values as well?
trino/client.py
Outdated
| if self._experimental_python_types:
| new_header = {constants.HEADER_CLIENT_CAPABILITIES: 'PARAMETRIC_DATETIME'}
| additional_http_headers = new_header | additional_http_headers if additional_http_headers else new_header
Avoid using dict unions since they overwrite any existing values, which can lead to subtle bugs.
Ideally we'd want to complain if a header whose value was already set somewhere is then overwritten (either because of a bug in our code or because a user is reaching into the internals of the client).
I see you changed logic in later commits.
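A sketch of the conflict-checking merge suggested above, assuming plain dicts for headers (the function name is illustrative, not the client's code):

```python
def merge_headers(base: dict, extra: dict) -> dict:
    # Refuse to silently overwrite a header that is already set,
    # instead of using a dict union which keeps only one of the values.
    conflicts = base.keys() & extra.keys()
    if conflicts:
        raise ValueError("header(s) already set: %s" % sorted(conflicts))
    return {**base, **extra}
```

A collision then surfaces as an explicit error at the call site rather than as a silently dropped header value.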
| @@ -280,12 +280,12 @@ def test_timestamp(trino_connection):
| def test_timestamp_with_timezone(trino_connection):
Squash "Refactor some test function names for consistency" since it changes something introduced in the first commit itself.
trino/client.py
Outdated
| pattern += ".%f"
| dt_size = datetime_default_size + ms_size - ms_to_trim
| dt_size = datetime_default_size
replace size with length? dt_size -> timestamp_length
trino/client.py
Outdated
| return lambda val: \
| [datetime.strptime(val[:dt_size] + val[dt_tz_offset:], pattern + ' %z')
| [datetime.strptime(val[:dt_size]
| + str(round(int(val[dt_size:dt_tz_offset]) / ms_div))
note: round returns floats so it'll lose precision and cause incorrect rounding in some cases.
The round returns a float but the code only cares about the integer part. It takes the digits after the decimal point (e.g. 123456789), divides by ms_div to keep only 6 digits (e.g. 123456.789), rounds this (123457) and converts it to a string. So to lose precision you would need a timestamp far more precise than 12 digits, as round(0.499999999999999999) returns 0 and round(0.5000000000000001) returns 1.
That's not the issue I'm thinking of. Try round(2.675, 2). It can happen for any other value since floats cannot represent all values exactly.
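The pitfall can be demonstrated directly, and rounding the textual digits via Decimal sidesteps binary floats altogether (a sketch under that assumption, not the client's actual code):

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 cannot be represented exactly as a binary float (it is stored as
# 2.67499999...), so round() goes the "wrong" way:
assert round(2.675, 2) == 2.67

# Rounding the fractional-second digits as a Decimal string avoids that:
def round_fraction_to_micros(digits: str) -> int:
    # e.g. "123456789" (nanosecond digits) -> 123457 microseconds
    scaled = Decimal(digits).scaleb(6 - len(digits))
    return int(scaled.to_integral_value(rounding=ROUND_HALF_UP))

assert round_fraction_to_micros("123456789") == 123457
```

Decimal represents the digit string exactly, so the half-up rounding is applied to the true value rather than to its nearest binary approximation.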
| time_0 = datetime.time(1, 23, 45)
| time_3 = datetime.time(1, 23, 45, 123000)
| time_6 = datetime.time(1, 23, 45, 123456)
| time_trunc = datetime.time(1, 23, 45, 123457)
Also squash "Round milliseconds instead of truncating when needed" into others.
2d2d647 to 79a72f5
@lpoulain Please try using fixup commits when addressing review comments, otherwise it's impossible to see what changed between the force-push and the state before it.

@lpoulain: can you rebase your changes on latest master?
Add variable precision time and timestamp support for the experimental_python_types flag
05a35a8 to adf5394
trino/client.py
Outdated
| def _get_time_with_timezome(self, value, time_size, pattern):
| matches = re.match(r'^(.*)([\+\-])(\d{2}):(\d{2})$', value)
| def _get_time_with_timezone_round_ms(self, value, time_size, ms_div, pattern):
The arguments sound quite cryptic. Could we name them better?
| @@ -1,5 +1,8 @@
| import math
| import datetime
| from datetime import timedelta
The full datetime module is already imported above.
trino/client.py
Outdated
| ms_size, ms_to_trim = self._get_number_of_digits(column)
| if ms_size > 0:
| pattern += ".%f"
| class TimeRowMapperFactory(AbstractTemporalRowMapperFactory):
This is actually a value mapper and not a row mapper. Please adjust the name
hashhar
left a comment
skimmed once again.
To be honest the rounding code can use a lot of simplification. I'll propose a PR. In current shape I'm not confident at all in it without a lot of test coverage specific to rounding.
| SqlTest(trino_connection) \
| .add_field(sql="CAST(null AS DATE)", python=None) \
| .add_field(sql="DATE '2001-08-22'", python=date(2001, 8, 22)) \
| .add_field(sql="DATE '0001-01-01'", python=date(1, 1, 1)) \
It's a good idea to have the dates sorted and to include a min value and a max value so the boundaries are obvious.
Also add a test to verify we get useful errors when querying years < 1 or years > 9999.
| time_0 = time(1, 23, 45)
| time_3 = time(1, 23, 45, 123000)
| time_6 = time(1, 23, 45, 123456)
| time_round = time(1, 23, 45, 123457)
we only test upwards rounding.
Add cases for:
- TIME '00:00:00.0000004' -> TIME '00:00:00.000000' (minimal value with downwards rounding)
- TIME '23:59:59.9999995' -> TIME '00:00:00.000000' (min value with upwards rounding)
- TIME '23:59:59.9999999' -> TIME '00:00:00.000000' (max value with upwards rounding)
- TIME '23:59:59.99999949999' -> TIME '23:59:59.999999' (round down)
Zero value with precision preserved:
- TIME '00:00:00' -> TIME '00:00:00'
- TIME '00:00:00.000000' -> TIME '00:00:00.000000'
Same for test_time_with_timezone
Actually the rounding code does not work, and the only examples I could find involve third-party packages such as numpy. So I'm open to suggestions for a better approach.
Is 23:59:59.999999949999 really supposed to be rounded down when converted to microseconds? Shouldn't 999999.949999 microseconds be rounded up?
Sorry I added one more 9 in every case. Edited to correct.
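The cases above can be sketched with a small parser that rounds half-up at microsecond precision and wraps past midnight. This is illustrative only, not the client's value mapper:

```python
from datetime import datetime, time, timedelta
from decimal import Decimal, ROUND_HALF_UP

def round_time_to_micros(text: str) -> time:
    """Parse 'HH:MM:SS[.fraction]' with arbitrary fractional precision,
    rounding the fraction half-up to Python's microsecond limit."""
    base, _, frac = text.partition(".")
    t = datetime.strptime(base, "%H:%M:%S")
    if frac:
        micros = int((Decimal("0." + frac) * 1_000_000)
                     .to_integral_value(rounding=ROUND_HALF_UP))
        # Adding a timedelta lets 23:59:59.9999995 wrap around to 00:00:00
        # via the (discarded) date component.
        t += timedelta(microseconds=micros)
    return t.time()
```

For example, '23:59:59.9999995' rounds up and wraps to 00:00:00, while '23:59:59.99999949999' rounds down to 23:59:59.999999, matching the cases listed above.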
| timestamp_0 = datetime(2001, 8, 22, 1, 23, 45, 0)
| timestamp_3 = datetime(2001, 8, 22, 1, 23, 45, 123000)
| timestamp_6 = datetime(2001, 8, 22, 1, 23, 45, 123456)
| timestamp_round = datetime(2001, 8, 22, 1, 23, 45, 123457)
TIMESTAMP '1970-01-01 00:00:00.12345671' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round down)
TIMESTAMP '1970-01-01 00:00:00.1234567499' -> TIMESTAMP '1970-01-01 00:00:00.1234567' (round up trailing but overall value rounds down)
TIMESTAMP '1970-01-01 00:00:00.99999995' -> TIMESTAMP '1970-01-01 00:00:01.0000000' (round up to next second)
TIMESTAMP '1970-01-01 23:59:59.99999995' -> TIMESTAMP '1970-01-02 00:00:00.0000000' (round up to next day)
TIMESTAMP '1969-12-31 23:59:59.99999995' -> TIMESTAMP '1970-01-01 00:00:00.0000000' (round up from before epoch to epoch)
Also sort according to value + add min and max supported value + verify what error we get for max.
| timestamp_0 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 0)) - delta
| timestamp_3 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123000)) - delta
| timestamp_6 = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123456)) - delta
| timestamp_round = tz.localize(datetime(2001, 8, 22, 11, 23, 45, 123457)) - delta
This is not how timezones work. Depending on the exact date the shift value can be different,
i.e. offset = tz.utcoffset(datetime.now()) will give different offsets at different times of the year.
Is there no way to "apply" a timezone to a datetime?
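Yes: attaching the zone to the specific datetime (rather than subtracting one fixed offset computed from datetime.now()) resolves DST per date. A sketch with the stdlib zoneinfo module; the tests' tz.localize(dt) (pytz) plays the same role:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

tz = ZoneInfo("Europe/Warsaw")
winter = datetime(2001, 1, 22, 11, 23, 45, tzinfo=tz)
summer = datetime(2001, 8, 22, 11, 23, 45, tzinfo=tz)

# The same wall-clock time carries a different UTC offset per date:
assert winter.utcoffset() == timedelta(hours=1)  # CET
assert summer.utcoffset() == timedelta(hours=2)  # CEST (DST)
```

A single offset taken from datetime.now() would put one of the two timestamps an hour off, which is the bug the review comment points at.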
| dt_size = datetime_default_size + ms_size - ms_to_trim
| dt_tz_offset = datetime_default_size + ms_size
| if 'with time zone' in col_type:
| class AbstractTemporalValueMapperFactory:
| def _get_number_of_millis_digits(self, column: Dict[str, Any]) -> int:
| args = column['arguments']
| if len(args) == 0:
| return 3
| ms_size = column['arguments'][0]['value']
| if ms_size == 0:
| return -1
Why -1 here? Could it be simpler to use datetime_default_size = 8 (instead of 9) and adjust _get_number_of_millis_digits_to_trim instead?
It's clever but not readable.
| datetime_default_size = 9 # size of 'HH:MM:SS.'
| time_format = "%H:%M:%S"
| millis_length = self._get_number_of_millis_digits(column)
| millis_to_trim_div = self._get_number_of_millis_digits_to_trim(column, millis_length)
what does _div suffix mean? remove or rename.
It represents the number to divide by to remove the desired number of digits.
| else:
| return self._map_time(time_size, time_format)
|
| def _map_time_timezone_trim_millis_digits(self, datetime_default_size: int, millis_to_trim_div: int,
name says trim (as in truncate) but actual operation is rounding.
Closing PR as too many things have changed since. Will file a new PR.
Add variable precision time and timestamp support for the experimental_python_types flag by passing the X-Trino-Client-Capabilities: PARAMETRIC_DATETIME header in the query request, just like the Trino JDBC and ODBC drivers do. Updates tests/integration/test_types_integration.py accordingly.