Skip to content

Add decimal coercions for table/partition mismatch in Hive connector + refactor (v2)#9422

Closed
losipiuk wants to merge 10 commits intoprestodb:masterfrom
losipiuk:lo/decimal-hive-coercions
Closed

Add decimal coercions for table/partition mismatch in Hive connector + refactor (v2)#9422
losipiuk wants to merge 10 commits intoprestodb:masterfrom
losipiuk:lo/decimal-hive-coercions

Conversation

@losipiuk
Copy link
Contributor

This addresses:
#7658

This supersedes #7825.

Coercions for cases when Hive table schema does not match partition schema have been added for:

different decimals
decimal vs double
decimal vs float
This PR includes refactor of Hive coercions mechanism.

@losipiuk
Copy link
Contributor Author

losipiuk commented Dec 4, 2017

@electrum @martint I have a question on this one.

I noticed that code I wrote for DECIMAL to DOUBLE coercion in hive-connector does not behave exactly as it behaves in Hive (at least the version of Hive I played with, and version of Hive lib Presto uses).

In coercion code I used the same logic which we use for Presto's CAST from DOUBLE to DECIMAL(x,y).

The difference in behaviour between coercion in Presto and in Hive is visible for large double values when we actually go out of double precision.
E.g. 1234567890123456789012345.12345

  • is cast by Presto to DECIMAL(30,5) as 1234567890123456824475648.00000.
  • while Hive casts it as 1234567890123456800000000.12345.

Actually what Hive does, seems somewhat more reasonable as 1234567890123456789012345.12345 is actually 1.2345678901234568E+24. So putting some random digits for decimal positions for which we are lacking precision anyway does not seem like a great idea.

Note that current implementation implies different behaviour for different data types in Presto(depending if GenericHiveRecordCursor is used or not). This is not nice.

So we have 3 options here:

  1. Accept that it is not nice and for TEXT data type coercions work differently, than for ORC and other. TEXT being compatible with Hive.
  2. Change the coercion code in hive-connector for other formats so also ORC and other are compatible with what Hive does.
  3. Change the CAST of DOUBLE -> DECIMAL in Presto so it works like it works in Hive (that would also imply 2. - as coercion code currently uses the same code as cast).

It seems that either 2 or 3 is a way to go. I will yet have to see what standard says about how cast double->decimal should behave.

Any opinions on this one?

@sopel39
Copy link
Contributor

sopel39 commented Dec 5, 2017

It seems that either 2 or 3 is a way to go. I will yet have to see what standard says about how cast double->decimal should behave.

Looking at the standard it seems that we shouldn't round any significant digits, but otherwise we are free to round or truncate:

If there is a representation of SV in the data type TD that does not lose any
leading sig-nificant digits after rounding or truncating if necessary, then TV
is that representation. The choice of whether to round or truncate is implementation-defined.

Otherwise, an exception condition is raised: data exception — numeric value out of range.

Both presto and Hive seems (?) to not conform to SQL standard. I would rather have Presto conform to the standard and maybe have Hive-compatible casts in Hive connector only.

@losipiuk
Copy link
Contributor Author

losipiuk commented Dec 5, 2017

Yeah.

But now comes the question how do you define "significant digits" for double value.

Which of the digits in double 1234567890123456789012345.12345 are significant? Based on how String.valueOf(double) works (returning 1.2345678901234568E+24.), one can think that only prefix (12345678901234567) is significant.

Edit:

Standard is not very clear on what those are:

An approximate numeric value consists of a mantissa and an exponent.  The mantissa is a signed
numeric value, and the exponent is a signed integer that specifies the magnitude of the mantissa.
An approximate numeric value has a precision.  The precision is a positive integer that specifies the
number of significant binary digits in the mantissa.  The value of an approximate numeric value is
the mantissa multiplied by 10^x , where x is the exponent.

It says about number of significant digits in mantissa. But it does not say those are significant digits of double itself. Though it seems a reasonable assumption.

@martint
Copy link
Contributor

martint commented Dec 18, 2017

Which of the digits in double 1234567890123456789012345.12345 are significant?

That value is not a valid double, so that's the wrong question to ask :)

Based on how String.valueOf(double) works (returning 1.2345678901234568E+24.), one can think that only prefix (12345678901234567) is significant.

These would be the significant values (the mantissa): 12345678901234568. If we encounter such a double value (1.2345678901234568E+24), the right conversion to DECIMAL(30, 5) would be: 1234567890123456800000000.00000.

Regarding the original question:

The difference in behaviour between coercion in Presto and in Hive is visible for large double values when we actually go out of double precision.
E.g. 1234567890123456789012345.12345

  • is cast by Presto to DECIMAL(30,5) as 1234567890123456824475648.00000.
  • while Hive casts it as 1234567890123456800000000.12345.

... since that's not a valid double value, I'm not sure when that case would ever occur.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this commit?

The reason we have such an interface is to be able to override the coercion policy in our internal connector that's based on the Hive connector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I did not know about the other implementation. I think (not sure exactly as it was long time ago ;) ) that I assumed that an interface is not needed for internal class with just a single implementation. I will revert this one.

@losipiuk losipiuk force-pushed the lo/decimal-hive-coercions branch 2 times, most recently from 4360bf1 to c62860d Compare December 18, 2017 22:51
@losipiuk losipiuk force-pushed the lo/decimal-hive-coercions branch from c62860d to fa3c74d Compare December 18, 2017 22:55
@losipiuk
Copy link
Contributor Author

losipiuk commented Dec 18, 2017

Which of the digits in double 1234567890123456789012345.12345 are significant?

That value is not a valid double, so that's the wrong question to ask :)

Good point @martint . I mean it is valid double literal in Presto. But internally it will represent double 1.2345678901234568E24.
That observation makes current behaviour of cast double -> decimal in Presto look nonsensical; as some extra digits
appear in number after cast out of the blue:

presto:default> select 
            cast(double '1234567890123456789012345.12345' as decimal(30,5)), 
            cast(1.2345678901234568E24 as decimal(30,5));
              _col0              |              _col1
---------------------------------+---------------------------------
 1234567890123456824475648.00000 | 1234567890123456824475648.00000
(1 row)

cc: @pnowojski

@losipiuk
Copy link
Contributor Author

logged #9575

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants