[SPARK-2489] [SQL] Parquet support for fixed_len_byte_array #1737
Conversation
|
Hi @joesu, thanks for reporting and working on this issue. Instead of creating a new datatype, what do you think about just reading in fixed length byte arrays as our already existing BinaryType? This would give us compatibility without the added overhead of creating a new datatype. While I think it might be a reasonable optimization to add a fixed length byte type at some point in the future, doing so is a fairly major undertaking. Basically every place in the code where we match on datatypes will need to be updated. Therefore, before doing this I'd want to see a use case where the optimization paid off and a design doc on how we would implement it. |
|
ok to test |
|
QA tests have started for PR 1737 at commit
|
|
QA tests have finished for PR 1737 at commit
|
|
It's not that straightforward to reuse BinaryType for handling Parquet's binary and fixed_len_byte_array types, because the two types are incompatible in the Parquet library and we have to specify the data type when reading data from Parquet files through the library. The library refuses to read data if you ask it to read binary-typed data from a fixed_len_byte_array-typed field. If we really want to reuse BinaryType, we have to change all the Catalyst-to-Parquet type conversion functions (e.g. convertFromAttributes() in ParquetTypes.scala) to consider the underlying file schema when mapping BinaryType to the corresponding Parquet type. Do you have a suggested way to do this? In the long run we might want to optimize storage for common fixed-length values like UUIDs, IPv6 addresses, and MD5 hashes. Parquet prepends the data length to every value in a regular binary-typed field, but stores the length only once, in the metadata, for a fixed_len_byte_array field, so fixed-length fields are a good fit for storing fixed-length data. |
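For illustration, a minimal sketch of the schema-aware conversion described above (all names here are hypothetical, not Spark's actual API): the mapping for a BinaryType field consults the underlying Parquet file schema, so one Catalyst type can map to either BINARY or FIXED_LEN_BYTE_ARRAY.

```scala
// Hypothetical model of the two incompatible Parquet primitive flavors.
sealed trait ParquetPrimitive
case object ParquetBinary extends ParquetPrimitive
case class ParquetFixedLenByteArray(length: Int) extends ParquetPrimitive

// Pick the Parquet type for a BinaryType field by consulting the file
// schema, falling back to plain BINARY for fields not declared fixed.
def parquetTypeFor(field: String,
                   fileSchema: Map[String, ParquetPrimitive]): ParquetPrimitive =
  fileSchema.get(field) match {
    case Some(f: ParquetFixedLenByteArray) => f // keep the declared length
    case _                                 => ParquetBinary
  }
```

With this shape, writing through the Parquet library would ask for the exact primitive the file declares, which is what the library insists on.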
|
Another way is to include the max length information in the BinaryType type itself, just like the FixedLenByteArray type in this pull request does. That way we can maintain only one binary datatype covering both the fixed-length and the variable-length cases. What do you think about this approach? |
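As a sketch of this proposal (illustrative only, not Spark's actual definition), the single binary type could carry an optional length:

```scala
// One binary datatype: a defined fixedLength means fixed_len_byte_array,
// None means an ordinary variable-length BINARY field.
case class BinaryType(fixedLength: Option[Int] = None) {
  def isFixed: Boolean = fixedLength.isDefined
}

val variable = BinaryType()         // plain binary
val md5      = BinaryType(Some(16)) // would map to fixed_len_byte_array(16)
```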
|
Can one of the admins verify this patch? |
|
@joesu, thanks for clarifying the issues with reading data from the parquet library. I like the idea of adding a new field to |
|
ok to test |
|
I did some experiments on reusing the existing BinaryType, but it does not quite work as expected. BinaryType is currently a case object, so a length field added to it would be shared across all uses. We would have to change BinaryType to a case class to hold the length information; however, that breaks all code that assumes BinaryType is a case object. Do you have any suggestions? |
|
You are right that we would have to change the BinaryType to be a case class instead to hold this information and then change the rest of the code to deal with that. It is possible that we could play some tricks with the |
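One such trick, sketched below under the assumption that existing pattern matches like `case BinaryType =>` must keep compiling (this is illustrative, not what Spark shipped): turn BinaryType into a plain class and make the bare name a singleton default that extends it.

```scala
// Parameterized class carrying the optional fixed length.
class BinaryType(val fixedLength: Option[Int]) {
  // Structural equality so the default singleton below also matches any
  // other variable-length instance in a `case BinaryType =>` pattern.
  override def equals(o: Any): Boolean = o match {
    case b: BinaryType => b.fixedLength == fixedLength
    case _             => false
  }
  override def hashCode: Int = fixedLength.hashCode
}

// Existing `case BinaryType => ...` matches keep working: a stable-identifier
// pattern tests equality against this default (variable-length) singleton.
case object BinaryType extends BinaryType(None)

def describe(t: BinaryType): String = t match {
  case BinaryType    => "variable-length binary"
  case b: BinaryType => s"fixed-length binary(${b.fixedLength.getOrElse(0)})"
}
```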
|
This patch enables working with fixed_len_byte_array fields in existing Parquet files, which should be helpful for users migrating from other Parquet-based systems. Must we reuse BinaryType to represent both the existing binary data type and the fixed-length type? |
|
The problem is that datatypes are a public API, so once we add one we are stuck with it forever. Also, each new datatype adds significant overhead, so I'd like to be pretty cautious about adding them when they are just special cases of existing types. We are already exploring the pattern of a single datatype with multiple settings elsewhere: there is a patch in the works that adds support for fixed- and arbitrary-precision decimal arithmetic using a single type. So if it is possible to do that here as well, I think that would be good. If the concern is primarily reading data from existing systems, what about a smaller initial patch that allows Spark SQL to read fixed-length binary data, but just uses the existing BinaryType? We wouldn't be able to write out fixed-length data, but this does seem like a good first step. |
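The smaller first step could be sketched like this (hypothetical names, not Spark's actual code): on the read path, both Parquet binary flavors collapse onto the one existing BinaryType, so no new public datatype is introduced.

```scala
// Hypothetical model of the Parquet primitives seen in a file schema.
sealed trait ParquetPrimitive
case object Binary extends ParquetPrimitive
case class FixedLenByteArray(length: Int) extends ParquetPrimitive

// Stand-in for Catalyst's existing binary datatype.
case object BinaryType

// Read-side mapping: the fixed length is accepted but not preserved,
// which is why writing fixed-length data back out stays unsupported.
def toCatalyst(p: ParquetPrimitive): BinaryType.type = p match {
  case Binary               => BinaryType
  case FixedLenByteArray(_) => BinaryType
}
```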
|
Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review). Thus, I think we should close this issue for now and reopen it when it's ready for review. I'm happy to discuss the implementation further whenever you have time :) |
|
no problem. thanks for the heads up! |
|
No updates on this? |
|
I'm using spark 2.2.0 and still have this issue: |
Make it possible to read Parquet files created by Avro with "fixed"-type columns. The underlying type in Parquet is fixed_len_byte_array. Without this patch, Spark SQL fails when it tries to read data fields of this type.
This pull request adds a new type, FixedLenByteArrayType, mapping to the fixed_len_byte_array format.