Conversation


@ghost ghost commented Aug 2, 2014

Make it possible to read parquet files created by Avro with "fixed" type columns. The underlying type in parquet is fixed_len_byte_array. Without this patch, Spark SQL fails when it tries to read data fields of this type.

This pull request adds a new type, "FixedLenByteArrayType", mapping to the fixed_len_byte_array format.

@marmbrus (Contributor)

Hi @joesu, thanks for reporting and working on this issue. Instead of creating a new datatype, what do you think about just reading in fixed length byte arrays as our already existing BinaryType? This would give us compatibility without the added overhead of creating a new datatype.

While I think it might be a reasonable optimization to add a fixed length byte type at some point in the future, doing so is a fairly major undertaking. Basically every place in the code where we match on datatypes will need to be updated. Therefore, before doing this I'd want to see a use case where the optimization paid off and a design doc on how we would implement it.
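The maintenance cost described above comes from the exhaustive pattern matches on DataType scattered through the codebase. A minimal sketch of why a new datatype touches every one of them (the hierarchy is simplified and hypothetical, not actual Spark code):

```scala
// Simplified, hypothetical model of Catalyst's type hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object BinaryType extends DataType

// Every exhaustive match like this one would need a new case arm
// if a new datatype were introduced.
def defaultSize(dt: DataType): Int = dt match {
  case IntegerType => 4
  case BinaryType  => 4096
  // case FixedLenByteArrayType(n) => n   // would be required everywhere
}
```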

@marmbrus (Contributor)

ok to test


SparkQA commented Aug 27, 2014

QA tests have started for PR 1737 at commit f66e658.

  • This patch merges cleanly.


SparkQA commented Aug 27, 2014

QA tests have finished for PR 1737 at commit f66e658.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)
    • case class FixedLenByteArrayType( length:Int ) extends DataType with PrimitiveType


ghost commented Aug 31, 2014

It's not straightforward to reuse BinaryType for both parquet's binary and fixed_len_byte_array types, because the two types are incompatible in the parquet library and we have to specify the data type when reading data from parquet files through the library. The parquet library refuses to read data if you ask it to read binary typed data from a fixed_len_byte_array typed field.

If we really want to reuse BinaryType, we have to change all the Catalyst-to-Parquet type conversion functions (e.g. the convertFromAttributes() function in ParquetTypes.scala) to consider the underlying file schema when mapping BinaryType to the corresponding parquet type. Do you have a suggested way to do this?

In the long run we might want to optimize storage for common fixed length values like UUIDs, IPv6 addresses, MD5 hashes, etc. Parquet prepends the data length to every value of a regular binary typed field, but stores the length only once, in the metadata, for a fixed_len_byte_array typed field. That makes fixed length byte array fields a good fit for storing fixed length data.
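A hedged sketch of the schema-aware conversion described above; the names (ParquetPrimitive, binaryToParquet, fileSchemaLength) are made up for illustration and are not the actual ParquetTypes.scala API:

```scala
// Hypothetical model of the two parquet physical types in question.
sealed trait ParquetPrimitive
case object PqBinary extends ParquetPrimitive
case class PqFixedLenByteArray(length: Int) extends ParquetPrimitive

// To reuse BinaryType, the Catalyst-to-Parquet conversion would have to
// consult the underlying file schema: a fixed length recorded there means
// the field must be read as FIXED_LEN_BYTE_ARRAY, not BINARY.
def binaryToParquet(fileSchemaLength: Option[Int]): ParquetPrimitive =
  fileSchemaLength match {
    case Some(n) => PqFixedLenByteArray(n) // field written as fixed length
    case None    => PqBinary               // ordinary variable-length binary
  }
```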


ghost commented Aug 31, 2014

Another approach is to include the length information in the BinaryType type itself, just like the FixedLenByteArrayType in this pull request does. That way we maintain a single binary data type covering both fixed length and variable length data. What do you think about this approach?


SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?


marmbrus commented Sep 9, 2014

@joesu, thanks for clarifying the issues with reading data from the parquet library. I like the idea of adding a new field to BinaryType, fixedLength: Option[Int], that could be used to distinguish these two storage representations. We can have this field default to None so we don't break any existing code. In particular, since both types are represented as Array[Byte] everywhere else in the Spark SQL execution engine, we don't have to add any extra handling code. This is purely an optimization when writing out data.
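A minimal sketch of that suggestion (hypothetical, not an API Spark actually shipped): BinaryType becomes a case class whose fixedLength defaults to None, so existing call sites are unaffected:

```scala
// Hypothetical: BinaryType as a case class with an optional fixed length.
case class BinaryType(fixedLength: Option[Int] = None)

val variable = BinaryType()          // behaves like today's BinaryType
val md5Hash  = BinaryType(Some(16))  // backed by FIXED_LEN_BYTE_ARRAY(16)
// Both are still Array[Byte] at execution time; the length only matters
// when reading or writing the parquet schema.
```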


marmbrus commented Sep 9, 2014

ok to test

@josephsu (Contributor)

I did some experiments on reusing the existing BinaryType, but it does not quite work as expected.

BinaryType was originally a case object, so a length field added to it would be shared by every use. We would have to change BinaryType to a case class to hold per-instance length information, but that breaks all code that assumes BinaryType is a case object.

Do you have any suggestions?


marmbrus commented Oct 2, 2014

You are right that we would have to change BinaryType to a case class to hold this information, and then change the rest of the code to deal with that. It is possible that we could play some tricks with the unapply method in the BinaryType companion object to minimize the changes to pattern matching code; I'd have to play around with it more to see whether that is actually feasible, though.
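A sketch of the unapply trick mentioned here (hypothetical, untested against the real codebase): a boolean unapply on the companion object lets old `case BinaryType => ...` patterns be rewritten minimally as `case BinaryType() => ...` while matching any length variant:

```scala
sealed trait DataType
class BinaryType(val fixedLength: Option[Int] = None) extends DataType

object BinaryType {
  def apply(fixedLength: Option[Int] = None): BinaryType =
    new BinaryType(fixedLength)
  // Boolean extractor: `case BinaryType() =>` matches any BinaryType,
  // whatever its fixedLength, keeping pattern-match churn small.
  def unapply(dt: DataType): Boolean = dt.isInstanceOf[BinaryType]
}

def describe(dt: DataType): String = dt match {
  case BinaryType() => "binary"
  case _            => "other"
}
// describe(BinaryType()) and describe(BinaryType(Some(16))) both yield "binary"
```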

@josephsu (Contributor)

This patch enables working with fixed_len_byte_array fields in existing parquet files, which should be helpful for users migrating from other parquet-based systems. Must we reuse BinaryType to represent both the existing binary data type and the fixed length type?

@marmbrus (Contributor)

The problem is that data types are a public API, so once we add one we are stuck with it forever. Also, each new datatype adds significant overhead, so I'd like to be pretty cautious about adding them when they are just special cases of existing types.

We are already exploring the pattern of a single datatype with multiple settings elsewhere. There is a patch in the works that adds support for fixed and arbitrary precision decimal arithmetic using a single type. So if it is possible to do here as well I think that would be good.

If the concern is primarily reading data from existing systems, what about a smaller initial patch that allows Spark SQL to read fixed length binary data, but just uses the existing BinaryType? We wouldn't be able to write out fixed length data, but this does seem like a good first step.


marmbrus commented Dec 2, 2014

Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review). Thus, I think we should close this issue for now and reopen it when it's ready for review. I'm happy to discuss the implementation further whenever you have time :)

@asfgit asfgit closed this in b0a46d8 Dec 2, 2014

josephsu commented Dec 2, 2014

No problem. Thanks for the heads up!


praetp commented Aug 8, 2017

No updates on this? We are still hitting:
org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY;


mukunku commented Nov 14, 2017

I'm using Spark 2.2.0 and still have this issue:
Illegal Parquet type: FIXED_LEN_BYTE_ARRAY

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…he#1737)

Add `BosonFilter` case in `stripSparkFilter` in `SQLTestUtils` for Boson testing purpose
