Conversation


@ghost ghost commented Aug 2, 2014

Make it possible to read parquet files created by Avro with "fixed" type columns. The underlying type in parquet is fixed_len_byte_array. Without this patch, Spark SQL fails when it tries to read data fields of this type.

This pull request adds a new type, "FixedLenByteArrayType", mapping to the fixed_len_byte_array format.

@marmbrus (Contributor)

Hi @joesu, thanks for reporting and working on this issue. Instead of creating a new datatype, what do you think about just reading in fixed length byte arrays as our already existing BinaryType? This would give us compatibility without the added overhead of creating a new datatype.

While I think it might be a reasonable optimization to add a fixed length byte type at some point in the future, doing so is a fairly major undertaking. Basically every place in the code where we match on datatypes will need to be updated. Therefore, before doing this I'd want to see a use case where the optimization paid off and a design doc on how we would implement it.
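The maintenance cost described above comes from the exhaustive pattern matches on DataType scattered through the codebase. A minimal sketch of why a new datatype touches every one of them (the hierarchy is simplified and hypothetical, not actual Spark code):

```scala
// Simplified, hypothetical model of Catalyst's type hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object BinaryType extends DataType

// Every exhaustive match like this one would need a new case arm
// if a new datatype were introduced.
def defaultSize(dt: DataType): Int = dt match {
  case IntegerType => 4
  case BinaryType  => 4096
  // case FixedLenByteArrayType(n) => n   // would be required everywhere
}
```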

@marmbrus (Contributor)

ok to test


SparkQA commented Aug 27, 2014

QA tests have started for PR 1737 at commit f66e658.

  • This patch merges cleanly.


SparkQA commented Aug 27, 2014

QA tests have finished for PR 1737 at commit f66e658.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)
    • case class FixedLenByteArrayType( length:Int ) extends DataType with PrimitiveType


ghost commented Aug 31, 2014

It's not straightforward to reuse BinaryType for both parquet's binary and fixed_len_byte_array types, because the two types are incompatible in the parquet library and we have to specify the data type when reading data from parquet files through the library. The parquet library refuses to read data if you ask it to read binary typed data from a fixed_len_byte_array typed field.

If we really want to reuse BinaryType, we have to change all the Catalyst-to-Parquet type conversion functions (e.g. the convertFromAttributes() function in ParquetTypes.scala) to consider the underlying file schema when mapping BinaryType to the corresponding parquet type. Do you have a suggested way to do this?

In the long run we might want to optimize storage for common fixed length values like UUIDs, IPv6 addresses, MD5 hashes, etc. Parquet prepends the data length to every value of a regular binary typed field, but stores the length only once, in the metadata, for a fixed_len_byte_array typed field. That makes fixed length byte array fields a good fit for storing fixed length data.
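A hedged sketch of the schema-aware conversion described above; the names (ParquetPrimitive, binaryToParquet, fileSchemaLength) are made up for illustration and are not the actual ParquetTypes.scala API:

```scala
// Hypothetical model of the two parquet physical types in question.
sealed trait ParquetPrimitive
case object PqBinary extends ParquetPrimitive
case class PqFixedLenByteArray(length: Int) extends ParquetPrimitive

// To reuse BinaryType, the Catalyst-to-Parquet conversion would have to
// consult the underlying file schema: a fixed length recorded there means
// the field must be read as FIXED_LEN_BYTE_ARRAY, not BINARY.
def binaryToParquet(fileSchemaLength: Option[Int]): ParquetPrimitive =
  fileSchemaLength match {
    case Some(n) => PqFixedLenByteArray(n) // field written as fixed length
    case None    => PqBinary               // ordinary variable-length binary
  }
```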


ghost commented Aug 31, 2014

Another approach is to include the length information in the BinaryType type itself, just like the FixedLenByteArrayType in this pull request does. That way we maintain a single binary data type covering both fixed length and variable length data. What do you think about this approach?


SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?


marmbrus commented Sep 9, 2014

@joesu, thanks for clarifying the issues with reading data from the parquet library. I like the idea of adding a new field to BinaryType, fixedLength: Option[Int], that could be used to distinguish these two storage representations. We can have this field default to None so we don't break any existing code. In particular, since both types are represented as Array[Byte] everywhere else in the Spark SQL execution engine, we don't have to add any extra handling code. This is purely an optimization when writing out data.
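A minimal sketch of that suggestion (hypothetical, not an API Spark actually shipped): BinaryType becomes a case class whose fixedLength defaults to None, so existing call sites are unaffected:

```scala
// Hypothetical: BinaryType as a case class with an optional fixed length.
case class BinaryType(fixedLength: Option[Int] = None)

val variable = BinaryType()          // behaves like today's BinaryType
val md5Hash  = BinaryType(Some(16))  // backed by FIXED_LEN_BYTE_ARRAY(16)
// Both are still Array[Byte] at execution time; the length only matters
// when reading or writing the parquet schema.
```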


marmbrus commented Sep 9, 2014

ok to test

@josephsu (Contributor)

I did some experiments on reusing the existing BinaryType, but it does not quite work as expected.

BinaryType was originally a case object, so a length field added to it would be shared by every use. We would have to change BinaryType to a case class to hold per-instance length information, but that breaks all code that assumes BinaryType is a case object.

Do you have any suggestions?


marmbrus commented Oct 2, 2014

You are right that we would have to change BinaryType to a case class to hold this information, and then change the rest of the code to deal with that. It is possible that we could play some tricks with the unapply method in the BinaryType companion object to minimize the changes to pattern matching code; I'd have to play around with it more to see whether that is actually feasible, though.
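A sketch of the unapply trick mentioned here (hypothetical, untested against the real codebase): a boolean unapply on the companion object lets old `case BinaryType => ...` patterns be rewritten minimally as `case BinaryType() => ...` while matching any length variant:

```scala
sealed trait DataType
class BinaryType(val fixedLength: Option[Int] = None) extends DataType

object BinaryType {
  def apply(fixedLength: Option[Int] = None): BinaryType =
    new BinaryType(fixedLength)
  // Boolean extractor: `case BinaryType() =>` matches any BinaryType,
  // whatever its fixedLength, keeping pattern-match churn small.
  def unapply(dt: DataType): Boolean = dt.isInstanceOf[BinaryType]
}

def describe(dt: DataType): String = dt match {
  case BinaryType() => "binary"
  case _            => "other"
}
// describe(BinaryType()) and describe(BinaryType(Some(16))) both yield "binary"
```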

@josephsu (Contributor)

This patch enables working with fixed_len_byte_array fields in existing parquet files, which should be helpful for users migrating from other parquet-based systems. Must we reuse BinaryType to represent both the existing binary data type and the fixed length type?

@marmbrus (Contributor)

The problem is that data types are a public API, so once we add one we are stuck with it forever. Also, each new datatype adds significant overhead, so I'd like to be pretty cautious about adding them when they are just special cases of existing types.

We are already exploring the pattern of a single datatype with multiple settings elsewhere. There is a patch in the works that adds support for fixed and arbitrary precision decimal arithmetic using a single type. So if it is possible to do here as well I think that would be good.

If the concern is primarily reading data from existing systems, what about a smaller initial patch that allows Spark SQL to read fixed length binary data, but just uses the existing BinaryType? We wouldn't be able to write out fixed length data, but this does seem like a good first step.


marmbrus commented Dec 2, 2014

Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review). Thus, I think we should close this issue for now and reopen it when it's ready for review. I'm happy to discuss the implementation further whenever you have time :)

@asfgit asfgit closed this in b0a46d8 Dec 2, 2014

josephsu commented Dec 2, 2014

No problem. Thanks for the heads up!


praetp commented Aug 8, 2017

No updates on this? We are still hitting:
org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY;


mukunku commented Nov 14, 2017

I'm using Spark 2.2.0 and still have this issue:
Illegal Parquet type: FIXED_LEN_BYTE_ARRAY

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…he#1737)

Add `BosonFilter` case in `stripSparkFilter` in `SQLTestUtils` for Boson testing purpose
