Conversation

@fengjiajie
Contributor

Some systems, such as older versions of Impala, do not annotate string columns with the UTF-8 annotation in Parquet files. When importing these Parquet files into Iceberg, reading these binary columns results in type errors.
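For context, here is a minimal sketch (not part of this PR's changes, written against the parquet-mr schema builder; the field and message names are made up) showing the difference between an unannotated BINARY column and one annotated as a UTF-8 string:

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class AnnotationSketch {
  public static void main(String[] args) {
    // What older Impala versions write: a plain BINARY column with no logical type.
    MessageType unannotated = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY).named("name")
        .named("table");

    // What a fully annotated writer produces: BINARY marked as a UTF-8 string.
    MessageType annotated = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY).as(LogicalTypeAnnotation.stringType()).named("name")
        .named("table");

    // Only the second schema carries the string/UTF8 annotation.
    System.out.println(unannotated);
    System.out.println(annotated);
  }
}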

@github-actions github-actions bot added the flink label Oct 12, 2023
@nastra
Contributor

nastra commented Oct 13, 2023

@fengjiajie thanks for working on this. Could you please add a test that reproduces the issue?

@fengjiajie fengjiajie force-pushed the binary_as_string branch 2 times, most recently from f921c46 to c446df6 on October 13, 2023 10:36
@fengjiajie
Contributor Author

@nastra Thank you for taking the time to review my code

@fengjiajie fengjiajie requested a review from nastra October 16, 2023 11:40
@nastra
Contributor

nastra commented Oct 16, 2023

Some systems, such as older versions of Impala, do not annotate string columns with the UTF-8 annotation in Parquet files. When importing these Parquet files into Iceberg, reading these binary columns results in type errors.

The code changes LGTM, but I wonder whether this issue isn't something that should be fixed in Impala itself?
Also curious what others think about fixing this in Iceberg @pvary @stevenzwu @Fokko

@fengjiajie
Contributor Author

Some systems, such as older versions of Impala, do not annotate string columns with the UTF-8 annotation in Parquet files. When importing these Parquet files into Iceberg, reading these binary columns results in type errors.

The code changes LGTM, but I wonder whether this issue isn't something that should be fixed in Impala itself? Also curious what others think about fixing this in Iceberg @pvary @stevenzwu @Fokko

Hi @nastra, thank you for the feedback.
Impala is just one example. There is other Parquet data, generated by various systems over the years, that we want to add to Iceberg tables and read using Flink.

@nastra nastra requested review from pvary and stevenzwu October 17, 2023 07:40
Comment on lines 265 to 269
if (expected.typeId() == Types.StringType.get().typeId()) {
return new StringReader(desc);
} else {
return new ParquetValueReaders.ByteArrayReader(desc);
}
Contributor

I am concerned about the backward compatibility of this change.
Someone might already depend on reading them as binary, and this change would break their use-case

Contributor Author

I am concerned about the backward compatibility of this change. Someone might already depend on reading them as binary, and this change would break their use-case

This modification only applies to cases where the iceberg column is defined as 'string' and the parquet column is 'binary'. Previously, such cases would hit the following exception (the unit test can reproduce it):

java.lang.ClassCastException: [B cannot be cast to org.apache.flink.table.data.StringData
	at org.apache.flink.table.data.GenericRowData.getString(GenericRowData.java:169)
	at org.apache.iceberg.flink.data.TestFlinkParquetReader.testReadBinaryFieldAsString(TestFlinkParquetReader.java:137)
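For anyone reading along, a standalone sketch (not the actual test) of why that cast fails, assuming Flink's GenericRowData API: the old reader materializes a byte[] where the Iceberg schema promises a string.

import java.nio.charset.StandardCharsets;
import org.apache.flink.table.data.GenericRowData;

public class CastFailureSketch {
  public static void main(String[] args) {
    // A row whose only field was produced as raw bytes instead of StringData.
    GenericRowData row = GenericRowData.of((Object) "abc".getBytes(StandardCharsets.UTF_8));

    // getString(0) casts the field to StringData, so this throws the same
    // ClassCastException ([B cannot be cast to StringData) seen in the trace above.
    row.getString(0);
  }
}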

assertThat(rows).as("Should not have more than one row").isExhausted();
}

// read as byte[]
Contributor

Should this be a separate test case?

Contributor Author

This unit test checks whether the expected values can be read from the same file and column both when the iceberg column is defined as string and when it is defined as binary.
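As an illustration (not the actual test code; the field name and id are made up), the same physical Parquet column is read under two different Iceberg schemas:

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Read schema that treats the column as a UTF-8 string.
Schema readAsString = new Schema(
    Types.NestedField.required(1, "data", Types.StringType.get()));

// Read schema that treats the same column as raw bytes.
Schema readAsBinary = new Schema(
    Types.NestedField.required(1, "data", Types.BinaryType.get()));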

@pvary
Contributor

pvary commented Oct 17, 2023

This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce changes on the latest Flink version first, and then create a separate backport PR to bring them to the older Flink versions.

This is better for the reviewer, as the change is smaller and it is easier to focus on the issues, and better for the contributor, since requested changes do not need to be continuously merged to the other branches. And when it is time to backport, it is easier to check the backport-specific changes.

@fengjiajie
Contributor Author

This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce changes on the latest Flink version first, and then create a separate backport PR to bring them to the older Flink versions.

This is better for the reviewer, as the change is smaller and it is easier to focus on the issues, and better for the contributor, since requested changes do not need to be continuously merged to the other branches. And when it is time to backport, it is easier to check the backport-specific changes.

Thank you for your review suggestions. I have updated the PR to only include flink 1.17.

@pvary
Contributor

pvary commented Oct 18, 2023

@fengjiajie: Checked the codepath for the Spark readers and I have 2 questions:

  • What about ORC and Avro files? Don't we have the same issue there?
  • Would it be worth adding the same fix for the Spark reader as well?

@fengjiajie
Contributor Author

fengjiajie commented Oct 18, 2023

@fengjiajie: Checked the codepath for the Spark readers and I have 2 questions:

  • What about ORC and Avro files? Don't we have the same issue there?
  • Would it be worth adding the same fix for the Spark reader as well?

@pvary The issue is that the type of the read result should be determined by the column type definition in Iceberg, rather than the data type within the Parquet file.

  • It seems that ORC does not experience this issue because it creates value readers based on the iceberg column types.
  • Avro reads the fields entirely based on the file type, which seems problematic. However, it does not cause the significant issues seen with Parquet, because Avro natively supports STRING and BYTES types, whereas Parquet only has the BINARY type (whether the field is a string is determined by additional annotations or external metadata).
  • The data type read should be consistent with the iceberg column type, so I think Spark should also incorporate this modification.
  • Additionally, Iceberg has a UUID type, which seems to be supported in Spark but not in Flink: Spark 3.3: Add read and write support for UUIDs #7496

@nastra
Contributor

nastra commented Oct 18, 2023

Additionally, Iceberg has a UUID type, which seems to be supported in Spark but not in Flink: #7496

So Spark itself doesn't support UUIDs as a type, but rather expresses them as Strings. That's what #7496 added/fixed for Spark

@pvary
Contributor

pvary commented Oct 18, 2023

  • It seems that ORC does not experience this issue because it creates value readers based on the iceberg column types.
  • Avro reads the fields entirely based on the file type, which seems problematic. However, it does not cause the significant issues seen with Parquet, because Avro natively supports STRING and BYTES types, whereas Parquet only has the BINARY type (whether the field is a string is determined by additional annotations or external metadata).

Thanks for the explanation!

  • The data type read should be consistent with the iceberg column type, so I think Spark should also incorporate this modification.

How hard would it be to incorporate this into the Spark reader as well?
I am uncomfortable with these kinds of fixes that are applied to only one of the engines.
If it is not too complicated, we should add it here; if it is, then we need to create a separate PR.

I think this is a bigger nut to crack. Probably worth another PR in Flink to fix this.

@github-actions github-actions bot added the spark label Oct 19, 2023
@fengjiajie
Contributor Author

  • It seems that ORC does not experience this issue because it creates value readers based on the iceberg column types.
  • Avro reads the fields entirely based on the file type, which seems problematic. However, it does not cause the significant issues seen with Parquet, because Avro natively supports STRING and BYTES types, whereas Parquet only has the BINARY type (whether the field is a string is determined by additional annotations or external metadata).

Thanks for the explanation!

  • The data type read should be consistent with the iceberg column type, so I think Spark should also incorporate this modification.

How hard would it be to incorporate this into the Spark reader as well? I am uncomfortable with these kinds of fixes that are applied to only one of the engines. If it is not too complicated, we should add it here; if it is, then we need to create a separate PR.

I think this is a bigger nut to crack. Probably worth another PR in Flink to fix this.

I made modifications on Spark 3.5. Before the changes, the following exception would occur:

[B cannot be cast to org.apache.spark.unsafe.types.UTF8String
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:45)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:45)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:165)
	at org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:35)

@fengjiajie fengjiajie requested a review from pvary October 19, 2023 04:17
@pvary
Contributor

pvary commented Oct 19, 2023

@RussellSpitzer: Could you please take a look at the Spark change?

@RussellSpitzer
Member

I'm also a little nervous about this change: how are we guaranteed that the binary is parsable as UTF8 bytes? Seems like we should just be fixing the type annotations rather than changing our readers to read files that have been written incorrectly?

@fengjiajie
Contributor Author

how are we guaranteed that the binary is parsable as UTF8 bytes?

@RussellSpitzer Thank you for participating in the review.
If a column is not encoded in UTF-8, it should not be defined as a string type in the iceberg metadata.

The data reading type should be determined based on the column type definition in the iceberg metadata, rather than the column type definition in the parquet file. An imperfect analogy would be reading a CSV file where the column type is determined by the table's structural metadata during reading, rather than the type defined in the CSV file itself.

@RussellSpitzer
Member

My thoughts were more like, if we have a column defined as "double" we may allow "float" in the file definition but we wouldn't allow Binary. So how is this different?

@fengjiajie
Contributor Author

My thoughts were more like, if we have a column defined as "double" we may allow "float" in the file definition but we wouldn't allow Binary. So how is this different?

@RussellSpitzer I understand your point. This pull request can be seen as:

  • Improving the compatibility of data reading when the data content is known to be correctly encoded.
  • The hint that "binary is stored in UTF-8" comes from the iceberg metadata.

@fengjiajie
Contributor Author

I'm also a little nervous about this change: how are we guaranteed that the binary is parsable as UTF8 bytes? Seems like we should just be fixing the type annotations rather than changing our readers to read files that have been written incorrectly?

@RussellSpitzer Hi, could you please tell me whether this issue can be moved forward?

We have a lot of hive tables that contain such parquet files, and we are trying to convert these hive tables into iceberg tables. In this process the parquet files cannot be rewritten (because of the large number of historical files).

We can guarantee that it can be parsed as UTF-8 because the data was originally defined as a string in hive.
If it wasn't a string before, there's no reason that defining it as a string when defining the iceberg table would make it fail to parse.

@RussellSpitzer
Member

RussellSpitzer commented Nov 25, 2023 via email

@fengjiajie
Contributor Author

You can only guarantee this is safe for your data; for any other user this could be unsafe. That's the underlying issue with this PR: we are essentially allowing a cast from binary to string.

@RussellSpitzer Thanks for the reply, but I still don't get it.

  • I don't quite understand why this is 'unsafe' for any other user?
  • Also, this conversion only happens if the user defines the iceberg column as a string. Defining a column as string means the user wants to use the column's value as a string (the iceberg specification requires strings to be UTF-8 encoded, and the library decodes them as UTF-8 according to the specification). Otherwise, if the user only wants to use the column as binary, they should define the iceberg column as binary instead of string, and there would be no conversion.

Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, could you give an example?

@RussellSpitzer
Member

Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, could you give an example?

User sets the column as a string but the data is not UTF-8 encoded. Or worse, some files do have UTF-8 encoded binary and others do not.

@Apache9

Apache9 commented Nov 28, 2023

Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, could you give an example?

User sets the column as a string but the data is not UTF-8 encoded. Or worse, some files do have UTF-8 encoded binary and others do not.

I think the logic here is straightforward for our users. If your data is not a string, then you should not annotate it as a string, right? If users define it as a string, it is the users' duty to make sure that the binary is a string.

And about the 'unsafe' problem.

Without the PR here, if users specify binary data as string, it will just return a byte array directly and the upper layer will crash because of the type mismatch. And with the PR here, if the data is UTF-8 encoded, we are happy, as our code can now pass. If the data is not UTF-8 encoded, we will crash, which is the same result as before this PR.

So in general, the PR here does not add new crashing scenarios, right?

Or at least, maybe we could introduce a fallback option here? If binary data is annotated as string but no encoding information is provided, we would use the fallback encoding to decode it.
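A hypothetical sketch of that fallback idea (nothing here is from this PR; the property name is invented for illustration):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FallbackStringDecoder {
  private final Charset fallback;

  // charsetName could come from e.g. an (invented) table property such as
  // read.parquet.binary-as-string-encoding; default to UTF-8.
  FallbackStringDecoder(String charsetName) {
    this.fallback = charsetName == null ? StandardCharsets.UTF_8 : Charset.forName(charsetName);
  }

  // Only used when the iceberg schema says string but the parquet column
  // carries no UTF-8 annotation.
  String decode(byte[] unannotatedBinary) {
    return new String(unannotatedBinary, fallback);
  }
}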

WDYT? @RussellSpitzer

Thanks.

@RussellSpitzer
Member

It still feels like we are adding a special workaround to Iceberg for something which shouldn't be happening in the first place. Can you explain the use case again? Why can't the upstream file producer write correctly annotated binary columns?

@Apache9

Apache9 commented Nov 28, 2023

It still feels like we are adding a special workaround to Iceberg for something which shouldn't be happening in the first place. Can you explain the use case again? Why can't the upstream file producer write correctly annotated binary columns?

Impala used to write data like this, and there is a bunch of data already written like this, so we need to find a workaround without rewriting all the parquet data. New data will be written 'correctly', of course, but it is impossible for users to stop their business for a week (or even more) when upgrading to iceberg...

And I do not think the solution here breaks anything?

Thanks.

Contributor

@Fokko Fokko left a comment

This is essentially a schema promotion from binary to string. The data is stored as binary, but read as a string, something I proposed in #5151 a while ago.

I agree with Russell that this is fixing a bug in another system (Impala), and that makes it a bit of a slippery slope. However, if we want to add this at some point, then we can bring this in.

@advancedxy
Contributor

Spark adds a configuration to control whether to treat binary as string in the parquet reader, see: https://github.com/apache/spark/blob/c8137960a0ba725d1633795a057c68f2bbef414b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L981

For cases like this, a similar approach should be adopted rather than promoting binary to string unconditionally. It would be safer, and it's up to the users to decide. WDYT @RussellSpitzer
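For reference, that Spark switch can be turned on like this when reading raw Parquet with Spark itself (a sketch only; the input path is a placeholder, and this flag applies to Spark's own Parquet reader, not Iceberg's):

import org.apache.spark.sql.SparkSession;

public class BinaryAsStringExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("binary-as-string")
        // When true, unannotated BINARY columns are read back as strings (default: false).
        .config("spark.sql.parquet.binaryAsString", "true")
        .getOrCreate();

    spark.read().parquet("/path/to/legacy/parquet").printSchema();
    spark.stop();
  }
}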

@openinx
Member

openinx commented Jan 8, 2024

@advancedxy Thanks for the context from spark. There was a mailing list thread discussing this issue, and we agreed that providing an option to promote binary to string is the correct approach. I think we are on the same page.

https://lists.apache.org/thread/r48tswg4f3gvzscgz7bzq7y10bpdnhz4

@chenwyi2

chenwyi2 commented May 7, 2024

  • It seems that ORC does not experience this issue because it creates value readers based on the iceberg column types.
  • Avro reads the fields entirely based on the file type, which seems problematic. However, it does not cause the significant issues seen with Parquet, because Avro natively supports STRING and BYTES types, whereas Parquet only has the BINARY type (whether the field is a string is determined by additional annotations or external metadata).

Thanks for the explanation!

  • The data type read should be consistent with the iceberg column type, so I think Spark should also incorporate this modification.

How hard would it be to incorporate this into the Spark reader as well? I am uncomfortable with these kinds of fixes that are applied to only one of the engines. If it is not too complicated, we should add it here; if it is, then we need to create a separate PR.

I think this is a bigger nut to crack. Probably worth another PR in Flink to fix this.

I made modifications on Spark 3.5. Before the changes, the following exception would occur:

[B cannot be cast to org.apache.spark.unsafe.types.UTF8String
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:45)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:45)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:165)
	at org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:35)

I met this problem with spark 3.5, so how can we solve this?

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 23, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 30, 2024