
Parquet packaging is broken; upgrade parquet to 1.12 #2

Open
hn5092 opened this issue Apr 25, 2019 · 1 comment

hn5092 commented Apr 25, 2019

Parquet shades it.unimi.dsi:fastutil under different package prefixes, which leads to a "class not found" error at query time.
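
One way to confirm this kind of shading mismatch is to probe the classpath for the same class under each candidate prefix. The following is a minimal diagnostic sketch, not part of this project: the class `ShadedClassProbe` is hypothetical, and the two shaded prefixes listed in `main` are assumptions that should be replaced with whatever prefixes the jars in question actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShadedClassProbe {

    /** Returns the candidate class names that the given class loader can resolve. */
    static List<String> loadable(ClassLoader loader, List<String> candidates) {
        List<String> found = new ArrayList<>();
        for (String name : candidates) {
            try {
                // initialize=false: only check that the class is visible,
                // do not run its static initializers.
                Class.forName(name, false, loader);
                found.add(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not visible (or not linkable) under this name; skip it.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A real fastutil class, used here as the probe target.
        String fastutilClass = "it.unimi.dsi.fastutil.longs.LongArrayList";
        List<String> candidates = Arrays.asList(
                fastutilClass,                                  // unshaded
                "shaded.parquet." + fastutilClass,              // assumed shade prefix
                "org.apache.parquet.shaded." + fastutilClass);  // hypothetical alternative prefix
        // Prints whichever of the candidate names are present on the classpath.
        System.out.println(loadable(ShadedClassProbe.class.getClassLoader(), candidates));
    }
}
```

If the name that the failing code references is not in the printed list while another prefix is, the jars were shaded inconsistently.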


hn5092 commented Apr 28, 2019

yabola pushed a commit that referenced this issue May 18, 2023
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for Parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works because it adds these wrappers when it handles arrays and maps. This PR brings the same wrappers to parquet-proto, matching the functionality that already exists in parquet-avro.

This change is backward compatible: messages written without the extra LIST/MAP wrappers can still be read with the updated ProtoParquetReader.

To illustrate the change, consider the following protobuf schema:

```
message ListOfPrimitives {
    repeated int64 my_repeated_id = 1;
}
```

Old parquet schema was:
```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```

New parquet schema is:
```
message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
```
---

For list of messages, the changes look like this:

Protobuf schema:
```
message ListOfMessages {
    string top_field = 1;
    repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema was:
```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}
```

The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):

```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
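
The LIST wrapping shown in the two examples above follows one fixed shape: an outer group annotated `LIST`, a repeated group named `list`, and a single `element` field inside it. The sketch below renders that shape as schema text so the nesting is easy to see. It is an illustration only, not the parquet-proto implementation; the class `ListWrapperSketch` and its methods are hypothetical, and the element declaration is passed in as plain text.

```java
public class ListWrapperSketch {

    /** Prefixes every line of a (possibly multi-line) block with the given padding. */
    static String indent(String block, String pad) {
        return pad + block.replace("\n", "\n" + pad);
    }

    /**
     * Renders the three-level LIST structure around an element declaration.
     * elementDecl is the text of the "element" field, e.g. "required int64 element;".
     */
    static String wrapList(String name, int id, String elementDecl) {
        return "required group " + name + " (LIST) = " + id + " {\n"
                + "  repeated group list {\n"
                + indent(elementDecl, "    ") + "\n"
                + "  }\n"
                + "}";
    }

    public static void main(String[] args) {
        // List of primitives, as in ListOfPrimitives.my_repeated_id.
        System.out.println(wrapList("my_repeated_id", 1, "required int64 element;"));

        // List of messages, as in ListOfMessages.first_array.
        String element = "optional group element {\n"
                + "  optional int32 inner_field = 1;\n"
                + "}";
        System.out.println(wrapList("first_array", 2, element));
    }
}
```

The only difference between the primitive and message cases is the `element` declaration; the two outer levels are identical.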

---

Maps are handled similarly. Protobuf schema:
```
message TopMessage {
    map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema:
```
message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}
```

New parquet schema (notice the `MAP` wrapper):
```
message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
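
The MAP wrapper has the same layered shape as the list case, with a repeated `key_value` group holding a `key` and a `value` field. As a rough sketch under the same assumptions as before (hypothetical class and method names, declarations passed in as text, not the parquet-proto code), the rendering could look like this:

```java
public class MapWrapperSketch {

    /** Prefixes every line of a (possibly multi-line) block with the given padding. */
    static String indent(String block, String pad) {
        return pad + block.replace("\n", "\n" + pad);
    }

    /** Renders the MAP structure: outer MAP group, repeated key_value group, key and value. */
    static String wrapMap(String name, int id, String keyDecl, String valueDecl) {
        return "required group " + name + " (MAP) = " + id + " {\n"
                + "  repeated group key_value {\n"
                + indent(keyDecl, "    ") + "\n"
                + indent(valueDecl, "    ") + "\n"
                + "  }\n"
                + "}";
    }

    public static void main(String[] args) {
        // Mirrors TopMessage.myMap: int64 keys, MyInnerMessage values.
        String value = "optional group value {\n"
                + "  optional int32 inner_field = 1;\n"
                + "}";
        System.out.println(wrapMap("myMap", 1, "required int64 key;", value));
    }
}
```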

Jira: https://issues.apache.org/jira/browse/PARQUET-968

Author: Constantin Muraru
Author: Benoît Hanotte

Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits:

16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet