Commit f849384
PARQUET-968 Add Hive/Presto support in ProtoParquet
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for Parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on the LIST/MAP wrappers defined in the Parquet spec (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works because it adds these wrappers when it handles arrays and maps. This PR brings the same wrappers to parquet-protobuf, matching the functionality that already exists in parquet-avro.
This change is backward compatible: messages written without the extra LIST/MAP wrappers can still be read with the updated ProtoParquetReader.
To illustrate the change, given the following protobuf schema:
```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```
Old parquet schema was:
```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```
New parquet schema is:
```
message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
```
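The wrapping shown above can be sketched in a few lines of Java. This is only an illustration that renders schema text: the class `ListWrapperSketch` and its methods `legacyRepeated` and `wrappedList` are hypothetical names, and the code does not use the parquet-mr API or reflect how ProtoParquet builds its schema internally.

```java
// Sketch only: renders Parquet schema text; it does not use the parquet-mr API.
final class ListWrapperSketch {

  /** Legacy layout: the repeated field is emitted as-is. */
  static String legacyRepeated(String type, String name, int id) {
    return "repeated " + type + " " + name + " = " + id + ";";
  }

  /** Wrapped layout: LIST-annotated group > repeated group "list" > "element". */
  static String wrappedList(String type, String name, int id) {
    return String.join("\n",
        "required group " + name + " (LIST) = " + id + " {",
        "  repeated group list {",
        "    required " + type + " element;",
        "  }",
        "}");
  }

  public static void main(String[] args) {
    System.out.println(legacyRepeated("int64", "my_repeated_id", 1));
    System.out.println(wrappedList("int64", "my_repeated_id", 1));
  }
}
```

The field name and id stay on the outer group, so the column keeps its identity while the `list`/`element` levels supply the structure that Hive and Presto expect.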
---
For a list of messages, the changes look like this:
Protobuf schema:
```
message ListOfMessages {
  string top_field = 1;
  repeated MyInnerMessage first_array = 2;
}
message MyInnerMessage {
  int32 inner_field = 1;
}
```
Old parquet schema was:
```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}
```
The expected Parquet schema, which is compatible with Hive and similar to what parquet-avro produces, is the following (notice the `LIST` wrapper):
```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
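Because files written with the old layout must remain readable, a reader has to tell the two shapes of `first_array` apart. The following is a minimal sketch of that structural check over a hypothetical in-memory model; `ListLayoutSketch`, `Node`, `isWrappedList`, and `isLegacyRepeated` are illustrative names, not parquet-mr types, and this is not a description of how ProtoParquetReader is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: a hypothetical, minimal schema model (not parquet-mr's Type/GroupType).
final class ListLayoutSketch {

  static final class Node {
    final String repetition;   // "required", "optional" or "repeated"
    final String name;
    final String annotation;   // e.g. "LIST", or null when absent
    final List<Node> children; // empty for primitive fields

    Node(String repetition, String name, String annotation, Node... children) {
      this.repetition = repetition;
      this.name = name;
      this.annotation = annotation;
      this.children = Arrays.asList(children);
    }
  }

  /** True for the three-level layout: group (LIST) > repeated "list" > "element". */
  static boolean isWrappedList(Node field) {
    if (!"LIST".equals(field.annotation) || field.children.size() != 1) {
      return false;
    }
    Node list = field.children.get(0);
    return "repeated".equals(list.repetition)
        && "list".equals(list.name)
        && list.children.size() == 1
        && "element".equals(list.children.get(0).name);
  }

  /** True for a bare repeated field, i.e. the layout written before this change. */
  static boolean isLegacyRepeated(Node field) {
    return "repeated".equals(field.repetition) && field.annotation == null;
  }

  public static void main(String[] args) {
    Node inner = new Node("optional", "inner_field", null);
    Node legacy = new Node("repeated", "first_array", null, inner);
    Node wrapped = new Node("required", "first_array", "LIST",
        new Node("repeated", "list", null,
            new Node("optional", "element", null, inner)));
    System.out.println(isLegacyRepeated(legacy) + " " + isWrappedList(legacy));
    System.out.println(isLegacyRepeated(wrapped) + " " + isWrappedList(wrapped));
  }
}
```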
---
Maps are handled similarly. Protobuf schema:
```
message TopMessage {
  map<int64, MyInnerMessage> myMap = 1;
}
message MyInnerMessage {
  int32 inner_field = 1;
}
```
Old parquet schema:
```
message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}
```
New parquet schema (notice the `MAP` wrapper):
```
message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
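The `MAP` layout above follows a fixed pattern: an annotated outer group, a repeated `key_value` group, a required `key`, and an optional `value`. A rough sketch of that pattern as a text renderer is given below; `MapWrapperSketch` and `wrappedMap` are hypothetical names, and the code renders strings only rather than calling any parquet-mr API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: renders Parquet schema text; it does not use the parquet-mr API.
final class MapWrapperSketch {

  /**
   * Renders group (MAP) > repeated group "key_value" > required key + optional value group.
   * valueFields holds already-rendered field declarations of the value message.
   */
  static String wrappedMap(String name, int id, String keyType, List<String> valueFields) {
    List<String> lines = new ArrayList<>();
    lines.add("required group " + name + " (MAP) = " + id + " {");
    lines.add("  repeated group key_value {");
    lines.add("    required " + keyType + " key;");
    lines.add("    optional group value {");
    for (String field : valueFields) {
      lines.add("      " + field);
    }
    lines.add("    }");
    lines.add("  }");
    lines.add("}");
    return String.join("\n", lines);
  }

  public static void main(String[] args) {
    List<String> valueFields = new ArrayList<>();
    valueFields.add("optional int32 inner_field = 1;");
    System.out.println(wrappedMap("myMap", 1, "int64", valueFields));
  }
}
```

Note that in the new schema the key is `required` while the value stays `optional`, which the sketch mirrors.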
Jira: https://issues.apache.org/jira/browse/PARQUET-968
Author: Constantin Muraru <[email protected]>
Author: Benoît Hanotte <[email protected]>
Closes #411 from costimuraru/PARQUET-968 and squashes the following commits:
16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet
1 parent af977ad, commit f849384
File tree
9 files changed, +1331 -84 lines changed
- parquet-protobuf/src
  - main/java/org/apache/parquet/proto
  - test
    - java/org/apache/parquet/proto
      - utils
    - resources
Lines changed: 125 additions & 1 deletion
Lines changed: 133 additions & 36 deletions