PARQUET-968 Add Hive/Presto support in ProtoParquet #1

Merged Apr 30, 2018 (1 commit)


ggershinsky

This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools such as Presto (used by AWS Athena) rely on the LIST/MAP wrappers defined in the parquet spec (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). The ProtoParquet schema currently lacks these wrappers. AvroParquet works because it adds them when it handles arrays and maps. This PR adds the same wrappers to parquet-proto, matching the functionality that already exists in parquet-avro.

The change is backward compatible: messages written without the extra LIST/MAP wrappers can still be read with the updated ProtoParquetReader.

The schema changes are easiest to see by example. Given the following protobuf schema:

```
message ListOfPrimitives {
    repeated int64 my_repeated_id = 1;
}
```

Old parquet schema was:
```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```

New parquet schema is:
```
message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
```
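
The wrapping step for a repeated primitive can be sketched as a small text transformation. This is only an illustration of the three-level LIST layout shown above, not the parquet-proto implementation; the function name `wrap_repeated_primitive` and its parameters are hypothetical.

```python
def wrap_repeated_primitive(name: str, primitive: str, field_id: int) -> str:
    """Render the three-level LIST group for a repeated primitive field.

    Hypothetical helper: it only produces schema text in the shape shown
    in this PR description; it does not call any Parquet library.
    """
    lines = [
        # Outer group carries the LIST annotation and the protobuf field id.
        f"required group {name} (LIST) = {field_id} {{",
        # Middle level is the single repeated group, named "list".
        "  repeated group list {",
        # Innermost level holds the actual value, named "element".
        f"    required {primitive} element;",
        "  }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_repeated_primitive("my_repeated_id", "int64", 1))
```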
---

For list of messages, the changes look like this:

Protobuf schema:
```
message ListOfMessages {
    string top_field = 1;
    repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema was:
```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}
```

The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):

```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
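
At the record level, the difference between the two layouts is one extra level of nesting per list entry. The sketch below models records as plain Python dicts to show why a reader can accept both shapes. It is an assumption-laden illustration: `to_list_layout` and `read_repeated` are hypothetical names, and the code does not reflect how ProtoParquetReader actually resolves fields.

```python
from typing import Any, List


def to_list_layout(values: List[Any]) -> dict:
    """Nest flat repeated values under list -> element (the LIST layout)."""
    return {"list": [{"element": v} for v in values]}


def read_repeated(field: Any) -> List[Any]:
    """Return the values of a repeated field from either layout.

    Old layout: the field is a flat sequence of values.
    New layout: the field is a group holding a repeated "list" group,
    each entry of which holds one "element".
    """
    if isinstance(field, dict) and "list" in field:
        return [entry["element"] for entry in field["list"]]
    return list(field)


# The same logical record in both layouts (dict model, illustration only).
old_record = {
    "top_field": "a",
    "first_array": [{"inner_field": 1}, {"inner_field": 2}],
}
new_record = {
    "top_field": "a",
    "first_array": to_list_layout(old_record["first_array"]),
}
```

Under this model, `read_repeated(old_record["first_array"])` and `read_repeated(new_record["first_array"])` return the same list of inner messages.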

---

Maps are handled similarly. Protobuf schema:
```
message TopMessage {
    map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema:
```
message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}
```

New parquet schema (notice the `MAP` wrapper):
```
message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}
```
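
The MAP wrapper follows the same pattern as the LIST wrapper, with a repeated `key_value` group holding a required key and an optional value. A minimal sketch of that shape as a text transformation, assuming a message-typed value; `wrap_map` and its parameters are hypothetical and not part of parquet-proto:

```python
from typing import List


def wrap_map(name: str, key_type: str, value_fields: List[str], field_id: int) -> str:
    """Render the MAP group for a protobuf map<key, message> field.

    Hypothetical helper: value_fields are the already-rendered schema
    lines of the value message, without indentation.
    """
    lines = [
        # Outer group carries the MAP annotation and the protobuf field id.
        f"required group {name} (MAP) = {field_id} {{",
        # Single repeated group holding one key/value pair per entry.
        "  repeated group key_value {",
        f"    required {key_type} key;",
        "    optional group value {",
    ]
    lines += [f"      {field}" for field in value_fields]
    lines += ["    }", "  }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_map("myMap", "int64", ["optional int32 inner_field = 1;"], 1))
```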

Jira: https://issues.apache.org/jira/browse/PARQUET-968

Author: Constantin Muraru
Author: Benoît Hanotte

Closes #411 from costimuraru/PARQUET-968 and squashes the following commits:

16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet

ggershinsky merged commit 013d57f into ggershinsky:master on Apr 30, 2018