PARQUET-756: Add Union Logical type #44

Open
wants to merge 8 commits into
base: master
51 changes: 51 additions & 0 deletions LogicalTypes.md
@@ -366,7 +366,58 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

### Unions

A Union type annotates data stored as a Group.
It describes the different possible types under the same field name.
The group contains one optional field for each possible type in the Union.
The names of the fields in the annotated Group are not important, but as a convention the type names are used.

#### Nullability
- If the union is not nullable then exactly one field is non-null and the field containing the union is required.
```
// Union<String, Integer, Boolean> (where the value of the union is not null)
// (exactly one of either String, Integer or Boolean is non-null)
required group my_union (UNION) {
optional binary string (UTF8);
optional int32 integer;
optional boolean bool;
}
```
A projection might return an empty union if the non-null field is projected out. However, we know that the Union is non-null;
it just contains a value that was not read from disk.

- If the union is nullable then at most one field is non-null and the field containing the union is optional.
```
// Union<String, Integer, Boolean> (where the value of the union may be null)
// at most one of either String, Integer or Boolean is non-null
// if they are all null then the field my_union itself must be null
// Review comment (Contributor): this is already stored in the definition levels of each branch as well though, right? So there's no need to check even.

optional group my_union (UNION) {
optional binary string (UTF8);
optional int32 integer;
optional boolean bool;
}
```
The definition level of the UNION group is used to differentiate a null value (the union was null to start with) from a projection that excludes the non-null field.
If the Union group is null then the value was null.
If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected.

Review comment (Contributor): Oh, I understand now. Yes, we can use this to tell the difference, but unfortunately we can't really tell the user what kind of branch this was, which I don't think is great.


- If, despite the spec, a group instance contains more than one non-null field, the behavior is undefined and may change depending on the projection applied.

Review comment (Contributor): Can we define this so that it throws an exception in parquet-core? It should be completely detectable. Is there any reason to leave it undefined instead of explicitly fatal?
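
The cases described in this section (null union, known value, value projected out) plus the out-of-spec case can be sketched as a small classifier. This is only an illustration in Python, not parquet-core code: `UnionState`, `classify_union`, and the dictionary representation of a group are hypothetical names chosen here. The sketch reports the out-of-spec case as `INVALID` and deliberately does not choose between undefined behavior and the exception proposed in review.

```python
from enum import Enum
from typing import Any, Dict, Optional, Tuple


class UnionState(Enum):
    NULL = "null"        # the union group itself is null
    VALUE = "value"      # exactly one projected branch is non-null
    UNKNOWN = "unknown"  # group is non-null, but no projected branch is
    INVALID = "invalid"  # more than one branch is non-null (out of spec)


def classify_union(group_is_defined: bool,
                   branches: Dict[str, Any]) -> Tuple[UnionState, Optional[str]]:
    """Classify one union instance from the branches that were projected.

    `branches` maps each projected branch name to its value; None stands
    for a null branch. For a required union, pass group_is_defined=True.
    """
    if not group_is_defined:
        return UnionState.NULL, None
    non_null = [name for name, value in branches.items() if value is not None]
    if len(non_null) > 1:
        return UnionState.INVALID, None
    if len(non_null) == 1:
        return UnionState.VALUE, non_null[0]
    # The group is defined but every projected branch is null: the value
    # lives in a branch that was not read.
    return UnionState.UNKNOWN, None
```

As the review comment points out, the `UNKNOWN` result cannot say which branch held the value, only that one did.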


#### Projecting Unions
The following points are to be noted when projecting columns out of unions:
- At least one column from one of the branches must be included in the projection to know whether the union is null.
- When projecting out some branches of the union, the type of the union is "unknown" for those at read time. Each object model integration (Avro, Thrift, ...) has its own rules to expose this.
- At least one column from each branch must be included in the projection to always know the type.
- The mechanism to filter records with "unknown" type (meaning these columns have been excluded from the projection) is defined by the model as well.
Find details about Thrift and Avro in their respective directories.
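
A minimal sketch of the two coverage rules above, assuming column paths are plain dotted strings and that the caller already knows which columns belong to which branch. `projection_coverage` and its result keys are hypothetical and not part of any Parquet API.

```python
from typing import Dict, Iterable, Set


def projection_coverage(branch_columns: Dict[str, Set[str]],
                        projected: Iterable[str]) -> Dict[str, object]:
    """Report what a projection can still tell us about a union.

    `branch_columns` maps each branch name to the leaf column paths under it.
    """
    projected_set = set(projected)
    # A branch is covered if at least one of its columns is projected.
    covered = {name for name, cols in branch_columns.items()
               if cols & projected_set}
    missing = sorted(set(branch_columns) - covered)
    return {
        "nullness_known": bool(covered),   # at least one branch column is read
        "type_always_known": not missing,  # every branch has a column read
        "unknown_branches": missing,       # branches that surface as "unknown"
    }
```

For the `my_union` example with branches `string`, `integer`, and `bool`, projecting only `my_union.string` keeps nullness known but leaves `bool` and `integer` as branches whose values would read as "unknown".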

#### Mapping to Avro Unions
- An Avro Union that contains null and at least two other types will map to an optional Parquet Union (of the remaining types).
- An Avro Union that does not contain null will map to a required Parquet Union.
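
The two mapping rules can be sketched as follows, assuming Avro branch types are given as plain type-name strings. `avro_union_to_parquet` is a hypothetical helper, not an API of parquet-avro. The case of null plus a single other type is not covered by the rules above, so the sketch returns `None` for it rather than guessing.

```python
from typing import List, Optional, Tuple


def avro_union_to_parquet(avro_branches: List[str]) -> Optional[Tuple[str, List[str]]]:
    """Return (repetition, branch types) for the Parquet Union, or None
    when the rules in this section do not cover the input."""
    non_null = [t for t in avro_branches if t != "null"]
    has_null = len(non_null) != len(avro_branches)
    if not has_null:
        # No null branch: required Parquet Union over the same types.
        return "required", non_null
    if len(non_null) >= 2:
        # Null plus at least two types: optional Union of the remaining types.
        return "optional", non_null
    # Null plus fewer than two other types is not described by this section.
    return None
```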

## Null
Sometimes, when discovering the schema of existing data, values are always null and there is no type information.
The `NULL` type can be used to annotate a column that is always null.
(This is similar to the null type in Avro.)

12 changes: 12 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -175,6 +175,18 @@ enum ConvertedType {
*/
INTERVAL = 21;

/**
* A Union type
*
* This type annotates data stored as a Group.
* This shows the intent to have heterogeneous types under the same field name.
* The names of the fields in the annotated Group are not important in such a case.
* All fields of the Group must be optional and exactly one is defined for each instance of the group.
* If more than one is defined the behavior is undefined and may change depending on the projection applied.
* An optional Union field encodes the difference between a null value and a missing projected-out non-null value.
*/
UNION = 24;

/**
* Annotates a column that is always null
* Sometimes when discovering the schema of existing data