4 changes: 4 additions & 0 deletions LogicalTypes.md
@@ -366,3 +366,7 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## Null
Sometimes, when discovering the schema of existing data, the values of a column are always null and there is no type information.
Contributor

This is used when an inferred type is null because a schema inference tool has only observed null values, right? Should this be a primitive type? My concern is that we may have a situation like this:

1. Job 1 writes only nulls for column A as `optional binary A (NULL)`;
2. Job 2 writes mixed nulls and ints for column A as `optional int32 A`;
3. Job 3 reads both files and chokes because it can't reconcile the schemas.

This may seem contrived at first because there is an easy merge rule, but in distributed use cases like MR it is common not to read the file schema on the client/driver. If there is no merge or requested schema, then it is possible for nodes to use the file schema, lose the NULL annotation, and die because string and int can't be merged during a shuffle.
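
For illustration, one way such a merge rule could work (hypothetical schemas; assuming NULL simply yields to whatever concrete type is observed elsewhere):

```
// file written by job 1: only nulls observed for A
optional binary A (NULL);

// file written by job 2: mixed nulls and ints for A
optional int32 A;

// merged schema under the assumed rule
optional int32 A;
```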

Am I overthinking this?

Member Author

This is correct. This is why Arrow has a null type in the first place.
I'm wondering what the impact of adding a primitive type would be.
The fact that the value is always null simplifies things a bit.

Member Author

I think the merge logic stays the same in distributed mode as long as the full type is carried along and we know it's null.

Contributor

We'll have to make sure this works. I think our Schema union logic isn't currently based on logical types. For example, DECIMAL backed by binary merged with one backed by int64 probably fails. Probably not a big deal.
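
A hypothetical illustration of that DECIMAL case (column name and precision/scale are made up here):

```
// file 1: DECIMAL backed by binary
optional binary amount (DECIMAL(10,2));

// file 2: the same logical type backed by int64
optional int64 amount (DECIMAL(10,2));
```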

The `NULL` type can be used to annotate a column that is always null.
(Similar to the Null type in Avro.)
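
For example, a column inferred as always-null might be declared as follows (the column name and the backing primitive are illustrative; this change does not prescribe a particular backing primitive):

```
// a column for which only null values were observed during schema inference
optional binary unknown_col (NULL);
```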
8 changes: 7 additions & 1 deletion src/main/thrift/parquet.thrift
@@ -174,7 +174,13 @@ enum ConvertedType {
* particular timezone or date.
*/
INTERVAL = 21;


/**
 * Annotates a column that is always null.
 * Sometimes, when discovering the schema of existing data,
 * the values of a column are always null and there is no
 * type information.
 */
NULL = 25;
Contributor

Should this specify what underlying type to use for NULL?

}

/**