Skip to content
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 25 additions & 1 deletion api/src/main/java/org/apache/iceberg/Schema.java
Original file line number Diff line number Diff line change
Expand Up @@ -40,12 +40,16 @@

/**
* The schema of a data table.
* <p>
* Schema ID will only be populated when reading from/writing to table metadata,
* otherwise it will be default to 0.
*/
public class Schema implements Serializable {
private static final Joiner NEWLINE = Joiner.on('\n');
private static final String ALL_COLUMNS = "*";

private final StructType struct;
private final int schemaId;
private transient BiMap<String, Integer> aliasToId = null;
private transient Map<Integer, NestedField> idToField = null;
private transient Map<String, Integer> nameToId = null;
Expand All @@ -54,6 +58,7 @@ public class Schema implements Serializable {
private transient Map<Integer, String> idToName = null;

public Schema(List<NestedField> columns, Map<String, Integer> aliases) {
this.schemaId = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a constant for the default schema ID. That way we can reference it in other places when we fill in the default.

this.struct = StructType.of(columns);
this.aliasToId = aliases != null ? ImmutableBiMap.copyOf(aliases) : null;

Expand All @@ -62,12 +67,21 @@ public Schema(List<NestedField> columns, Map<String, Integer> aliases) {
}

public Schema(List<NestedField> columns) {
this(0, columns);
}

public Schema(int schemaId, List<NestedField> columns) {
this.schemaId = schemaId;
this.struct = StructType.of(columns);
lazyIdToName();
}

public Schema(NestedField... columns) {
this(Arrays.asList(columns));
this(0, Arrays.asList(columns));
}

public Schema(int schemaId, NestedField... columns) {
this(schemaId, Arrays.asList(columns));
}

private Map<Integer, NestedField> lazyIdToField() {
Expand Down Expand Up @@ -105,6 +119,16 @@ private Map<Integer, Accessor<StructLike>> lazyIdToAccessor() {
return idToAccessor;
}

/**
* Returns the schema ID for this schema.
* <p>
* Note that schema ID will only be populated when reading from/writing to table metadata,
* otherwise it will be default to 0.
*/
public int schemaId() {
return this.schemaId;
}

/**
* Returns an alias map for this schema, if set.
* <p>
Expand Down
15 changes: 15 additions & 0 deletions api/src/main/java/org/apache/iceberg/types/TypeUtil.java
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
.fields());
}

/**
* Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: function should be outside the bracket

Copy link
Contributor Author

@yyanyy yyanyy Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually I think either way may make sense? We either link nextId function or nextId to NextId interface, and I think both are valid interpretations? I'm also a bit hesitant to change since I copied this from another method, and there are a few others that use this style.

*
* @param schemaId an ID assigned to this schema
* @param schema a schema
* @param nextId an id assignment function
* @return a structurally identical schema with new ids assigned by the nextId function
*/
public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId) {
return new Schema(schemaId, TypeUtil
.visit(schema.asStruct(), new AssignFreshIds(nextId))
.asNestedType()
.fields());
}

/**
* Assigns ids to match a given schema, and fresh ids from the {@link NextID nextId function} for all other fields.
*
Expand Down
24 changes: 24 additions & 0 deletions core/src/main/java/org/apache/iceberg/SchemaParser.java
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ public class SchemaParser {
private SchemaParser() {
}

private static final String SCHEMA_ID = "schema-id";
private static final String TYPE = "type";
private static final String STRUCT = "struct";
private static final String LIST = "list";
Expand All @@ -58,9 +59,17 @@ private SchemaParser() {
private static final String VALUE_REQUIRED = "value-required";

static void toJson(Types.StructType struct, JsonGenerator generator) throws IOException {
toJson(struct, null, generator);
}

static void toJson(Types.StructType struct, Integer schemaId, JsonGenerator generator) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the struct toJson methods aren't called outside of this file. What do you think about making them private?

generator.writeStartObject();

generator.writeStringField(TYPE, STRUCT);
if (schemaId != null) {
generator.writeNumberField(SCHEMA_ID, schemaId);
}

generator.writeArrayFieldStart(FIELDS);
for (Types.NestedField field : struct.fields()) {
generator.writeStartObject();
Expand Down Expand Up @@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
}
}

public static void toJsonWithId(Schema schema, JsonGenerator generator) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only called in TableMetadataParser and I don't see why there should be two toJson methods. It isn't a problem for the v1 metadata to be written with the schema ID because extra fields are ignored. So I think that this method isn't needed and toJson should always pass the ID to the struct writer.

toJson(schema.asStruct(), schema.schemaId(), generator);
}

public static void toJson(Schema schema, JsonGenerator generator) throws IOException {
toJson(schema.asStruct(), generator);
}
Expand Down Expand Up @@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
}
});
}

public static Schema fromJsonWithId(int schemaId, JsonNode json) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if all of the parsing methods returned the correct ID instead of having some that use a version that ignores the ID.

I think you chose to use two methods because there are some situations where the ID wasn't written and you don't want the schema to have the wrong ID. But, because the fromJson method uses the Schema constructor that does not pass the ID, the schemas will have the default ID, 0, anyway.

As long as the schemas have an ID, I think there is no need to have multiple parser methods, both with and without ID. Also, we need to make sure that the IDs are consistent when reading the schema from files that were written before IDs. That should just be manifest files, where the schema is embedded in the header.

For manifests, we need to add the schema ID when writing (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L190). And when reading, we will need to update how the spec is constructed. If the spec is loaded by ID then it's fine. Otherwise we should load the schema by ID to parse the spec, and if that doesn't work we should parse both the schema and the spec (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L115-L120). I think we should probably also try to take the parsed schema and find the correct one from table metadata so that we don't have an incorrect ID anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed fromJsonWithId and update fromJson to use the constructor with schema Id, if Id present in Json blob, or default to constructor that doesn't require Id. Please let me know if it is expected!

Regarding persisting schema ID in manifest, since schema and partition-spec in both v1 and v2 are required in manifest metadata, I think it is unlikely that we will miss schema itself but has schema ID for lookup? And I think in manifest reader, schema is only used for constructing partition spec, so that having a correct schema ID might not be super necessary.

Today there are a lot of places that rely on toJson/fromJson for schema, e.g. Avro metadata, Hive table properties, scan tasks. In this PR, except for table metadata parser, those are all using the toJson version that do not write schemaId at all, which means that when the strings are read back to Schema object they will always have id of 0; and this goes back to the persisting schema ID question above, that we do not persist id in the manifest header, so we are not persisting an incorrect ID, but instead constructed a schema without ID that default to be 0 like everywhere else.

My original thinking was that schema Id should always be ignored/considered inaccurate unless we directly get it from table metadata, as after all kinds of projections it might be hard to track if the ID is correct. And the ID is only used for lookup, so if we have the schema struct, it's fine to have incorrect ID since we are unlikely to use it. But I might be overcomplicating the problem. Do you think we are able to guarantee the ID to be right everywhere, and do you have comment about the current state of toJson variations (that only the one used by metadata parser writes out ID)?

Type type = typeFromJson(json);
Preconditions.checkArgument(type.isNestedType() && type.asNestedType().isStructType(),
"Cannot create schema, not a struct type: %s", type);
return new Schema(schemaId, type.asNestedType().asStructType().fields());
}

public static Schema fromJsonWithId(JsonNode json) {
return fromJsonWithId(JsonUtil.getInt(SCHEMA_ID, json), json);
}
}
Loading