generating schemas from arbitrary map[string]interface{} (parquet, avro) #1353
Hey @loicalleyne, there's potential here but we'd need to be very specific about how we handle a range of edge cases.
On a high level I'm thinking the first step is that the output writer would need a new configuration option for schema, where the writer will accept a schema passed by the input or processor preceding it.
I wrote a proof of concept that can take a map and output what I think is a compliant YAML schema for the parquet encoder, plus the beginnings of an Avro-schema-to-Parquet-schema converter. It looks OK for primitive types; logical type support would need some more work, but as far as I can make out they're not supported in the encoder yet anyway. Would appreciate your feedback if you have time to look at it.
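Roughly, the inference could look something like this (a minimal sketch; the `Field` shape and the type names are illustrative, not the encoder's actual config):

```go
// A minimal, hypothetical sketch of map -> parquet-style YAML schema
// inference. The Field shape and type names are assumptions for
// illustration, not the real encoder config.
package main

import (
	"fmt"
	"sort"

	"gopkg.in/yaml.v3"
)

type Field struct {
	Name   string  `yaml:"name"`
	Type   string  `yaml:"type,omitempty"`
	Fields []Field `yaml:"fields,omitempty"`
}

// inferFields walks a decoded JSON map and assigns parquet primitive
// types based on the Go types produced by encoding/json.
func inferFields(m map[string]interface{}) []Field {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output despite random map order

	var out []Field
	for _, name := range keys {
		f := Field{Name: name}
		switch v := m[name].(type) {
		case bool:
			f.Type = "BOOLEAN"
		case float64: // encoding/json decodes all JSON numbers as float64
			f.Type = "DOUBLE"
		case string:
			f.Type = "UTF8"
		case map[string]interface{}:
			f.Fields = inferFields(v) // nested group
		default:
			f.Type = "BYTE_ARRAY" // fallback for anything unrecognised
		}
		out = append(out, f)
	}
	return out
}

func main() {
	sample := map[string]interface{}{
		"id":   float64(1),
		"name": "foo",
		"meta": map[string]interface{}{"ok": true},
	}
	b, _ := yaml.Marshal(inferFields(sample))
	fmt.Print(string(b))
}
```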
Without looking at the code, just curious: how are you planning to handle cases when you need to infer the schema for something like this?

```json
{
  "foo": [
    1,
    {"bar": {"x": "y"}},
    2,
    "z"
  ]
}
```
Yeah, the supported types are listed here: https://www.benthos.dev/docs/guides/bloblang/methods#type. Message objects have the
As far as I know Parquet only supports a union of ["primitive_type", "null"] or ["null", "primitive_type"]
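On that note, a rough sketch of how such a two-branch Avro union could collapse into an "optional" field during conversion (the `Field` struct and mapping table here are hypothetical, mirroring the earlier sketch):

```go
// Hypothetical Avro-union -> optional-field mapping: a two-branch
// ["type", "null"] or ["null", "type"] union collapses to the non-null
// branch with Optional set; anything else is rejected.
package main

import "fmt"

type Field struct {
	Name     string `yaml:"name"`
	Type     string `yaml:"type,omitempty"`
	Optional bool   `yaml:"optional,omitempty"`
}

var avroToParquet = map[string]string{
	"boolean": "BOOLEAN", "int": "INT32", "long": "INT64",
	"float": "FLOAT", "double": "DOUBLE", "string": "UTF8", "bytes": "BYTE_ARRAY",
}

func unionToField(name string, union []interface{}) (Field, error) {
	var nonNull []string
	for _, branch := range union {
		s, ok := branch.(string)
		if !ok {
			return Field{}, fmt.Errorf("%s: complex union branches not supported", name)
		}
		if s != "null" {
			nonNull = append(nonNull, s)
		}
	}
	if len(union) != 2 || len(nonNull) != 1 {
		return Field{}, fmt.Errorf("%s: only two-branch [type, null] unions map to parquet", name)
	}
	pt, ok := avroToParquet[nonNull[0]]
	if !ok {
		return Field{}, fmt.Errorf("%s: unsupported avro primitive %q", name, nonNull[0])
	}
	return Field{Name: name, Type: pt, Optional: true}, nil
}

func main() {
	f, err := unionToField("amount", []interface{}{"null", "double"})
	fmt.Println(f, err) // {amount DOUBLE true} <nil>
}
```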
As an example of what gets inferred from an arbitrary map, here are the input and output examples: (examples omitted)
Just saw 07ed81b. I'd like to work on enabling parquet output without having to specify a schema in the config file, and I was wondering:
- If input messages are structured, is the structure guaranteed to stay the same across all messages from that input? Does it depend on the input source?
- Is it preferable to get an Avro or other schema from the input and pass that as metadata, to be dynamically converted to a parquet schema?
- If so, when there's a Bloblang processor in the pipeline, would the input's metadata be mutated to add new mappings or to change field types?

My standalone parquet YAML schema generator proof of concept tries to cover both the arbitrary-map and the defined-schema scenarios; which one dovetails better with the way Benthos works?
When reading from a source (e.g. avro-ocf, parquet) where the reader outputs a map[string]interface{}, it would be great if it weren't necessary to redefine the output schema; instead, the input schema could be converted to the equivalent schema for the output writer.
Could this be done by iterating over the map and using type assertions to assemble the schema for different writers? And then perhaps configurable filters/regex for mapping field-name + primitive-type combinations to logical types (e.g. field-name: (event[[:graph:]]*) type:INT64 logical-type:TIMESTAMP unit:MILLIS).
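To make the filter idea concrete, a hedged sketch of such a rule table (the `LogicalRule` shape and all names are invented for illustration):

```go
// Hypothetical regex-based rule table: each rule pairs a field-name
// pattern with a primitive type and the logical type to annotate it with.
package main

import (
	"fmt"
	"regexp"
)

type LogicalRule struct {
	NamePattern *regexp.Regexp
	Type        string // primitive type the rule applies to, e.g. "INT64"
	LogicalType string // e.g. "TIMESTAMP"
	Unit        string // e.g. "MILLIS"
}

// applyRules returns the logical type annotation for a field if any
// rule matches both its name and its primitive type.
func applyRules(rules []LogicalRule, name, primType string) (string, bool) {
	for _, r := range rules {
		if r.Type == primType && r.NamePattern.MatchString(name) {
			return r.LogicalType + "(" + r.Unit + ")", true
		}
	}
	return "", false
}

func main() {
	rules := []LogicalRule{{
		NamePattern: regexp.MustCompile(`event[[:graph:]]*`),
		Type:        "INT64",
		LogicalType: "TIMESTAMP",
		Unit:        "MILLIS",
	}}
	if lt, ok := applyRules(rules, "event_ts", "INT64"); ok {
		fmt.Println("event_ts ->", lt) // event_ts -> TIMESTAMP(MILLIS)
	}
}
```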
Use case brainstorm:
- data stream sinks to object storage
- transferring OLTP DB data to an OLAP DB using federated tables (e.g. BigQuery external tables)
- converting from row-based to column-based formats