PARQUET-601: Add support to configure the encoding used by ValueWriters #342
Conversation
@isnotinvain / @rdblue - please take a look..
just to record an offline discussion we just had: I think the goal of this is more along the lines of creating an encoding selection strategy which gets to choose encodings / encoding implementations dynamically at runtime (a rough sketch of what such an interface might look like follows this comment).
What this allows is for us to write our own logic not only for "what type gets what encoding" but also for "how do I change my mind about an encoding based on the data I'm seeing" (aka fallback), in a generic way that isn't tied to just dictionaries.
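A rough sketch of the kind of selection strategy described above; the interface and method names here are invented for illustration (the PR ultimately expresses the idea through ValuesWriterFactory):

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.values.ValuesWriter;

// Hypothetical interface (not part of parquet-mr): a pluggable strategy that picks,
// and can later revise, the encoding used for a column.
public interface EncodingSelectionStrategy {
  // Choose the initial ValuesWriter for a column, based on its type / path.
  ValuesWriter newValuesWriter(ColumnDescriptor column);

  // Decide, based on what has been written so far, whether to switch writers -
  // a generalization of the existing dictionary fallback.
  boolean shouldFallBack(ColumnDescriptor column, ValuesWriter current, long valuesWritten);
}
```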
Is the plan then that you would only be able to choose the encoding types via the Strategies that you mentioned? Or would there also be some sort of lightweight way to specify the encoding type explicitly for certain columns (as suggested by the initial comment)? I ask because this would be super useful to have in Spark, and I can imagine tons of situations where information is known about the data beforehand and people would like to be able to explicitly specify the column encodings. That's not to say that the strategies wouldn't be useful (they would be), or that you couldn't jerry-rig an explicit setting into a strategy, but I think it would be useful to have explicit setting be possible in some first-class way. (If either of you is closer to the Spark implementation, feel free to point out if this is irrelevant, but I know Spark's parquet support depends on parquet-mr, so I suspect whatever you do here will affect me if I want to include support in Spark.)
Hi @hkothari, what I was thinking is this: I'm not entirely convinced that it's a useful feature for users to be able to easily configure what encoding is used for what type of column, e.g. "use delta encoding for integers". The reason I say that is, a lot of the encodings depend on what's actually in the data (are the integers close together / sorted, or are they random ids?). And what happens when you've got 2 columns of the same type, but with different attributes (an int column that's sorted, and one that's random)?
One way would be to let users choose an encoding per field, instead of per type. But I worry that this will spiral into way too much to reasonably configure. Instead, I would rather we make the heuristics inside of parquet good enough that they can choose a good encoding on their own, as the data is being looked at. And because our first set of heuristics will probably not be the best ones, we can let users bring their own strategy, though ideally we would be folding those strategies back into parquet-mr if they are good / better than what we have.
That said, as you pointed out, we could implement a constant strategy that always picks the one encoding you asked it to, and we could make a shorthand for using that strategy, as this PR initially planned to. Do you think that is a feature you would still find useful, even if we have a decent automatic selection strategy? I think if we're going to do that, it should probably be per-field, not per-type.
I'm not proposing per column type. I don't think that's nearly as useful as per actual column. In a lot of the cases I've worked in, you either know certain columns will have certain distributions beforehand (I'm receiving this data sorted by "purchase_date", so delta-encode my "purchase_date" column but not my other date columns; or I know one column is pretty clustered but has tons of distinct values, so RLE it) or you don't. In the unknown case a strategy makes sense. But in the known case, which happens fairly often in my experience, a strategy can be slower on writes (a metric people have complained about already for parquet), or in the case that something goes wrong it's just suboptimal. In those cases, it's helpful to be able to explicitly override to what you know is more optimal. I'm open to doing this as an ExplicitStrategy or something, as long as it's flexible enough.
Thanks for chiming in @hkothari, good to get some additional feedback :-). I like the idea of being able to explicitly specify the encoding for a given column type (int / bool / ...). One of the reasons (apart from the ones already discussed above) is that users could want to optimize for different variables than the ones we know at write time. For example, you might want to ensure that your read path is not super expensive, and that might conflict with the write-side constraint of minimizing size on disk. We could possibly tackle this with a sophisticated WriteSelectionStrategy (that potentially accounts for this) but it might end up being easier to specify manually and prototype.
I'm a bit wary though of being able to override on a per-column basis (rather than per type). It definitely is more accurate, but what I've seen is that most of our datasets are really large, which makes this level of overriding painful. (It could conceivably be useful to others with smaller schemas though.) I was thinking of maybe using @isnotinvain's idea with a possibility of allowing column type overrides.
In order to keep the first pass of this PR simple and constrained, I think we should first build the selection strategy interface and a way to set it in the config, eg:
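For illustration (the config key that appears later in this thread for exactly this purpose is parquet.writer.factory-override):
parquet.writer.factory-override = "org.apache.parquet.hadoop.MyValuesWriterFactory"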
Just doing this part puts us in a good position to add more built-in strategies, like the manual-per-column strategy or the per-type strategy and so on. As for whether to specify a single global strategy or a strategy per-type, I was initially thinking just one strategy that handles all types, but one strategy per type would also be fine.
@isnotinvain / @hkothari, I've updated the PR based on our discussion. Here's how things work now:
Here's how things can be set up:
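A minimal sketch of what such a factory could look like, assuming the interface shape suggested by the diff fragments on this page (an initialize call taking ValuesWriterFactoryParams plus a per-column newValuesWriter); the class name and method bodies below are illustrative, not taken from the PR:

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.values.ValuesWriter;
import org.apache.parquet.column.values.factory.ValuesWriterFactory;
import org.apache.parquet.column.values.factory.ValuesWriterFactoryParams;

// Illustrative custom factory (names assumed): picks a ValuesWriter per column.
public class MyValuesWriterFactory implements ValuesWriterFactory {
  private ValuesWriterFactoryParams params;

  @Override
  public void initialize(ValuesWriterFactoryParams params) {
    // keep the writer settings (page size, allocator, dictionary thresholds, ...)
    this.params = params;
  }

  @Override
  public ValuesWriter newValuesWriter(ColumnDescriptor column) {
    // a real factory would switch on column.getType() (and possibly the column path)
    // and return a concrete ValuesWriter for the chosen encoding
    throw new UnsupportedOperationException("sketch only");
  }
}
```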
This creates a factory that is then handed to ParquetProperties, which uses it to build the ValuesWriters for each column.
@@ -97,21 +84,27 @@ public static WriterVersion fromString(String name) {
  private final int maxRowCountForPageSizeCheck;
  private final boolean estimateNextSizeCheck;
  private final ByteBufferAllocator allocator;

  private final int initialSlabSize;
  private final ValuesWriterFactory valuesWriterFactory;

  private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPageSize, boolean enableDict, int minRowCountForPageSizeCheck,
What's the story for configurable properties here? If one writes a custom ValuesWriterFactory, it seems totally reasonable that they would have other non-default settings that they would configure. I'm not really sure how this is supported in Hadoop (if at all) but I would imagine it being something like fetching all settings under parquet.writer.writerProperties.* or something.
Not sure I understand your question completely, but if you take a look at my last commit (503958a), you can see that there's a way to configure ValuesWriterFactories. To do so, you write your special ValuesWriterFactory (similar to what I've done in the unit tests) and make it extend ConfigurableFactory. When you do so, you have the Hadoop Config passed in, which you can read & use. I did mull doing something like reading everything under parquet.writer.writerProperties.* but felt this was a cleaner approach.
Ahh, yeah I totally missed ConfigurableFactory, that works perfectly.
ConfigurableFactory is now a factory that is Configurable, right?
@isnotinvain - updated based on your comments. Do take a look when you get the time.
* Due to this, they must provide a default constructor.
* Lifecycle of ValuesWriterFactories is:
* 1) Created via reflection while creating a {@link org.apache.parquet.column.ParquetProperties}
* 2) If the factory is Configurable (needs Hadoop conf), that is set, initialize is also called. This is done
Maybe let's clarify here that if the factory implements Configurable, its setConf method will be called, just so the reader understands that to opt in to getting the config they must implement Configurable.
Sure, will do
+1 for me, one minor comment about clarifying some docs, but LGTM
@rdblue / @julienledem - do you guys have the time to take a look?
Thanks for taking a detailed look @isnotinvain :-)
ValuesWriterFactoryParams params =
  new ValuesWriterFactoryParams(writerVersion, initialSlabSize, pageSizeThreshold, allocator,
    enableDictionary, dictionaryPageSizeThreshold);
valuesWriterFactory = writerFactory;
Nit: the convention is to use this.x when setting instance variable x.
I made a few comments, but I have two bigger issues as well:
First, What about adding the reader methods from
Second, That means we shouldn't make the config setting public or expose it through any public API. This makes more sense in an SPI. Maybe we should start a module for that? It would be good to document extension points as SPI interfaces, like the ColumnReadStore, PageReadStore, and ValuesWriterFactory.
@rdblue Thanks for taking a look.
I can take a stab at refactoring things from the
I wasn't aware of us requiring that
Shall try and think of any other potential options.
We don't want users to be able to provide their own column formats (like inventing a new storage format), but I thought the point of this PR was to allow users to plug in encoding selection strategies, including things that maybe change their mind mid-encoding using heuristics or something. I don't know whether that should be part of the public API or not; parquet doesn't actually distinguish between public and private API yet as far as I know (it would be great if it did). I think the most important thing is that parquet developers be able to easily swap in / test different encoding strategies. It's probably fine if the only way to do that is to fork parquet, as long as you don't have to mess with tons of layers of plumbing, so keeping an API like this private seems fine because it still makes experimenting w/ encodings easier for parquet developers.
If you guys are happy keeping the API private, then I think that makes the most sense. Then there's no need for the reflection or extra options in the OutputFormat.
Yeah, I don't mind going with that. I'll yank out the code to configure the ValuesWriterFactory via Hadoop config + creation with reflection. If folks want to override + test out other strategies, they can implement their own ValuesWriterFactory and update their code to use it. Still a manual step, but like Alex pointed out it should be fairly small.
@rdblue, a couple of questions on this:
Now if we want to in turn pass the ParquetProps to the factory we need to do:
@piyushnarang, #1 sounds fine. For #2, you're just recreating ParquetProperties with a different name and making the existing one a useless wrapper. I get that you pass the factory in to the properties only for it to configure the factory. That's so that we can maintain backward-compatibility. In the future, we would remove the factory methods from ParquetProperties so that isn't needed. I think this is still better.
@rdblue sounds good, we can tackle decoupling the factory methods from ParquetProperties in the future. I've put out an update that addresses your comments. Do take a look when you get the time. Thanks!
if (factory instanceof Configurable) {
  Configurable configurableFactory = (Configurable) factory;
  configurableFactory.setConf(conf);
}
We should not set the Hadoop configuration in a static member.
We could just create a new DefaultValuesWriterFactory every time instead.
possibly create method ParquetProperties.getValuesWriterFactory(Configuration conf)
Does this mean that configuration happens by subclassing ParquetOutputFormat?
Hmm, I don't think creating a ParquetProperties.getValuesWriterFactory(Configuration conf) is possible, because ParquetProperties is in parquet-column (which doesn't depend on Hadoop, so we don't have Configuration there), so we'll have to do this in ParquetOutputFormat.
Configuration happens by creating a ValuesWriterFactory that implements Configurable:
public class MyConfigurableValuesWriterFactory implements ValuesWriterFactory, Configurable {
...
}
Now when we create a new ValuesWriterFactory in getValuesWriterFactory() via getRecordWriter(...), we pass the config object to the factory there.
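For illustration, that opt-in might look roughly like the following (a sketch building on the stub above; the ValuesWriterFactory methods are omitted, and only Hadoop's Configurable methods, setConf/getConf, are shown):

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.values.factory.ValuesWriterFactory;

// Sketch only: implementing Hadoop's Configurable is what makes the
// "factory instanceof Configurable" check in ParquetOutputFormat (shown above)
// hand the job Configuration to the factory.
public abstract class MyConfigurableValuesWriterFactory implements ValuesWriterFactory, Configurable {
  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf; // read any custom settings for this factory from the job config
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  // ValuesWriterFactory methods (initialize, newValuesWriter) omitted in this sketch.
}
```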
Don't mind updating this so that the getValuesWriterFactory method is non-static. Currently DEFAULT_VALUES_WRITER_FACTORY is a static declared in ParquetProperties, as Alex preferred that in one of the prior reviews, but we can revisit that if you feel strongly.
- since DefaultValuesWriterFactory does not implement Configurable maybe we just remove this if statement?
- my question on configuration was how you decide which ValuesWriterFactory to use.
@julienledem Yeah, so 1) that was one of the questions I posed to Ryan above (see the PR notes). Copying it again here:
- Currently the Configurable interface is present to allow folks to pass Hadoop config to the ValuesWriterFactory. It's not needed for the DefaultValuesWriterFactory, but I was thinking of leaving it in so that the hooks are in place to easily pass config while testing out ad-hoc ValuesWriterFactories. Ryan felt that it would be OK to leave it in place.
- The original approach was to configure which ValuesWriterFactory to use via Hadoop config. Something like:
parquet.writer.factory-override = "org.apache.parquet.hadoop.MyValuesWriterFactory"
In ParquetOutputFormat we were creating MyValuesWriterFactory by reflection and using that to create new ValuesWriters for various columns.
@rdblue wasn't keen on this, as ValuesWriter is supposed to be a private class internal to Parquet, so he didn't want us to make the ValuesWriterFactory configurable. So we decided to yank the configuration part of it out and leave the basic plumbing in place. Right now, if you wrote your own custom ValuesWriterFactory that you wanted to test out, you'd have to update your Parquet code base to use that ValuesWriterFactory (instead of the DefaultValuesWriterFactory) in ParquetProperties / ParquetOutputFormat. This is easier than what it was before (as then the values writer creation code was not decoupled from ParquetProperties) but not as flexible as our PR proposal initially was (being able to allow users to configure things).
thanks @piyushnarang for bringing me up to speed :)
- We should remove that if statement since in the current PR it can never be true. It did make sense if you could configure a class name instantiated with newInstance() but right now it is just dead code.
- If we add a configuration in the future, it should be expressed in terms of encodings so that only valid Parquet files can be written. Once you are happy with your current experiment I think it will become clearer what the configuration would look like.
Hmm ok, I can remove that piece of code. I'll add a comment in the ValuesWriterFactory interface to indicate that if someone wants to use Hadoop config in their factory they need to hook it up in ParquetOutputFormat, since it's not very obvious.
We've gone back and forth a bit on the desired approach.
My initial implementation expressed configuration in terms of encodings:
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
"parquet.writer.encoding-override.int32" = "plain"
@isnotinvain suggested the reflection-based approach as it was more flexible - it allows manual experimentation as well as potentially automated encoding selection (see notes above).
Either this approach or the reflection based override method works for us right now (I have a subclass of ValuesWriterFactory that reads the encoding for a given type from config in a fork).
Anyway, I think for now we can go ahead with what we have right now (not exposing any of these in config) after I remove the Configurable code. This will help us break up some of the coupling in ParquetProperties. We could discuss which of these approaches is more appealing to various groups of Parquet users, and I'd be happy to add a PR. Let me know if this sounds reasonable.
@piyushnarang: merging the current approach sounds good to me.
Type or column based configuration can be added in a follow-up. I suspect that sometimes users might want to be able to force a specific encoding for a given column.
@julienledem - updated the PR to remove the Configurable hook.
@rdblue and @julienledem there's been a lot of back and forth on this PR, are there any remaining issues? Thanks!
+1
Thanks @julienledem. I can spin up a different thread to discuss how we want to configure type / column based configuration. Any preferences on how / where? Email (parquet-dev@) / GitHub issue / JIRA?
@piyushnarang: JIRA is good (we don't use GitHub issues).
@piyushnarang @isnotinvain @rdblue good to go?
@julienledem Ok, I'll spin up a JIRA and cc you, Alex & Ryan.
+1
@julienledem Tested this out and I think it is good to go from my end.
I will merge this, but I'd like a +1 / +0 / -0 from @rdblue first.
Ping @rdblue - can you take a look?
Will do, sorry I missed it when Alex pinged me earlier.
+1 Thanks, @piyushnarang!
@piyushnarang @isnotinvain @rdblue @hkothari
However, it's rather hard to configure this correctly in the hadoop
Instead, I have to copy and paste the entire
So... would it be possible to slightly modify the
Thoughts? Should I turn this into a new JIRA issue? This seems to be related to an existing one as well: https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-796.
(If PARQUET-796 is resolved, then that would also fix my use case, as then I wouldn't have to specify a different
Context:
Parquet is currently structured to choose the appropriate value writer based on the type of the column as well as the Parquet version. As of now, the writers (and hence the encoding) for each data type are hard-coded in the Parquet source code.
This PR adds support for being able to override the encodings per type via config. That allows users to experiment with various encoding strategies manually as well as enables them to override the hardcoded defaults if they don't suit their use case.
We can override encodings per data type (int32 / int64 / ...).
Something along the lines of:
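(This is the pattern quoted in the review discussion above:)
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"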
As an example:
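(Also quoted in the discussion above:)
"parquet.writer.encoding-override.int32" = "plain"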
When a primary + fallback need to be specified, we can do the following:
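(Illustrative only; the encoding names below are an assumed example of the primary-plus-fallback form, not taken verbatim from the PR:)
"parquet.writer.encoding-override.binary" = "plain_dictionary,plain"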
In such cases we can mandate that the first encoding listed must allow for Fallbacks by implementing RequiresFallback.
PR notes: