
PARQUET-601: Add support to configure the encoding used by ValueWriters #342

Closed

Conversation

piyushnarang

Context:

Parquet is currently structured to choose the appropriate value writer based on the type of the column as well as the Parquet version. As of now, the writers (and hence encodings) for each data type are hard-coded in the Parquet source code.

This PR adds support for overriding the encodings per type via config. That allows users to experiment with various encoding strategies manually, and also enables them to override the hard-coded defaults if those don't suit their use case.

We can override encodings per data type (int32 / int64 / ...).
Something along the lines of:

parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"

As an example:

"parquet.writer.encoding-override.int32" = "plain"
(Chooses Plain encoding and hence the PlainValuesWriter).

When a primary + fallback need to be specified, we can do the following:

"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
(Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY as the fallback, and hence creates a FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).)

In such cases we can mandate that the first encoding listed must allow for Fallbacks by implementing RequiresFallback.
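As a sketch of how such an override string could be parsed (the property name follows the examples above; the helper class and method names here are purely illustrative):

import java.util.Locale;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.Encoding;

// Hypothetical helper: read "parquet.writer.encoding-override.<type>" and return the
// requested encodings in order (primary first, optional fallback second).
public final class EncodingOverrides {
  public static Encoding[] parseOverride(Configuration conf, String type) {
    String value = conf.get("parquet.writer.encoding-override." + type);
    if (value == null) {
      return new Encoding[0]; // nothing configured, keep Parquet's hard-coded defaults
    }
    String[] names = value.split(",");
    Encoding[] encodings = new Encoding[names.length];
    for (int i = 0; i < names.length; i++) {
      // encoding names are lower case in the config, e.g. "rle_dictionary" -> RLE_DICTIONARY
      encodings[i] = Encoding.valueOf(names[i].trim().toUpperCase(Locale.ROOT));
    }
    return encodings;
  }
}

When two encodings are listed, the result would then be wrapped into the primary + fallback writer pair described above.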

PR notes:

  • Restructured the ValuesWriter creation code: pulled it out of ParquetProperties into a new class and refactored the flow by type, as the existing code was getting hard to follow and adding the overrides would have made it harder. Added a bunch of unit tests to verify the ValuesWriter we create for combinations of type, Parquet version and dictionary on / off.
  • Added unit tests to verify parsing of the encoding overrides + creation of ValuesWriters based on these overrides.
  • Manually tested some encoding-override scenarios on Hadoop (both Parquet v1 and v2).

@piyushnarang
Author

@isnotinvain / @rdblue - please take a look..

@isnotinvain
Contributor

just to record an offline discussion we just had:

I think the goal of this is more along the lines of creating an encoding selection strategy which gets to choose encodings / encoding implementations dynamically at runtime. Something like:

interface ValuesWriterSelectionStrategy {
  ValuesWriter getWriter(ColumnDescriptor columnMetaData);
}

Then we can say

org.apache.parquet.value.writer.selection.strategy.int32=com.example.FooSelectionStrategy

Now our FooSelectionStrategy might itself be a factory of ValueWriters that implement fallback, or it might even be one that buffers the first N values and runs a heuristic on them, or it might "race" N different ValueWriters against each other and pick the best one at the end. Then, the parquet-core logic can ask the ValueWriter "so what Encoding did you wind up using" and that's what will go into the page metadata.

But what this allows is for us to write our own logic not only for "what type gets what encoding" but also "how do I change my mind about an encoding based on the data I'm seeing" (aka fallback) in a generic way that isn't tied to just dictionaries.
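A minimal sketch of what such a strategy could look like, assuming the interface sketched above; the new*Writer helpers are placeholders, and a real strategy could just as well buffer values or race several writers as described:

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.values.ValuesWriter;

// Hypothetical strategy: dispatch on the column's primitive type. The helper methods
// stand in for code that wires up concrete ValuesWriter implementations.
public class FooSelectionStrategy implements ValuesWriterSelectionStrategy {
  @Override
  public ValuesWriter getWriter(ColumnDescriptor column) {
    switch (column.getType()) {
      case INT32:
      case INT64:
        return newDeltaWriter(column);            // e.g. delta encoding for integer columns
      case BINARY:
        return newDictionaryWithFallback(column); // dictionary first, fall back if it grows too large
      default:
        return newPlainWriter(column);
    }
  }

  private ValuesWriter newDeltaWriter(ColumnDescriptor column) { return null; /* sketch */ }
  private ValuesWriter newDictionaryWithFallback(ColumnDescriptor column) { return null; /* sketch */ }
  private ValuesWriter newPlainWriter(ColumnDescriptor column) { return null; /* sketch */ }
}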

@hkothari

Is the plan then that you would only be able to choose the encoding types via the Strategies that you mentioned? Or would there also be some sort of lightweight way to specify the encoding type explicitly for certain columns (as suggested by the initial comment)?

I ask because this would be super useful to have in Spark, and I can imagine tons of situations where information is known about the data beforehand and people would like to be able to explicitly specify the column encodings. That's not to say that the strategies wouldn't be useful (they would be), or that you couldn't jerry-rig an explicit setting into a strategy, but I think it would be useful to have explicit setting be possible in some first-class way.

(If either of you guys is closer to the Spark implementation, feel free to point out if this is irrelevant, but I know Spark's Parquet support depends on parquet-mr, so I suspect whatever you guys do here will affect me if I want to include support in Spark.)

@isnotinvain
Contributor

Hi @hkothari,

So what I was thinking is, I'm not entirely convinced that it's a useful feature for users to be able to easily configure what encoding is used for what type of column, eg "use delta encoding for integers". The reason I say that is, a lot of the encodings depend on what's actually in the data (are the integers close together / sorted, or are they random ids?). And what happens when you've got 2 columns of the same type, but with different attributes (an int column that's sorted, and one that's random)? One way would be to let users choose an encoding per field, instead of per type. But I worry that this will spiral into way too much to reasonably configure.

Instead, I would rather we make the heuristics inside of Parquet good enough that they can choose a good encoding on their own as the data is being looked at. And because our first set of heuristics will probably not be the best ones, we can let users bring their own strategy, though ideally we would fold those strategies back into parquet-mr if they are as good as or better than what we have.

That said, as you pointed out, we could implement a constant strategy that always picks the one encoding you asked it to, and we could make a shorthand for using that strategy, as this PR initially planned. Do you think that is a feature you would still find useful, even if we have a decent automatic selection strategy? I think if we're going to do that, it should probably be per-field, not per-type.

@hkothari

I'm not proposing per column type. I don't think that's nearly as useful as per actual column. In a lot of the cases I've worked on, you either know certain columns will have certain distributions beforehand (I'm receiving this data sorted by "purchase_date", so delta-encode my "purchase_date" column but not my other date columns, or I know one column is pretty clustered but has tons of distinct values, so RLE it) or you don't.

In the unknown case a strategy makes sense. But in the known case, which happens fairly often in my experience, a strategy can be slower on writes (a metric people have complained about already for Parquet) or, if something goes wrong, just suboptimal. In those cases, it's helpful to be able to explicitly override to what you know is more optimal.

I'm open to doing this as an ExplicitStrategy or something, as long as it's flexible enough.

@piyushnarang
Author

Thanks for chiming in @hkothari, good to get some additional feedback :-). I like the idea of being able to explicitly specify the encoding for a given column type (int / bool / ...). One of the reasons (apart from the ones already discussed above) is that users could want to optimize for different variables apart from what we know at write time. For example, you might want to ensure that your read path is not super expensive, and that might conflict with the write-side constraint of minimizing size on disk. We could possibly tackle this with a sophisticated WriteSelectionStrategy (that potentially accounts for this), but it might end up being easier to specify manually and prototype.

I'm a bit wary though of being able to override on a per-column basis (rather than per type). It definitely is more accurate, but what I've seen is that most of our datasets are really large, which makes this level of overriding painful. (It could conceivably be useful to others with smaller schemas though.)

I was thinking of maybe using @isnotinvain 's idea with a possibility of allowing column type overrides.
Something along the lines of:

parquet.writer.selection.strategy.int32=com.example.ConfigSelectionStrategy
parquet.writer.encoding-override.int32=rle_dictionary,plain
...

ConfigSelectionStrategy can basically look up the actual encoding to use based on what is configured in parquet.writer.encoding-override.<type> and fall back to the default if it isn't configured there yet (similar to what was originally proposed in the PR). This could also be extended to per-column overrides if needed (we'll just need to ensure that adequate information is passed to the ConfigSelectionStrategy so that it knows which column it's trying to create a writer for).
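A rough sketch of that ConfigSelectionStrategy idea (all names here are hypothetical; the strategy interface is the one sketched earlier in this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.values.ValuesWriter;

// Hypothetical: look up "parquet.writer.encoding-override.<type>" for the column's type
// and build a writer for that encoding; otherwise delegate to the default strategy.
public class ConfigSelectionStrategy implements ValuesWriterSelectionStrategy {
  private final Configuration conf;
  private final ValuesWriterSelectionStrategy defaultStrategy;

  public ConfigSelectionStrategy(Configuration conf, ValuesWriterSelectionStrategy defaultStrategy) {
    this.conf = conf;
    this.defaultStrategy = defaultStrategy;
  }

  @Override
  public ValuesWriter getWriter(ColumnDescriptor column) {
    String type = column.getType().name().toLowerCase(); // e.g. "int32"
    String override = conf.get("parquet.writer.encoding-override." + type);
    if (override == null) {
      return defaultStrategy.getWriter(column); // no override configured: keep the default behaviour
    }
    return writerFor(override, column); // placeholder: map the configured encoding(s) to a writer
  }

  private ValuesWriter writerFor(String encodings, ColumnDescriptor column) {
    throw new UnsupportedOperationException("sketch only");
  }
}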

@isnotinvain
Contributor

In order to keep the first pass of this PR simple and constrained, I think we should first build the selection strategy interface and a way to set it in the config, e.g.:

parquet.writer.encoding.selection.strategy=com.example.MyCoolStrategy

Just doing this part puts us in a good position to add more built-in strategies, like the manual-per-column strategy or the per-type strategy and so on.

As for whether to specify a single global strategy or a strategy per-type, I was initially thinking just one strategy that handles all types, but one strategy per type would also be fine.

@piyushnarang
Author

@isnotinvain / @hkothari, I've updated the PR based on our discussion. Here's how things work now:

  • Created a ValuesWriterFactory interface that helps us create ValuesWriters.
  • Set up a default version of that interface with the refactored code from ParquetProperties to capture our current values writer instantiation logic. This is the factory used by default when no override is configured.
  • Also added an interface to help pass in Hadoop config to the factory if needed (see the TestParquetOutputFormatFactoryOverrides unit tests for an example).
  • For more sophisticated logic, e.g. per-column or per-type writer selection overrides, we can add config which the factory can then read to instantiate the appropriate ValuesWriters. Will follow up with a PR for that to achieve what I'd earlier implemented.

Here's how things can be set up:

parquet.writer.factory-override = "org.apache.parquet.hadoop.MyValuesWriterFactory"

This creates a factory, MyValuesWriterFactory, which is then invoked for every ColumnDescriptor to get a ValuesWriter.
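For illustration, such a custom factory could look roughly like the following, written against the interface approximately as it ended up after this PR (the exact parameter passed to initialize changed during review, so treat the signatures as assumptions):

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.values.ValuesWriter;
import org.apache.parquet.column.values.factory.DefaultValuesWriterFactory;
import org.apache.parquet.column.values.factory.ValuesWriterFactory;

// Sketch of a MyValuesWriterFactory: delegate to the default factory, which is where
// per-type or per-column overrides would hook in.
public class MyValuesWriterFactory implements ValuesWriterFactory {
  private final ValuesWriterFactory defaults = new DefaultValuesWriterFactory();

  @Override
  public void initialize(ParquetProperties properties) {
    defaults.initialize(properties);
  }

  @Override
  public ValuesWriter newValuesWriter(ColumnDescriptor descriptor) {
    // Inspect descriptor.getPath() / descriptor.getType() here to pick a different
    // writer for specific columns or types; otherwise keep Parquet's built-in choice.
    return defaults.newValuesWriter(descriptor);
  }
}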

@@ -97,21 +84,27 @@ public static WriterVersion fromString(String name) {
private final int maxRowCountForPageSizeCheck;
private final boolean estimateNextSizeCheck;
private final ByteBufferAllocator allocator;

private final int initialSlabSize;
private final ValuesWriterFactory valuesWriterFactory;

private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPageSize, boolean enableDict, int minRowCountForPageSizeCheck,


What's the story for configurable properties here? If one writes a custom ValuesWriterFactory, it seems totally reasonable that they would have other non-default settings that they would configure. I'm not really sure how this is supported in hadoop (if at all) but I would imagine it being something like fetching all settings under parquet.writer.writerProperties.* or something.

Author

Not sure I understand your question completely, but if you take a look at my last commit (503958a), you can see that there's a way to configure ValuesWriterFactories. To do so, you write your special ValuesWriterFactory (similar to what I've done in the unit tests) and make it extend ConfigurableFactory. When you do so, you have the Hadoop Config passed in, which you can read and use. I did mull doing something like reading everything under parquet.writer.writerProperties.*, but felt this was a cleaner approach.


Ahh, yeah I totally missed ConfigurableFactory, that works perfectly.

Contributor

ConfigurableFactory is now a factory that is Configurable, right?

@piyushnarang
Author

@isnotinvain - updated based on your comments. Do take a look when you get the time.

* Due to this, they must provide a default constructor.
* Lifecycle of ValuesWriterFactories is:
* 1) Created via reflection while creating a {@link org.apache.parquet.column.ParquetProperties}
* 2) If the factory is Configurable (needs Hadoop conf), that is set, initialize is also called. This is done
Contributor

Maybe let's clarify here that if the factory implements Configurable, its setConf method will be called. Just so the reader understands that to opt in to getting the config they must implement Configurable.

Author

Sure, will do

@isnotinvain
Contributor

+1 for me, one minor comment about clarifying some docs, but LGTM

@piyushnarang
Author

@rdblue / @julienledem - do you guys have the time to take a look?

@piyushnarang
Author

Thanks for taking a detailed look @isnotinvain :-)

ValuesWriterFactoryParams params =
    new ValuesWriterFactoryParams(writerVersion, initialSlabSize, pageSizeThreshold, allocator,
        enableDictionary, dictionaryPageSizeThreshold);
valuesWriterFactory = writerFactory;
Contributor

Nit: the convention is to use this.x when setting instance variable x.

@rdblue
Contributor

rdblue commented Jul 27, 2016

I made a few comments, but I have two bigger issues as well:

First, ParquetProperties has evolved over time so that it is currently doing the work of both new abstractions introduced in this PR, the write properties and the factory. I think it originally was intended to manage the properties, but convenience methods attached to it eventually made it into the factory it is today. I don't see a good reason to replace both of those functions for values writers.

What about adding the reader methods from ValuesWriterFactoryParams to ParquetProperties instead? That would keep those settings in one place. We could also use the existing builder, which avoids a big public constructor that will change over time. Then we would have a method that configures the ValuesWriterFactory with a ParquetProperties instance. I think in the long term, we want to change so that the ValuesWriterFactory is used primarily, rather than what we do today.
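Concretely, the proposed shape would be something like the following (builder method names assumed, not final):

ParquetProperties props = ParquetProperties.builder()
    .withWriterVersion(writerVersion)
    .withDictionaryEncoding(enableDictionary)
    .withValuesWriterFactory(factory)
    .build();

// the factory reads whatever settings it needs from the properties
factory.initialize(props);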

Second, ValuesWriter is not part of the public API. I don't think we want it to be considered public. We don't support anyone's custom ValuesWriter implementations, which would be made possible by this commit. In order to accomplish what you guys want, testing fallback rules and different strategies for writers, we do need to make this possible, but we shouldn't consider it part of the public API.

That means we shouldn't make the config setting public or expose it through any public API. This makes more sense in a SPI. Maybe we should start a module for that? It would be good to document extension points as SPI interfaces, like the ColumnReadStore, PageReadStore, and ValuesWriterFactory.

@piyushnarang
Author

@rdblue Thanks for taking a look. I can take a stab at refactoring things from ValuesWriterFactoryParams to ParquetProperties. I also prefer the separation of ParquetProperties and the ValuesWriterFactory code. ParquetProperties was getting too hard to read and doing too many things at once.

I wasn't aware that ValuesWriter is not meant to be part of the public API. I do however see value in being able to configure strategies to help choose these writers - it helps users test out various encoding strategies manually, and in the future we could also plug in sophisticated strategies that pick column encodings in an automated fashion. We could set up an SPI module, but we'd still not be able to configure which factory to use at runtime, right? We could have some annotations that we could look up by reflection to see which ValuesWriterFactories there are and which one is the chosen one, but that would need to be specified in code.

I'll try to think of other potential options.

@isnotinvain
Contributor

We don't want users to be able to provide their own column formats (like inventing a new storage format), but I thought the point of this PR is to allow users to plug in encoding selection strategies, including things that maybe change their mind mid-encoding using heuristics or something. I don't know whether that should be part of the public API or not; Parquet doesn't actually distinguish between public and private API yet as far as I know (it would be great if it did).

I think the most important thing is that parquet developers be able to easily swap in / test different encoding strategies. It's probably fine if the only way to do that is to fork parquet as long as you don't have to mess with tons of layers of plumbing, so keeping an API like this private seems fine because it still makes experimenting w/ encodings easier for parquet developers.

@rdblue
Contributor

rdblue commented Jul 27, 2016

If you guys are happy keeping the API private, then I think that makes the most sense. Then there's no need for the reflection or extra options in the OutputFormat.

@piyushnarang
Author

Yeah I don't mind going with that. I'll yank out the code to configure the ValuesWriterFactory via Hadoop config + creation with reflection. If folks want to override + test out other strategies they can implement their own ValuesWriterFactory and update their code to use their ValuesWriterFactory. Still a manual step but like Alex pointed out it should be fairly small.

@piyushnarang
Author

@rdblue couple of questions on this:

  1. Currently the Configurable interface is present to allow folks to pass Hadoop config to the ValuesWriterFactory. It's not needed for the DefaultValuesWriterFactory but I was thinking of leaving it in so that the hooks are in place to easily pass config while testing out ad-hoc ValuesWriterFactories.
  2. I was looking at passing the ParquetProps to the ValuesWriterFactory and it seems a bit convoluted:
    In ParquetOutputFormat.getRecordWriter we do:
ParquetProperties props = ParquetProperties.builder()
   .withPageSize(..)
   ...
   .withValuesWriterFactory(factory)
   .build();                 

Now if we want to in turn pass the ParquetProps to the factory we need to do valuesWriterFactory.initialize(props). The flow seems a bit convoluted to me: we're passing a semi-initialized factory to the ParquetProps and then passing the ParquetProps in turn to the factory.
The ValuesWriterFactoryParams is really just a struct with a bunch of values from the ParquetProps that we need in the ValuesWriterFactory to construct ValuesWriters. It seems cleaner to pass that rather than the approach above. Thoughts?

@rdblue
Contributor

rdblue commented Aug 2, 2016

@piyushnarang, #1 sounds fine. For #2, you're just recreating ParquetProperties with a different name and making the existing one a useless wrapper. I get that you pass the factory into the properties only for the properties to configure the factory; that's so that we can maintain backward compatibility. In the future, we would remove the factory methods from ParquetProperties so that isn't needed. I think this is still better.

@piyushnarang
Author

@rdblue sounds good, we can tackle decoupling the factory methods from ParquetProperties in the future. I've put out an update that addresses your comments. Do take a look when you get the time. Thanks!

if (factory instanceof Configurable) {
  Configurable configurableFactory = (Configurable) factory;
  configurableFactory.setConf(conf);
}
Member

We should not set the Hadoop configuration in a static member.
We could just create a new DefaultValuesWriterFactory every time instead.

Member

possibly create method ParquetProperties.getValuesWriterFactory(Configuration conf)

Member

Does this mean that configuration happens by subclassing ParquetOutputFormat?

Author

Hmm, I don't think creating a ParquetProperties.getValuesWriterFactory(Configuration conf) is possible, because ParquetProperties is in parquet-column (which doesn't depend on Hadoop, so we don't have the Configuration there), so we'll have to do this in ParquetOutputFormat.
Configuration happens by creating a ValuesWriterFactory that implements Configurable:

public class MyConfigurableValuesWriterFactory implements ValuesWriterFactory, Configurable {
...
}

Now when we create a new ValuesWriterFactory in getValuesWriterFactory() via getRecordWriter(...) we pass the config object to the factory there.

Author

I don't mind updating this to make the getValuesWriterFactory method non-static.
Currently DEFAULT_VALUES_WRITER_FACTORY is a static declared in ParquetProperties, as Alex preferred that in one of the prior reviews, but we can revisit that if you feel strongly.

Member

  1. Since DefaultValuesWriterFactory does not implement Configurable, maybe we just remove this if statement?
  2. My question on configuration was how you decide which ValuesWriterFactory to use.

Author

@julienledem Yeah, so 1) that was one of the questions I posed to Ryan above (if you see the PR notes). Copying it again here:

  1. Currently the Configurable interface is present to allow folks to pass Hadoop config to the ValuesWriterFactory. It's not needed for the DefaultValuesWriterFactory but I was thinking of leaving it in so that the hooks are in place to easily pass config while testing out ad-hoc ValuesWriterFactories. Ryan felt that it would be OK to leave it in place.

  2. The original approach was to configure the ValuesWriterFactory to use via Hadoop config. Something like:

parquet.writer.factory-override = "org.apache.parquet.hadoop.MyValuesWriterFactory"

In ParquetOutputFormat we were creating MyValuesWriterFactory by reflection and using that to create new ValuesWriters for various columns.
@rdblue wasn't keen on this as ValuesWriter is supposed to be a private class internal to Parquet, so he didn't want us to be able to configure the ValuesWriterFactory. So we decided to yank the configuration part of it out and leave the basic plumbing in place. Right now, if you wrote your own custom ValuesWriterFactory that you wanted to test out, you'd have to update your Parquet code base to use that ValuesWriterFactory (instead of the DefaultValuesWriterFactory) in ParquetProperties / ParquetOutputFormat. This is easier than what I had before (as then the values writer creation code was not decoupled from ParquetProperties) but not as flexible as our PR proposal initially was (to be able to allow users to configure things).

Member

thanks @piyushnarang for bringing me up to speed :)

  1. We should remove that if statement since in the current PR it can never be true. It did make sense if you could configure a class name instantiated with newInstance() but right now it is just dead code.
  2. If we add a configuration in the future, it should be expressed in terms of encodings so that only valid Parquet files can be written. Once you are happy with your current experiment I think it will become clearer what the configuration would look like.

Author

Hmm ok, I can remove that piece of code. I'll add a comment in the ValuesWriterFactory interface to indicate that if someone wants to use Hadoop config in their factory they need to hook it up in ParquetOutputFormat, as it's not very obvious.

We've gone back and forth a bit on the desired approach.
My initial implementation expressed configuration in terms of encodings:

parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
"parquet.writer.encoding-override.int32" = "plain"

@isnotinvain suggested the reflection-based approach as it was more flexible - it allows manual experimentation as well as potentially automated encoding selection (see notes above).
Either this approach or the reflection-based override method works for us right now (I have a subclass of ValuesWriterFactory in a fork that reads the encoding for a given type from config).

Anyway, I think for now we can go ahead with what we have right now (not exposing any of these in config) after I remove the Configurable code. This will help us break up some of the coupling in ParquetProperties. We could discuss which of these approaches is more appealing to various groups of Parquet users, and I'd be happy to add a PR. Let me know if this sounds reasonable.

Member

@piyushnarang: merging the current approach sounds good to me.

Type- or column-based configuration can be added in a follow-up. I suspect that sometimes users might want to be able to force a specific encoding for a given column.

@piyushnarang
Author

@julienledem - updated the PR to remove the Configurable hook.

@isnotinvain
Contributor

@rdblue and @julienledem there's been a lot of back and forth on this PR, are there any remaining issues? Thanks!

@julienledem
Member

+1

@piyushnarang
Author

Thanks @julienledem. I can spin up a different thread to discuss how we want to configure type / column based overrides. Any preferences on how / where? Email (parquet-dev@), GitHub issue, or JIRA?

@julienledem
Member

@piyushnarang: JIRA is good (we don't use GitHub issues).

@julienledem
Member

@piyushnarang @isnotinvain @rdblue good to go?

@piyushnarang
Author

@julienledem Ok, I'll spin up a jira and cc you, Alex & Ryan.
Let me try out one quick test of this build on Hadoop to confirm. I'll get back in the next couple of hours.

@isnotinvain
Contributor

+1

@piyushnarang
Author

@julienledem Tested this out and I think it is good to go from my end.
I've already got a jira which captures the two approaches discussed on this PR for specifying overrides (pinged you folks on it) - https://issues.apache.org/jira/browse/PARQUET-601. Do chime in when you get a chance.

@isnotinvain
Contributor

I will merge this, but I'd like a +1 / +0 / -0 from @rdblue first

@piyushnarang
Author

Ping @rdblue - can you take a look?

@rdblue
Contributor

rdblue commented Aug 9, 2016

Will do, sorry I missed it when Alex pinged me earlier

@rdblue
Contributor

rdblue commented Aug 10, 2016

+1 Thanks, @piyushnarang!

@asfgit closed this in 30aa910 on Aug 11, 2016
@HansBrende
Member

@piyushnarang @isnotinvain @rdblue @hkothari
Hi. First of all, thank you for making this commit! It's very helpful for my use case, which is that I want to turn off dictionary encoding for individual columns in which I know values will not be repeated (or are seldom repeated), using more efficient encodings instead of the plain fallback encoding, while leaving dictionary encoding in place for columns where I know values will often be repeated.

However, it's rather hard to configure this correctly in the Hadoop ParquetOutputFormat class, as the ParquetProperties used to configure the ParquetRecordWriter is never directly accessible to me, so I cannot modify the ValuesWriterFactory.

Instead, I have to copy and paste the entire ParquetOutputFormat class into my own custom class in order to modify the ValuesWriterFactory used to create the ParquetRecordWriter. But oh, the ParquetRecordWriter constructor has package-private access, so I have to use reflection to create the ParquetRecordWriter instance.

So... would it be possible to slightly modify the ParquetOutputFormat class so that I can less painfully specify a different ValuesWriterFactory? Ideally, I'd like to be able to just subclass ParquetOutputFormat and override a method or two.
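For example, a hypothetical hook (not present in parquet-mr today) could reduce the whole thing to a small subclass; the names below are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.values.factory.ValuesWriterFactory;
import org.apache.parquet.hadoop.ParquetOutputFormat;

// Imagined extension point: if ParquetOutputFormat exposed a protected factory method,
// a subclass could swap in its own ValuesWriterFactory without copying the whole class.
public class PerColumnEncodingOutputFormat<T> extends ParquetOutputFormat<T> {
  // hypothetical hook, would be invoked from getRecordWriter(...)
  protected ValuesWriterFactory createValuesWriterFactory(Configuration conf) {
    return new NoDictionaryForUniqueColumnsFactory(conf); // user-supplied factory (hypothetical)
  }
}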

Thoughts? Should I turn this into a new JIRA issue? This seems to be related to an existing one as well: https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-796.

@HansBrende
Member

HansBrende commented Feb 25, 2018

(If PARQUET-796 is resolved, then that would also fix my use case, as then I wouldn't have to specify a different ValuesWriterFactory in the first place! I really like @rdblue 's suggestion to use dictionary encoding only if dictionary encoding is enabled AND the original type == OriginalType.ENUM. But I'd recommend taking it even one step further by ditching the parquet.enable.dictionary setting altogether and using dictionary encoding IF AND ONLY IF original type == OriginalType.ENUM (although in my use case, the often-repeated values won't be Java enums, but URLs). Also, when and how dictionary encoding is used should really be specified in the documentation -- it seemed rather vague on that point.)
