Skip to content

Add ZSTD support for writing ORC and DWRF tables#14113

Merged
mbasmanova merged 1 commit intoprestodb:masterfrom
jainxrohit:rj_zstdjni_write
Mar 14, 2020
Merged

Add ZSTD support for writing ORC and DWRF tables#14113
mbasmanova merged 1 commit intoprestodb:masterfrom
jainxrohit:rj_zstdjni_write

Conversation

@jainxrohit
Copy link
Contributor

@jainxrohit jainxrohit commented Feb 18, 2020

Add ZSTD support for writing ORC and DWRF tables
To enable ZSTD compression, use session property hive.compression_codec='ZSTD'.

== RELEASE NOTES ==

Hive Changes

  • Add ZSTD support for writing ORC and DWRF tables. To enable ZSTD compression, use session property hive.compression_codec='ZSTD'.

@jainxrohit jainxrohit changed the title Add ZSTD support for Hive Compression codec [WIP] Add ZSTD support for Hive Compression codec Feb 26, 2020
@mbasmanova
Copy link
Contributor

It would be very helpful to have an option of writing ZSTD-compressed ORC data. To move this forward, I suggest add fail-fast logic for queries that attempt to write ZSTD compressed Parquet files by introducing a check in HiveWritableTableHandle's constructor:

        if (!compressionCodec.isSupportedStorageFormat(tableStorageFormat)) {
            throw new PrestoException(GENERIC_USER_ERROR, String.format("%s compression is not supported with %s", compressionCodec.name(), tableStorageFormat.name()));
        }
        if (!compressionCodec.isSupportedStorageFormat(partitionStorageFormat)) {
            throw new PrestoException(GENERIC_USER_ERROR, String.format("%s compression is not supported with %s", compressionCodec.name(), partitionStorageFormat.name()));
        }

This check is supported by a new method in HiveCompressionCodec enum:

public enum HiveCompressionCodec
{
    NONE(null, CompressionKind.NONE, CompressionCodecName.UNCOMPRESSED, f -> true),
    SNAPPY(SnappyCodec.class, CompressionKind.SNAPPY, CompressionCodecName.SNAPPY, f -> true),
    GZIP(GzipCodec.class, CompressionKind.ZLIB, CompressionCodecName.GZIP, f -> true),
    ZSTD(null, CompressionKind.ZSTD, null, f -> f != HiveStorageFormat.PARQUET);

    private final Optional<Class<? extends CompressionCodec>> codec;
    private final CompressionKind orcCompressionKind;
    private final CompressionCodecName parquetCompressionCodec;
    private final Predicate<HiveStorageFormat> supportedStorageFormats;

    HiveCompressionCodec(
            Class<? extends CompressionCodec> codec,
            CompressionKind orcCompressionKind,
            CompressionCodecName parquetCompressionCodec,
            Predicate<HiveStorageFormat> supportedStorageFormats)
    {
        this.codec = Optional.ofNullable(codec);
        this.orcCompressionKind = requireNonNull(orcCompressionKind, "orcCompressionKind is null");
        this.parquetCompressionCodec = parquetCompressionCodec;
        this.supportedStorageFormats = requireNonNull(supportedStorageFormats, "supportedStorageFormats is null");
    }

    public Optional<Class<? extends CompressionCodec>> getCodec()
    {
        return codec;
    }

    public CompressionKind getOrcCompressionKind()
    {
        return orcCompressionKind;
    }

    public CompressionCodecName getParquetCompressionCodec()
    {
        return parquetCompressionCodec;
    }

    public boolean isSupportedStorageFormat(HiveStorageFormat storageFormat)
    {
        return supportedStorageFormats.test(storageFormat);
    }
}

TestHivePageSink would use the same check: if (codec == NONE || !codec.isSupportedStorageFormat(format)) {

Finally, I'd add ZSTD to OrcTester to make sure we get good test coverage.

@highker @wenleix @arhimondr @jainxrohit @bhhari Thoughts?

@jainxrohit jainxrohit closed this Mar 11, 2020
@jainxrohit jainxrohit reopened this Mar 11, 2020
@jainxrohit jainxrohit requested a review from mbasmanova March 11, 2020 23:15
@jainxrohit jainxrohit changed the title [WIP] Add ZSTD support for Hive Compression codec [WIP] Add ZSTD support for write Mar 11, 2020
@mbasmanova mbasmanova self-assigned this Mar 11, 2020
@jainxrohit jainxrohit changed the title [WIP] Add ZSTD support for write Add ZSTD support for write Mar 11, 2020
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova
Copy link
Contributor

@jainxrohit There is a typo in commit message: prism.compression_codec -> hive.compression_codec . Same for PR description. Also, it would be nice to add a release note.

@jainxrohit
Copy link
Contributor Author

@jainxrohit There is a typo in commit message: prism.compression_codec -> hive.compression_codec . Same for PR description. Also, it would be nice to add a release note.

@mbasmanova I will have to check this. In my local testing hive.compression_codec does not work, only prism.compression_codec works.

@zhenxiao
Copy link
Collaborator

looks good to me. We will add ZSTD for Parquet later :)

@jainxrohit
Copy link
Contributor Author

@jainxrohit There is a typo in commit message: prism.compression_codec -> hive.compression_codec . Same for PR description. Also, it would be nice to add a release note.

@mbasmanova I will have to check this. In my local testing hive.compression_codec does not work, only prism.compression_codec works.

I have fixed it.

@mbasmanova
Copy link
Contributor

@jainxrohit Commit message contains some duplicate sentences. Perhaps, update to something like:

Add ZSTD support for writing ORC and DWRF tables

To enable ZSTD compression, use session property hive.compression_codec='ZSTD'.

To enable ZSTD compression, use session property hive.compression_codec='ZSTD'.
@jainxrohit jainxrohit changed the title Add ZSTD support for write Add ZSTD support for writing ORC and DWRF tables Mar 12, 2020
NONE(null, CompressionKind.NONE, CompressionCodecName.UNCOMPRESSED),
SNAPPY(SnappyCodec.class, CompressionKind.SNAPPY, CompressionCodecName.SNAPPY),
GZIP(GzipCodec.class, CompressionKind.ZLIB, CompressionCodecName.GZIP);
NONE(null, CompressionKind.NONE, CompressionCodecName.UNCOMPRESSED, f -> true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about format -> true ? ditto for other lambdas.

@mbasmanova
Copy link
Contributor

@wenleix Wenlei, do you have any further comments? Otherwise, I'll merge.

@mbasmanova mbasmanova merged commit 5d1797c into prestodb:master Mar 14, 2020
@caithagoras caithagoras mentioned this pull request Mar 29, 2020
9 tasks
@caithagoras caithagoras mentioned this pull request May 4, 2020
34 tasks
@jainxrohit jainxrohit deleted the rj_zstdjni_write branch May 7, 2020 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants