Skip to content

Optimize Hive Connector ORC Readers and Filter using Java Vector API#21670

Open
AbhijitKulkarni1 wants to merge 12 commits intoprestodb:masterfrom
intel-staging:vectorization-v1
Open

Optimize Hive Connector ORC Readers and Filter using Java Vector API#21670
AbhijitKulkarni1 wants to merge 12 commits intoprestodb:masterfrom
intel-staging:vectorization-v1

Conversation

@AbhijitKulkarni1
Copy link
Contributor

@AbhijitKulkarni1 AbhijitKulkarni1 commented Jan 10, 2024

Description

Enable Presto to be compatible with JDK 21.0.3 LTS version. Related to #16268
Optimize Hive connector to harness data parallelism in ORC Readers and Filters using Java Vector API

Motivation and Context

This PR enables Presto to be used with latest features available in JDK 21.0.3 LTS version like Java Vector API and Foreign API updates.

Vectorization of ORC Readers:

Presto Hive Connector for Presto ORC module is optimized with vectorized Reader to harness data parallelism using Java Vector API. Hive Connector is updated to provide hive property "orc-use-vector-filter" to use the optimized Presto ORC to read ORC format data in vectorized way.
Following are the scope of optimizations in this PR to start with:

  1. Optimized Double data type ORC readers to read and filter Uncompressed ORC data in vectorized way. Other data types will follow in future.
  2. Optimization is applicable for Uncompressed, Unencrypted ORC data read and filter. For Compressed and Encrypted data support, logic of processing data in OrcInputStream need to be simplified to decouple uncompressed and compressed data processing.
  3. Optimization is applicable for first level filter where we have contiguous positions available for filtering and is very appropriate for vectorization.

Class Design:
Presto Double Filter:
Filter_Class_Diagram

Class DoubleRange implements the interface TupleDomainFilter having the scalar version of filtering double values for certain range as part of testDouble() method.
DoubleRangeVector class is a newly introduced class for implementation of new interface TupleDomainFilterVector, providing vectorized implementation of filtering double values for certain range as part of testDoubleVector() method.

Presto Double Reader:
VectorizedDoubleReaderAndFIlter_ClassDiagram

DoubleSelectiveStreamReader class is the scalar reading implementation where double values are read one after another and testDouble() method is called for filtering.
As shown in above figure, the vectorization design extends the DoubleSelectiveStreamReader class with a new DoubleSelectiveStreamVectorReader class and further provides interfaces to write vectorized reading through a new DoubleVectorSelectiveReadAndFilter interface.
The implementation of this interface is the DoubleVectorSelectiveReadAndFilterSpecies512 class, which provides the vectorized reading for bulk data needed for vectorized filtering using testDoubleVector() method.

This design is crafted with the solution's extensibility in mind to provide implementations for other AVX bit width too, just by implementing the DoubleVectorSelectiveReadAndFilter interface.

Note the following points about this optimization.

  1. Updates are compatible with JDK 21.0.3 version. So use jdk 21.0.3 for compilation and execution of Presto binary.

  2. The JDK 21.0.3 compilation for vector updates in presto-common, presto-orc modules are handled using maven profiles. The updates are backward compatible with the lower version JDK compilation (JDK 8 and 11) and do not impact at runtime with lower JRE execution.

  3. Note the following points for compilation and launch of Presto using JDK 21.0.3

    i. Compilation:

export MAVEN_OPTS="--add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.text=ALL-UNNAMED --add-opens java.base/java.awt.font=ALL-UNNAMED --add-opens java.desktop/java.awt.font=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-modules=jdk.incubator.vector --enable-preview"
./mvnw clean install -DskipTests

ii. Runtime in Presto "etc\jvm.config" file set the following jvm options

--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
-Djdk.reflect.useDirectMethodHandle=false
--add-modules=jdk.incubator.vector
--enable-preview

Hive Connector Changes
Addition of hive property "orc-use-vector-filter" to enable/disable the use of vector ORC readers in Hive connector. If set to "true" in hive.properties file (hive.orc-use-vector-filter="true"), new vectorized readers and filters are used for compilation/execution. Default being "false" for current non vectorized flow.

Impact

JDK21 specific maven profiles are added to update the versions of dependencies/plugins as required for Presto to be compatible with JDK 21.0.3 LTS version. These updates do not impact the default compilation with JDK8 or 11 which is officially supported.

No functional impact for execution of Presto with JDK8 or 11 which are officially supported.
Presto execution with JDK 21.0.3, Improved Hive Connector for ORC Readers and Filters with vectorization using Java Vector API for Uncompressed ORC data reading and filtering.
Please find the claims of throughput (queries execution per unit time) improvement as below:

PrestoClaims2

  1. Throughput of Presto execution with JDK 21.0.3 LTS version is ~1.24X higher than the execution of Unmodified Presto with JDK 11.0.22
  2. Throughput of Vector Modified Presto execution with JDK 21.0.3 LTS version is ~1.38X (~1.24X from JDK 21.0.3 LTS version and ~1.12X from vectorization updates) higher than Unmodified Presto execution with JDK 11.0.22.
    Observed above data points on SF100 Uncompressed TPCH data in ORC format executed with Presto on one of the socket of m7i.metal-48xl aws bare metal machine.
    Query Used for Evaluation is similar to q01 and q06 filter queries of TPCH. Query used to evaluate the stress filtering of double data is:
SELECT COUNT(l.extendedprice) FROM hive.sf100.lineitem AS l WHERE  l.extendedprice < 20000;

Throughput is measured for 100 concurrent executions of the query.

Test Plan

Test all smoke test suites
Added unit tests to verify the vectorization updates and performed test coverage.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==
General Changes
* Update Presto to be compatible with JDK 21.0.3 LTS version :pr: `21760`

Hive Connector Changes
* Add configuration property  ``orc-use-vector-filter`` to enable or disable the use of vector ORC readers :pr: `21760`

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@tdcmeehan tdcmeehan self-assigned this Jan 11, 2024
@AbhijitKulkarni1 AbhijitKulkarni1 changed the base branch from master to release-0.283-edge3 January 11, 2024 19:00
@AbhijitKulkarni1 AbhijitKulkarni1 changed the base branch from release-0.283-edge3 to master January 11, 2024 19:00
@tdcmeehan
Copy link
Contributor

@AbhijitKulkarni1 this would be a very nice and exciting change. I am wondering if you need any help?

@steveburnett
Copy link
Contributor

Suggest a change to the current release notes entry: perhaps this draft describes the effect of this PR?

== RELEASE NOTES ==
General Changes
* Improve Presto compatibility with JDK 19.0.2. 

Hive Connector Changes
* Add configuration property  `orc-use-vector-filter` to enable or disable the use of vector ORC readers.

@AbhijitKulkarni1
Copy link
Contributor Author

@tdcmeehan Thanks for reaching out. We are in process of internal data review and would be able to submit PR for review shortly. Will surely connect with you in case I need any help.

@steveburnett
Copy link
Contributor

Two items:

  • Please add documentation for orc-use-vector-filter - probably in Hive Configuration Properties.

  • Suggest a change to the current release notes entry: perhaps this draft describes the effect of this PR?

== RELEASE NOTES ==
General Changes
* Improve Presto compatibility with JDK 19.0.2 :pr: `21760`

Hive Connector Changes
* Add configuration property  ``orc-use-vector-filter`` to enable or disable the use of vector ORC readers :pr: `21760`

@AbhijitKulkarni1
Copy link
Contributor Author

@steveburnett Can you please let me know about the second point? Do you mean to add one liner release notes and move the details in other section?

  • Suggest a change to the current release notes entry: perhaps this draft describes the effect of this PR?

@steveburnett
Copy link
Contributor

Good question, @AbhijitKulkarni1!

I suggest that each release note entry should be a summary of the work done in the PR. The summary version allows a reader of the new Release Notes to quickly identify if an entry is something they need more detail on. If not, they can move on to the next entry. You could examine the recent release notes, for example the 0.287 Release Notes, for examples.

Because each release note entry ends with a link to the PR it describes - for example, if the reader wants to know what Improve Presto compatibility with JDK 19.0.2 means - they can select the link to the PR and find that explanation in the Description part of your PR.

I suggest you move much of the content you have in your release note draft into the Description of your PR.

Each individual item does not have to be only a single line: I attempted to summarize each of yours at that high level, but you should revise my suggested draft and include detail you feel must be present in the release notes.

For PRs that include documentation changes, a link can be added to the release note entry that links directly to the new documentation.
See Formatting in the Release Notes Guidelines for how to code such a link.
For a specific example, see the 0.287 Release Notes and the release note entry:

"* Add configuration property legacy_json_cast whose default value is true. See Legacy Compatible Properties. #21869"

I hope this helps to answer your question, and please ask anything else you would like to know.

@AbhijitKulkarni1
Copy link
Contributor Author

Thanks @steveburnett for the guidance. I will add the property to Hive Configurations property file hive.rst. For release notes can you please clarify if they are picked up from PR automatically during release preparation or I need to add it somewhere because I didn't find the file release-0.288.rst, Kindly let me know.

@steveburnett
Copy link
Contributor

Of course! The release note entries are picked up from the PRs automatically during release preparation - if you have a correctly formatted block in your PR for your release note entries, that's all you need to do.
The automatically collected release note entries are reviewed and revised to create the .rst file.
My suggestions hope to reduce the work of the reviewers preparing the next release notes, which can help make the Presto release process easier and maybe even faster as well.

If you're interested in how the release notes are created for a Presto release, you're welcome to take a look at #22647.

@AbhijitKulkarni1 AbhijitKulkarni1 changed the title [WIP] Optimize Hive Connector ORC Readers and Filter using Java Vector API Optimize Hive Connector ORC Readers and Filter using Java Vector API Jun 12, 2024
@AbhijitKulkarni1 AbhijitKulkarni1 marked this pull request as ready for review June 12, 2024 15:31
@wanglinsong wanglinsong self-requested a review June 12, 2024 16:41
@ZacBlanco
Copy link
Contributor

ZacBlanco commented Jun 12, 2024

First off, awesome work. I'm sure this was difficult to get working properly. Upgrading Java versions can be tricky, especially for a large legacy system like Presto that has barely migrated off of 8. One concern I have though is with the choice of Java version. Java 19 is not an LTS release and support for this JVM version has already ended. It will not be receiving any further updates: https://javaalmanac.io/jdk/

It would be really great if this could target an LTS release. 21 is the latest LTS release and EOL is in 2028. I did a quick search and it seems there were some API changes between 19 and 21 in the Vector API. I'm not sure what was behind the decision to target Java 19, but it would be good to know the motivation. Not sure what others think about the Java version, but I think using LTS is desirable.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One formatting error that I fixed locally so I could review the content. The content is good, just fix the table entry and I'm happy to approve.

@AbhijitKulkarni1
Copy link
Contributor Author

AbhijitKulkarni1 commented Jun 13, 2024

First off, awesome work. I'm sure this was difficult to get working properly. Upgrading Java versions can be tricky, especially for a large legacy system like Presto that has barely migrated off of 8. One concern I have though is with the choice of Java version. Java 19 is not an LTS release and support for this JVM version has already ended. It will not be receiving any further updates: https://javaalmanac.io/jdk/

It would be really great if this could target an LTS release. 21 is the latest LTS release and EOL is in 2028. I did a quick search and it seems there were some API changes between 19 and 21 in the Vector API. I'm not sure what was behind the decision to target Java 19, but it would be good to know the motivation. Not sure what others think about the Java version, but I think using LTS is desirable.

Thanks for the compliments @ZacBlanco . Ensuring Presto worked with the latest Java was indeed a tough task. We started with compatibility explorations and considered Java 17 LTS, 18 for vectorized filtering with shuffle and rearrange Vector APIs to start with but chose Java 19 for "compress" Vector API (https://openjdk.org/jeps/426), which does shuffle and rearrange in one API call and performs better. I'm of the opinion that this solution should be compatible with Java 21 LTS; however, I must confirm its compatibility before moving forward with the submission.

steveburnett
steveburnett previously approved these changes Jun 20, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (doc)

Pull updated branch, new doc build, looks good!

@github-actions
Copy link

github-actions bot commented Oct 1, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 1a8c27d...787ceb1.

No notifications.

… vectorization-v1

# Conflicts:
#	presto-docs/src/main/sphinx/connector/hive.rst
… vectorization-v1

# Conflicts:
#	presto-docs/src/main/sphinx/connector/hive.rst
zhenxiao
zhenxiao previously approved these changes Oct 1, 2024
Copy link
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work
do we need to update documentation to change the compilation page?

import static jdk.incubator.vector.VectorOperators.LT;
import static jdk.incubator.vector.VectorOperators.NE;

public class DoubleRangeVector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about BigintRange and FloatRange, etc?

@zhenxiao
Copy link
Collaborator

zhenxiao commented Oct 1, 2024

hi @AbhijitKulkarni1 nice work! So glad to see latest java version upgrade
could you please make the PR on top of current master, and break it into separate commits, one commit for java upgrade, another one for vector api. And also make corresponding changes in the compilation documentation?
CC @beinan @ChunxuTang

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change!

Can you please clean up the commits and break it into 2 commits

  1. changes required for java > 19 support
  2. Orc vector changes

I do feel a bit worried about how this change will affect development, and would love to hear thoughts about it. Will we need to set our java version in intellij to java 19? will it make it hard for people to avoid using features after java8 until we upgrade our required java version for the whole project?

{
private final double lower;
private final double upper;
protected final double lower;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to keep these fields private and access via getters.

``hive.skip-empty-files`` Enable skipping empty files. Otherwise, it will produce an ``false``
error iterating through empty files.

``hive.orc-use-vector-filter`` Enable use of vector ORC readers in compilation and execution ``false``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a note that this requires jdk 19?

}

@Config("hive.orc-use-vector-filter")
@ConfigDescription("Experimental: enable vector path for orc filters")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this requires jdk 19. Also, would probably remove "Experimental". Notes like that tend to just stick around forever. Would be great to add validation that the jdk meets the minimum version and fail with invalid_configuration_property otherwise.

false),
booleanProperty(
ORC_USE_VECTOR_FILTER,
"Experimental: enable vector path for orc filters",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about noting the jdk requirement and probably removing "experimental". Would be great to add validation that the jdk meets the minimum version if you set this, and fail with INVALID_SESSION_PROPERTY otherwise.

outputType.isPresent(),
systemMemoryContext.newOrcLocalMemoryContext(SelectiveStreamReaders.class.getSimpleName()));
}
catch (ClassNotFoundException | NoSuchMethodException | InstantiationException | IllegalAccessException | InvocationTargetException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should fail loudly here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please help me know ? And BTW there is old flow available if the incubator classes are not available during runtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean to say instead of falling back if the classes aren't available, it would be better to fail the query so people know they have misconfigured presto.

private final boolean nonDeterministicFilter;
private final boolean nullsAllowed;
private final boolean outputRequired;
protected final TupleDomainFilter filter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the idea of exposing all of these internal details to the whole package. I wonder if we can have a "FilterReader function" that gets passed into this class instead and that gets called in readWithFilter, and when the vector reader ins enable it's the VectorFilterReader and a DefaultFilterReader. then we could pass any required fields as function arguments?

import static jdk.incubator.vector.VectorOperators.NE;

public class DoubleRangeVector
extends TupleDomainFilter.DoubleRange
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we get out of extending DoubleRange? Seems like we don't use any of the DoubleRange functionality. We should add those fields directly to DoubleRangeVector instead and make it independent of DoubleRange.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DoubleRangeVector is an extention of the scalar equivalent and in the vector processing logic for handling the scalar tail part of execution we need the DoubleRange functionality to filter in scalar way.

extends TupleDomainFilter.DoubleRange
implements TupleDomainFilterVector
{
protected DoubleRangeVector(double lower, boolean lowerUnbounded, boolean lowerExclusive, double upper, boolean upperUnbounded, boolean upperExclusive, boolean nullAllowed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be private?


<profiles>
<profile>
<id>java21PrestoCommon</id>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious why the build profiles are named differently for every module vs. all having a profile called java21 and javaLessThan21?

</activation>
<dependencies>
<dependency>
<groupId>org.mockito</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use mocks. Please create stubbed out classes instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you kindly direct me to an example within Presto? That would be very helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AbhijitKulkarni1
Copy link
Contributor Author

Thanks for this change!

Can you please clean up the commits and break it into 2 commits

  1. changes required for java > 19 support
  2. Orc vector changes

I do feel a bit worried about how this change will affect development, and would love to hear thoughts about it. Will we need to set our java version in intellij to java 19? will it make it hard for people to avoid using features after java8 until we upgrade our required java version for the whole project?

The updates are handled such a way that it does not affect the compilation and execution for lower java version. This is done with specific maven profiles for JDK 21 during compile time. Also for Hive connector execution, the Vector API code is executed only if the flag 'hive.orc-use-vector-filter' is set to true and we have the Vector API support from Java runtime. So these updates do support backward compatibility.

@wanglinsong
Copy link
Member

Do we need to update other test workflows besides .github/workflows/hive-tests.yml?
How do we make sure other core functionalities work correctly with jdk21?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants