Optimize Hive Connector ORC Readers and Filter using Java Vector API#21670
AbhijitKulkarni1 wants to merge 12 commits into prestodb:master from vectorization-v1
Conversation
@AbhijitKulkarni1 this would be a very nice and exciting change. I am wondering if you need any help?
Suggest a change to the current release notes entry: perhaps this draft describes the effect of this PR?
@tdcmeehan Thanks for reaching out. We are in the process of an internal data review and will be able to submit the PR for review shortly. I will surely connect with you in case I need any help.
Two items:
@steveburnett Can you please let me know about the second point? Do you mean to add a one-line release note and move the details to another section?
Good question, @AbhijitKulkarni1! I suggest that each release note entry should be a summary of the work done in the PR. The summary version allows a reader of the new Release Notes to quickly identify if an entry is something they need more detail on. If not, they can move on to the next entry. You could examine the recent release notes, for example the 0.287 Release Notes, for examples. Each release note entry ends with a link to the PR it describes, so a reader who wants more detail can follow that link to the PR itself.

I suggest you move much of the content you have in your release note draft into the Description of your PR. Each individual item does not have to be only a single line: I attempted to summarize each of yours at that high level, but you should revise my suggested draft and include detail you feel must be present in the release notes. For PRs that include documentation changes, a link can be added to the release note entry that links directly to the new documentation. For example:

"* Add configuration property legacy_json_cast whose default value is true. See Legacy Compatible Properties. #21869"

I hope this helps to answer your question, and please ask anything else you would like to know.
Thanks @steveburnett for the guidance. I will add the property to the Hive configuration properties file hive.rst. For the release notes, can you please clarify whether they are picked up from the PR automatically during release preparation, or whether I need to add them somewhere? I didn't find the file release-0.288.rst. Kindly let me know.
Force-pushed e567301 to 0ecca0c
Of course! The release note entries are picked up from the PRs automatically during release preparation - if you have a correctly formatted block in your PR for your release note entries, that's all you need to do. If you're interested in how the release notes are created for a Presto release, you're welcome to take a look at #22647.
presto-orc/src/main/java/com/facebook/presto/orc/OrcSelectiveRecordReader.java
First off, awesome work. I'm sure this was difficult to get working properly. Upgrading Java versions can be tricky, especially for a large legacy system like Presto that has barely migrated off of 8.

One concern I have though is with the choice of Java version. Java 19 is not an LTS release, and support for this JVM version has already ended; it will not be receiving any further updates: https://javaalmanac.io/jdk/ It would be really great if this could target an LTS release. 21 is the latest LTS release and its EOL is in 2028.

I did a quick search and it seems there were some API changes between 19 and 21 in the Vector API. I'm not sure what was behind the decision to target Java 19, but it would be good to know the motivation. Not sure what others think about the Java version, but I think using an LTS release is desirable.
steveburnett left a comment
One formatting error that I fixed locally so I could review the content. The content is good, just fix the table entry and I'm happy to approve.
Thanks for the compliments @ZacBlanco. Ensuring Presto worked with the latest Java was indeed a tough task. We started with compatibility explorations and initially considered Java 17 LTS and 18 for vectorized filtering with the shuffle and rearrange Vector APIs, but chose Java 19 for the "compress" Vector API (https://openjdk.org/jeps/426), which does the shuffle and rearrange in one API call and performs better. I'm of the opinion that this solution should be compatible with Java 21 LTS; however, I must confirm its compatibility before moving forward with the submission.
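For readers unfamiliar with JEP 426's `compress`: it packs the lanes selected by a mask into contiguous low lanes in one operation, where the shuffle-plus-`rearrange` approach needs two steps. Below is a scalar model of that semantics in plain Java (no incubator module required; the class and method names are illustrative, not the PR's code):

```java
import java.util.Arrays;

public class CompressSemantics
{
    // Scalar model of mask-driven compress: values whose mask bit is set are
    // packed into the low lanes of the result; remaining lanes stay zero.
    public static double[] compress(double[] lanes, boolean[] mask)
    {
        double[] out = new double[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[next++] = lanes[i];
            }
        }
        return out;
    }

    public static void main(String[] args)
    {
        double[] lanes = {1.0, 2.0, 3.0, 4.0};
        boolean[] mask = {true, false, true, false};
        // Selected lanes 1.0 and 3.0 are packed to the front.
        System.out.println(Arrays.toString(compress(lanes, mask)));
    }
}
```

On the real Vector API path this corresponds to a single `Vector.compress(VectorMask)` call (added in JDK 19 under JEP 426).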
Force-pushed 0ecca0c to ee45576
steveburnett left a comment
LGTM! (doc)
Pull updated branch, new doc build, looks good!
Codenotify: Notifying subscribers in CODENOTIFY files for diff 1a8c27d...787ceb1. No notifications.
… vectorization-v1 # Conflicts: # presto-docs/src/main/sphinx/connector/hive.rst
Force-pushed fe30141 to f5f0a66
…resto into vectorization-v1 # Conflicts: # presto-docs/src/main/sphinx/connector/hive.rst
…l. Add the hive.orc-use-vector-filter configuration
zhenxiao left a comment
Nice work
Do we need to update the documentation to change the compilation page?
import static jdk.incubator.vector.VectorOperators.LT;
import static jdk.incubator.vector.VectorOperators.NE;

public class DoubleRangeVector
how about BigintRange and FloatRange, etc?
hi @AbhijitKulkarni1, nice work! So glad to see the latest Java version upgrade.
rschlussel left a comment
Thanks for this change!
Can you please clean up the commits and break them into two commits:
- changes required for java > 19 support
- Orc vector changes
I do feel a bit worried about how this change will affect development, and would love to hear thoughts about it. Will we need to set our Java version in IntelliJ to Java 19? Will it make it hard for people to avoid using features newer than Java 8 until we upgrade the required Java version for the whole project?
{
    private final double lower;
    private final double upper;
    protected final double lower;
probably better to keep these fields private and access via getters.
``hive.skip-empty-files``         Enable skipping empty files. Otherwise, it will produce an     ``false``
                                  error iterating through empty files.

``hive.orc-use-vector-filter``    Enable use of vector ORC readers in compilation and execution  ``false``
add a note that this requires jdk 19?
}

@Config("hive.orc-use-vector-filter")
@ConfigDescription("Experimental: enable vector path for orc filters")
note that this requires jdk 19. Also, would probably remove "Experimental". Notes like that tend to just stick around forever. Would be great to add validation that the jdk meets the minimum version and fail with invalid_configuration_property otherwise.
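A minimal sketch of the kind of validation suggested here, assuming a hypothetical helper class; the real check in Presto would throw a `PrestoException` with `INVALID_CONFIGURATION_PROPERTY` rather than a plain `IllegalArgumentException`:

```java
public class VectorFilterConfigCheck
{
    // Minimum JDK feature release providing the compress Vector API (JEP 426).
    private static final int MIN_JAVA_FEATURE = 19;

    public static boolean vectorFilterSupported()
    {
        // Runtime.version().feature() returns the JDK feature release, e.g. 21.
        return Runtime.version().feature() >= MIN_JAVA_FEATURE;
    }

    // Sketch: invoked when hive.orc-use-vector-filter is set to true.
    public static void validateVectorFilterEnabled()
    {
        if (!vectorFilterSupported()) {
            throw new IllegalArgumentException(
                    "hive.orc-use-vector-filter requires JDK " + MIN_JAVA_FEATURE
                            + " or newer, but running on " + Runtime.version());
        }
    }

    public static void main(String[] args)
    {
        System.out.println("vector filter supported: " + vectorFilterSupported());
    }
}
```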
false),
booleanProperty(
        ORC_USE_VECTOR_FILTER,
        "Experimental: enable vector path for orc filters",
same comment about noting the jdk requirement and probably removing "experimental". Would be great to add validation that the jdk meets the minimum version if you set this, and fail with INVALID_SESSION_PROPERTY otherwise.
        outputType.isPresent(),
        systemMemoryContext.newOrcLocalMemoryContext(SelectiveStreamReaders.class.getSimpleName()));
}
catch (ClassNotFoundException | NoSuchMethodException | InstantiationException | IllegalAccessException | InvocationTargetException e) {
we should fail loudly here.
Can you please help me understand? And by the way, the old flow is still available if the incubator classes are not available at runtime.
I mean that instead of falling back when the classes aren't available, it would be better to fail the query so people know they have misconfigured Presto.
private final boolean nonDeterministicFilter;
private final boolean nullsAllowed;
private final boolean outputRequired;
protected final TupleDomainFilter filter;
I don't love the idea of exposing all of these internal details to the whole package. I wonder if we can have a "FilterReader function" that gets passed into this class instead and that gets called in readWithFilter; when the vector reader is enabled it's the VectorFilterReader, and otherwise a DefaultFilterReader. Then we could pass any required fields as function arguments?
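A rough sketch of the suggested "FilterReader function" idea (all names and signatures are hypothetical, not the PR's actual classes):

```java
public class FilterReaderSketch
{
    // Hypothetical hook: the stream reader delegates filtering to this
    // function instead of exposing its internal fields package-wide.
    public interface FilterReader
    {
        // Returns the number of positions that pass the filter; survivors
        // are compacted into the front of the positions array.
        int readWithFilter(double[] values, int[] positions, int count);
    }

    // Scalar default: keep positions whose value lies in [lower, upper].
    public static FilterReader defaultRangeReader(double lower, double upper)
    {
        return (values, positions, count) -> {
            int out = 0;
            for (int i = 0; i < count; i++) {
                int position = positions[i];
                double value = values[position];
                if (value >= lower && value <= upper) {
                    positions[out++] = position;
                }
            }
            return out;
        };
    }

    public static void main(String[] args)
    {
        FilterReader reader = defaultRangeReader(2.0, 4.0);
        double[] values = {1.0, 2.5, 5.0, 3.0};
        int[] positions = {0, 1, 2, 3};
        int survivors = reader.readWithFilter(values, positions, positions.length);
        // positions[0..survivors) now hold the surviving positions
        System.out.println(survivors);
    }
}
```

A vectorized implementation could then be swapped in behind the same interface when the session property enables it, keeping the reader's fields private.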
import static jdk.incubator.vector.VectorOperators.NE;

public class DoubleRangeVector
        extends TupleDomainFilter.DoubleRange
what do we get out of extending DoubleRange? Seems like we don't use any of the DoubleRange functionality. We should add those fields directly to DoubleRangeVector instead and make it independent of DoubleRange.
DoubleRangeVector is an extension of its scalar equivalent; in the vector processing logic, the scalar tail of the execution still needs the DoubleRange functionality to filter in the scalar way.
        extends TupleDomainFilter.DoubleRange
        implements TupleDomainFilterVector
{
    protected DoubleRangeVector(double lower, boolean lowerUnbounded, boolean lowerExclusive, double upper, boolean upperUnbounded, boolean upperExclusive, boolean nullAllowed)
<profiles>
    <profile>
        <id>java21PrestoCommon</id>
Just curious: why are the build profiles named differently for every module, instead of all having a profile called java21 and javaLessThan21?
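For illustration, a uniformly named pair of profiles using Maven's JDK version-range activation might look like this (the profile ids are hypothetical; the `<jdk>` range syntax is standard Maven activation):

```xml
<profiles>
    <profile>
        <id>java21</id>
        <activation>
            <jdk>[21,)</jdk>
        </activation>
        <!-- dependencies and compiler settings for JDK 21+ builds -->
    </profile>
    <profile>
        <id>javaLessThan21</id>
        <activation>
            <jdk>(,21)</jdk>
        </activation>
        <!-- fallback dependencies for JDK 8/11 builds -->
    </profile>
</profiles>
```

With identical ids in every module, `mvn -P java21` would also behave consistently across the reactor.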
</activation>
<dependencies>
    <dependency>
        <groupId>org.mockito</groupId>
We don't use mocks. Please create stubbed out classes instead.
Could you kindly direct me to an example within Presto? That would be very helpful.
you can look at any classes named TestingXXX as an example. We create testing implementations of interfaces that follow the needs of a test. Can be simple or complex as needed. Here are a few examples, and you can see how they are used in tests:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/testing/TestingMetadata.java
https://github.com/prestodb/presto/blob/master/presto-spi/src/test/java/com/facebook/presto/spi/ttl/TestingNodeTtlFetcher.java
https://github.com/prestodb/presto/blob/master/presto-spi/src/test/java/com/facebook/presto/spi/TestingColumnHandle.java
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/testing/TestingAccessControlManager.java
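To make the TestingXxx style concrete, here is a minimal sketch (the interface and class names are hypothetical, not Presto code): a deterministic hand-written implementation replaces a mock.

```java
public class TestingStubExample
{
    // Hypothetical production interface a test needs to exercise.
    public interface NodeVersionProvider
    {
        String version();
    }

    // Testing implementation in the TestingXxx style: returns a fixed,
    // deterministic value instead of relying on a mocking framework.
    public static final class TestingNodeVersionProvider
            implements NodeVersionProvider
    {
        private final String version;

        public TestingNodeVersionProvider(String version)
        {
            this.version = version;
        }

        @Override
        public String version()
        {
            return version;
        }
    }

    public static void main(String[] args)
    {
        NodeVersionProvider provider = new TestingNodeVersionProvider("testversion");
        System.out.println(provider.version());
    }
}
```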
The updates are handled in such a way that they do not affect compilation and execution on lower Java versions. This is done with JDK 21-specific Maven profiles at compile time. For Hive connector execution, the Vector API code runs only if the flag 'hive.orc-use-vector-filter' is set to true and the Java runtime provides Vector API support. So these updates are backward compatible.
Do we need to update other test workflows besides .github/workflows/hive-tests.yml?
Description
Enable Presto to be compatible with JDK 21.0.3 LTS version. Related to #16268
Optimize the Hive connector to harness data parallelism in ORC readers and filters using the Java Vector API.
Motivation and Context
This PR enables Presto to be used with the latest features available in the JDK 21.0.3 LTS version, such as the Java Vector API and Foreign Function & Memory API updates.
Vectorization of ORC Readers:
The Presto ORC module used by the Hive connector is optimized with a vectorized reader to harness data parallelism using the Java Vector API. The Hive connector is updated with the property "hive.orc-use-vector-filter" so the optimized Presto ORC reader can read ORC-format data in a vectorized way.
The following optimizations are in scope for this PR to start with:
Class Design:

Presto Double Filter:
The DoubleRange class implements the TupleDomainFilter interface and provides the scalar version of filtering double values against a range in its testDouble() method.
The newly introduced DoubleRangeVector class implements the new TupleDomainFilterVector interface and provides the vectorized version of that range filtering in its testDoubleVector() method.
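As a scalar reference for the semantics that testDoubleVector() parallelizes, the sketch below assumes inclusive bounds (the real DoubleRange also supports exclusive and unbounded endpoints; names here are illustrative):

```java
public class DoubleRangeSemantics
{
    // Scalar reference semantics of a double range filter, as in testDouble():
    // one branch per value.
    public static boolean testDouble(double value, double lower, double upper)
    {
        return value >= lower && value <= upper;
    }

    // Batch form that a Vector API implementation parallelizes: one comparison
    // mask computed across all lanes instead of a branch per value.
    public static boolean[] testDoubleBatch(double[] values, double lower, double upper)
    {
        boolean[] mask = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            mask[i] = values[i] >= lower && values[i] <= upper;
        }
        return mask;
    }

    public static void main(String[] args)
    {
        boolean[] mask = testDoubleBatch(new double[]{0.5, 1.5, 2.5}, 1.0, 2.0);
        System.out.println(java.util.Arrays.toString(mask)); // prints [false, true, false]
    }
}
```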
Presto Double Reader:

The DoubleSelectiveStreamReader class is the scalar reading implementation, where double values are read one after another and testDouble() is called for filtering.
As shown in the figure above, the vectorization design extends DoubleSelectiveStreamReader with a new DoubleSelectiveStreamVectorReader class, which in turn delegates vectorized reading through a new DoubleVectorSelectiveReadAndFilter interface.
The implementation of this interface is the DoubleVectorSelectiveReadAndFilterSpecies512 class, which provides the vectorized bulk reading needed for vectorized filtering via the testDoubleVector() method.
The design is crafted with extensibility in mind: support for other AVX bit widths can be added simply by providing another implementation of the DoubleVectorSelectiveReadAndFilter interface.
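The extensibility point described above can be sketched as a small dispatch over per-width implementations (names are hypothetical, mirroring the described interface, not the PR's code):

```java
public class ReadAndFilterDispatch
{
    // Hypothetical mirror of the DoubleVectorSelectiveReadAndFilter idea:
    // one implementation per SIMD bit width (e.g. 256 for AVX2, 512 for AVX-512).
    public interface DoubleReadAndFilter
    {
        int vectorBitWidth();
    }

    // Pick the implementation matching the platform's preferred vector width.
    public static DoubleReadAndFilter select(int preferredBitWidth, DoubleReadAndFilter... candidates)
    {
        for (DoubleReadAndFilter candidate : candidates) {
            if (candidate.vectorBitWidth() == preferredBitWidth) {
                return candidate;
            }
        }
        throw new IllegalStateException("no read-and-filter implementation for vector width " + preferredBitWidth);
    }

    public static void main(String[] args)
    {
        DoubleReadAndFilter avx512 = () -> 512;
        DoubleReadAndFilter avx2 = () -> 256;
        System.out.println(select(512, avx2, avx512).vectorBitWidth()); // prints 512
    }
}
```

On a real Vector API path, the preferred width could come from the platform's preferred vector species rather than a constant.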
Note the following points about this optimization.
Updates are compatible with JDK 21.0.3, so use JDK 21.0.3 for compilation and execution of the Presto binary.
The JDK 21.0.3 compilation of the vector updates in the presto-common and presto-orc modules is handled using Maven profiles. The updates are backward compatible with compilation on lower JDK versions (JDK 8 and 11) and have no runtime impact when executing on a lower JRE.
Note the following points for compilation and launch of Presto using JDK 21.0.3:
i. Compilation:
ii. Runtime: in the Presto etc/jvm.config file, set the following JVM options
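The full option list is not reproduced here; as an assumption, code using the incubating Vector API needs the incubator module enabled at runtime, for example:

```
--add-modules=jdk.incubator.vector
```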
Hive Connector Changes
Adds the Hive property "hive.orc-use-vector-filter" to enable or disable the vector ORC readers in the Hive connector. If set to true in the hive.properties file (hive.orc-use-vector-filter=true), the new vectorized readers and filters are used for compilation and execution. The default is false, which keeps the current non-vectorized flow.
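For example, the catalog's hive.properties would contain (a sketch; the default is false):

```properties
hive.orc-use-vector-filter=true
```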
Impact
JDK 21-specific Maven profiles are added to update the versions of dependencies and plugins as required for Presto to be compatible with the JDK 21.0.3 LTS version. These updates do not impact the default compilation with JDK 8 or 11, which are officially supported.
There is no functional impact on the execution of Presto with JDK 8 or 11, which are officially supported.
When Presto runs on JDK 21.0.3, the improved Hive connector uses vectorized ORC readers and filters built on the Java Vector API for reading and filtering uncompressed ORC data.
The claimed throughput (query executions per unit time) improvements are as follows:
The above data points were observed on SF100 uncompressed TPC-H data in ORC format, executed with Presto on one socket of an m7i.metal-48xl AWS bare-metal machine.
The query used for evaluation is similar to the q01 and q06 filter queries of TPC-H. The query used to evaluate stress filtering of double data is:
Throughput is measured for 100 concurrent executions of the query.
Test Plan
Test all smoke test suites
Added unit tests to verify the vectorization updates and measured test coverage.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.