[native] Add support for ORC reader #23037

wypb · 2024-06-20T08:26:56Z

Description

We have recently merged the PR for reading ORC statistics and implementing OrcReader based on DwrfReader on the Velox side. Now it is time to add support for the ORC reader in Prestissimo.

wypb · 2024-06-25T11:04:30Z

Hi @majetideepak @aditi-pandit could you please help review this PR? Thanks!

majetideepak · 2024-06-25T14:23:17Z

@wypb can you add some end-to-end tests? Thanks!

aditi-pandit · 2024-06-25T16:53:38Z

@wypb : Would be great to use ORC with the QueryRunners (https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java) in an e2e test. The test should highlight differences of ORC wrt Parquet, demonstrate filter pushdown as well. Using ORC with Hive and as a format with Iceberg is perfect.

wypb · 2024-06-26T11:55:55Z

Hi @majetideepak @aditi-pandit, I added TPCH tests for ORC, including the Iceberg data source. The TPCDS tests for ORC are not added because Velox's ORC reader does not yet implement the fast path for some types, which causes exceptions when reading the data.

Caused by: java.lang.RuntimeException: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:777)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:745)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 30 more
Caused by: VeloxRuntimeError: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
	at Unknown.# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
	at Unknown.# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
	at Unknown.# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
	at Unknown.# 5  _ZN8facebook5velox4dwio6common21SelectiveColumnReader7addNullIiEEvv(Unknown Source)
	at Unknown.# 6  _ZN8facebook5velox4dwio6common15ExtractToReader7addNullIiEEvi(Unknown Source)
	at Unknown.# 7  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE7addNullEv(Unknown Source)
	at Unknown.# 8  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE19filterPassedForNullEv(Unknown Source)
	at Unknown.# 9  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE11processNullERb(Unknown Source)
	at Unknown.# 10 _ZN8facebook5velox4dwrf12RleDecoderV2ILb0EE15readWithVisitorILb1ENS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS6_15ExtractToReaderELb1EEEEEvPKmT0_(Unknown Source)
	at Unknown.# 11 _ZN8facebook5velox4dwio6common21SelectiveColumnReader17decodeWithVisitorINS0_4dwrf12RleDecoderV2ILb0EEENS2_29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EEEEEvPNS2_10IntDecoderIXsrT_9kIsSignedEEERT0_(Unknown Source)
	at Unknown.# 12 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader15readWithVisitorINS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS5_15ExtractToReaderELb1EEEEEvN5folly5RangeIPKiEET_(Unknown Source)
	at Unknown.# 13 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader10readHelperINS0_6common10AlwaysTrueELb1ENS0_4dwio6common15ExtractToReaderEEEvPNS4_6FilterEN5folly5RangeIPKiEET1_(Unknown Source)
	at Unknown.# 14 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader13processFilterILb1ENS0_4dwio6common15ExtractToReaderEEEvPNS0_6common6FilterEN5folly5RangeIPKiEET0_(Unknown Source)
	at Unknown.# 15 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader4readEiN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 16 _ZN8facebook5velox4dwio6common12ColumnLoader12loadInternalEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 17 _ZN8facebook5velox12VectorLoader4loadEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 18 _ZN8facebook5velox12VectorLoader12loadInternalERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 19 _ZN8facebook5velox12VectorLoader4loadERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 20 _ZNK8facebook5velox10LazyVector18loadVectorInternalEv(Unknown Source)
	at Unknown.# 21 _ZNK8facebook5velox10LazyVector18loadedVectorSharedEv(Unknown Source)
	at Unknown.# 22 _ZNK8facebook5velox10LazyVector12loadedVectorEv(Unknown Source)
	at Unknown.# 23 _ZN8facebook5velox10serializer6presto17PrestoVectorSerde22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 24 _ZN8facebook5velox17VectorStreamGroup22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 25 _ZN8facebook5velox4exec17PartitionedOutput16estimateRowSizesEv(Unknown Source)
	at Unknown.# 26 _ZN8facebook5velox4exec17PartitionedOutput8addInputESt10shared_ptrINS0_9RowVectorEE(Unknown Source)
	at Unknown.# 27 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
	at Unknown.# 28 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
	at Unknown.# 29 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
	at Unknown.# 30 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 31 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 32 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
	at Unknown.# 33 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
	at Unknown.# 34 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
	at Unknown.# 35 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
	at Unknown.# 36 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
	at Unknown.# 37 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
	at Unknown.# 38 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 39 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 40 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
	at Unknown.# 41 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
	at Unknown.# 42 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
	at Unknown.# 43 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
	at Unknown.# 44 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
	at Unknown.# 45 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
	at Unknown.# 46 0x00000000000c2b23(Unknown Source)
	at Unknown.# 47 start_thread(Unknown Source)
	at Unknown.# 48 clone(Unknown Source)

aditi-pandit · 2024-06-28T06:17:05Z

@wypb : Your code looks fine. When I search for ORC in the presto-native-execution directory I also see the following usage.

https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestWriter.java#L71 needs a fix as well

Could you please check it?

wypb · 2024-06-28T09:34:03Z

Good catch, thank you @aditi-pandit. I've fixed it.

wypb · 2024-06-28T09:39:49Z

@aditi-pandit I looked at the code again and found that this should not be removed. testCreateTableWithUnsupportedFormats is used to test the Velox ORC writer, and Velox currently does not support ORC writing.

aditi-pandit · 2024-07-02T00:54:33Z

...execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeTpcdsQueries.java

        if (!queryRunner.tableExists(session, "call_center")) {
            switch (storageFormat) {
                case "PARQUET":
+                case "ORC":


As per https://orc.apache.org/docs/types.html ORC supports the DATE type. The DWRF reader doesn't support DATE as a first-class type, so we coerced all those columns to VARCHAR in tests. Do you have a plan for those?

DWRF does not support the DATE type, but Velox reads ORC's DATE type using SelectiveIntegerDirectColumnReader. My test shows that DATE data can be read correctly, so I don't think it is necessary to convert the DATE type to VARCHAR.

I recently added a parameter to createAllTables to not do the DATE -> VARCHAR casting for TPCH tables https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/NativeQueryRunnerUtils.java#L69. You can use it in your tests.

aditi-pandit · 2024-07-02T01:06:31Z

Hi @majetideepak @aditi-pandit, I added TPCH tests for ORC, including the Iceberg data source. The TPCDS tests for ORC are not added because Velox's ORC reader does not yet implement the fast path for some types, which causes exceptions when reading the data.

@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles TPC-H tables. That seems odd. Did you look deeper into what TPC-H is doing differently? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But I wonder if there is anything else. It would be great to see which particular column here is problematic.

tdcmeehan · 2024-08-23T19:57:50Z

@wypb is this PR still being worked on?

wypb · 2024-08-26T12:09:39Z

Hi @tdcmeehan sorry for the late reply.

Yes, I'm still keeping an eye on this. I've been working on a few Velox PRs lately, so I haven't had time to work on this yet. I'll update this PR later this week.

tdcmeehan · 2024-08-30T16:37:24Z

Let's add ORC as a supported file format in Supported Use Cases (we can also mention that Parquet is a supported format).

aditi-pandit

Thanks @wypb. Have a question:

As per https://orc.apache.org/docs/types.html ORC supports the DATE type. The DWRF reader doesn't support DATE as a first-class type, so we coerced all those columns to VARCHAR in tests. Do you have a plan for those?

aditi-pandit · 2024-09-03T05:32:09Z

.../java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergTpchQueriesOrcUsingThrift.java

+    @Override
+    protected ExpectedQueryRunner createExpectedQueryRunner() throws Exception
+    {
+        this.storageFormat = "ORC";


Since this is a member variable and the same value is used in all methods, you can initialize it once at the class level, outside the methods, and use this.storageFormat in each place.

Already refactored, thank you.

steveburnett · 2024-09-03T14:23:15Z

Should this be mentioned in the doc, maybe in Supported Use Cases or Presto C++ Features?

wypb · 2024-09-04T03:04:20Z

Hi @tdcmeehan, @aditi-pandit sorry for the late reply.

@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles TPC-H tables. That seems odd. Did you look deeper into what TPC-H is doing differently? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But I wonder if there is anything else. It would be great to see which particular column here is problematic.

I was also curious about this, but had not looked into the reason before. Today I checked why most of the TPCDS queries fail while all the TPCH queries pass. I debugged the code and found two things: the integer fields of the TPCDS tables (such as the cs_sold_date_sk field of the catalog_sales table) may be NULL, and Velox does not implement the fast-path logic for integer fields encoded as RLEv2 in ORC. Together, these cause most of the TPCDS queries to fail. The TPCH table fields are never NULL, so this exception is not triggered.

For related code, see SelectiveColumnReader::prepareNulls
https://github.com/facebookincubator/velox/blob/main/velox/dwio/common/SelectiveColumnReader.cpp#L103-L129

void SelectiveColumnReader::prepareNulls(
    RowSet rows,
    bool hasNulls,
    int32_t extraRows) {
  if (!hasNulls) {
    anyNulls_ = false;
    return;
  }
  initReturnReaderNulls(rows);
  if (returnReaderNulls_) {
    // No need for null flags if fast path.
    return;
  }
  auto numRows = rows.size() + extraRows;
  if (resultNulls_ && resultNulls_->unique() &&
      resultNulls_->capacity() >= bits::nbytes(numRows) + simd::kPadding) {
    resultNulls_->setSize(bits::nbytes(numRows));
  } else {
    resultNulls_ = AlignedBuffer::allocate<bool>(
        numRows + (simd::kPadding * 8), &memoryPool_);
    rawResultNulls_ = resultNulls_->asMutable<uint64_t>();
  }
  anyNulls_ = false;
  // Clear whole capacity because future uses could hit uncleared data between
  // capacity() and 'numBytes'.
  simd::memset(rawResultNulls_, bits::kNotNullByte, resultNulls_->capacity());
}

For the TPCH tables, hasNulls is false, so rawResultNulls_ does not need to be initialized and SelectiveColumnReader::addNull() is never called later. That method contains a VELOX_DCHECK(rawResultNulls_ && rawValues_), which is the check that fails when querying the TPCDS tables. For the TPCDS tables, hasNulls is true, so SelectiveColumnReader::initReturnReaderNulls is executed, computes returnReaderNulls_ = true, and SelectiveColumnReader::prepareNulls returns early without initializing rawResultNulls_. The subsequent call to SelectiveColumnReader::addNull() then raises the exception.

void SelectiveColumnReader::initReturnReaderNulls(RowSet rows) {
  if (useBulkPath() && !scanSpec_->hasFilter()) {
    anyNulls_ = nullsInReadRange_ != nullptr;
    bool isDense = rows.back() == rows.size() - 1;
    returnReaderNulls_ = anyNulls_ && isDense;
  } else {
    returnReaderNulls_ = false;
  }
}

If we modify the implementation of SelectiveIntegerDirectColumnReader#hasBulkPath() to the following logic, the TPCDS query will also succeed.

  bool hasBulkPath() const override {
    // No bulk path for ORC integer columns encoded with RLEv2.
    return !(format == DwrfFormat::kOrc && version == RleVersion_2);
  }
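
The failure mode described in this comment can be reproduced in a toy model. The sketch below is not Velox code: ReaderSketch, FormatSketch, RleVersionSketch, hasBulkPathProposed and canAddNull are hypothetical names, and the null-buffer handling is heavily simplified. It assumes, as the stack trace suggests, that the decoder records nulls row by row even when the reader has reported a bulk path.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical names throughout; this is a toy model, not Velox code.
enum class FormatSketch { kDwrf, kOrc };
enum class RleVersionSketch { kV1, kV2 };

// Mirrors the proposal above: report no bulk path for ORC + RLEv2.
bool hasBulkPathProposed(FormatSketch format, RleVersionSketch version) {
  return !(format == FormatSketch::kOrc && version == RleVersionSketch::kV2);
}

class ReaderSketch {
 public:
  ReaderSketch(bool bulkPath, bool hasFilter)
      : bulkPath_(bulkPath), hasFilter_(hasFilter) {}

  // Simplified prepareNulls: on the fast path no result null buffer is set up.
  void prepareNulls(bool hasNulls, bool isDense, int numRows) {
    assert(numRows > 0);
    if (!hasNulls) {
      return;
    }
    returnReaderNulls_ = bulkPath_ && !hasFilter_ && isDense;
    if (returnReaderNulls_) {
      return;  // Fast path: the caller is expected to reuse the reader's nulls.
    }
    resultNulls_.assign((numRows + 63) / 64, ~0ULL);  // All bits set = not null.
    rawResultNulls_ = resultNulls_.data();
  }

  // Row-by-row null handling, as a decoder without a bulk implementation does.
  void addNull(int row) {
    if (rawResultNulls_ == nullptr) {
      throw std::runtime_error("rawResultNulls_ is not initialized");
    }
    rawResultNulls_[row / 64] &= ~(1ULL << (row % 64));
  }

 private:
  bool bulkPath_;
  bool hasFilter_;
  bool returnReaderNulls_ = false;
  std::vector<uint64_t> resultNulls_;
  uint64_t* rawResultNulls_ = nullptr;
};

// Returns true if a nullable, dense, unfiltered read can record a null.
bool canAddNull(bool bulkPath) {
  ReaderSketch reader(bulkPath, /*hasFilter=*/false);
  reader.prepareNulls(/*hasNulls=*/true, /*isDense=*/true, /*numRows=*/8);
  try {
    reader.addNull(3);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

In this model, canAddNull(true) hits the uninitialized buffer, while canAddNull(hasBulkPathProposed(FormatSketch::kOrc, RleVersionSketch::kV2)) allocates the buffer first and succeeds. That is the effect the proposed change aims for.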

wypb · 2024-09-04T03:08:36Z

As per https://orc.apache.org/docs/types.html ORC supports the DATE type. The DWRF reader doesn't support DATE as a first-class type, so we coerced all those columns to VARCHAR in tests. Do you have a plan for those?

DWRF does not support the DATE type, but Velox reads ORC's DATE type using SelectiveIntegerDirectColumnReader. My test shows that DATE data can be read correctly, so I don't think it is necessary to convert the DATE type to VARCHAR.
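
As background for why an integer column reader can decode DATE: the ORC format stores a DATE value as a signed count of days since 1970-01-01. The sketch below shows how such a day count maps to a calendar date. CivilDate and civilFromDays are hypothetical names introduced for illustration, not Velox or ORC APIs, and the conversion uses the widely known days-to-civil algorithm for the proleptic Gregorian calendar.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration only.
struct CivilDate {
  int64_t year;
  unsigned month;  // 1..12
  unsigned day;    // 1..31
};

// Converts a day count relative to 1970-01-01 into year/month/day.
CivilDate civilFromDays(int64_t days) {
  // Shift the epoch to 0000-03-01 so leap days fall at the end of a year.
  int64_t z = days + 719468;
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;  // 400-year cycles
  const unsigned dayOfEra = static_cast<unsigned>(z - era * 146097);
  const unsigned yearOfEra =
      (dayOfEra - dayOfEra / 1460 + dayOfEra / 36524 - dayOfEra / 146096) / 365;
  const unsigned dayOfYear =
      dayOfEra - (365 * yearOfEra + yearOfEra / 4 - yearOfEra / 100);
  const unsigned monthIndex = (5 * dayOfYear + 2) / 153;  // 0 = March
  const unsigned day = dayOfYear - (153 * monthIndex + 2) / 5 + 1;
  const unsigned month = monthIndex < 10 ? monthIndex + 3 : monthIndex - 9;
  assert(month >= 1 && month <= 12);
  // January and February belong to the next civil year after the shift.
  const int64_t year =
      static_cast<int64_t>(yearOfEra) + era * 400 + (month <= 2 ? 1 : 0);
  return CivilDate{year, month, day};
}
```

For example, a stored value of 0 corresponds to 1970-01-01 and 18993 corresponds to 2022-01-01.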

aditi-pandit · 2025-02-12T21:11:10Z

The motivation to keep TPCDS test updates minimal is that if due to any reason we need to revert a test, the ORC Reader support will not be impacted.

@majetideepak : I am a bit conflicted on this. I feel the ORC Reader commit should include the functional tests that establish the feature. This work takes care of that.

If there are specific tests that show issues (like the current Decimal one), then we should disable and fix them individually.

wypb · 2025-02-13T02:18:20Z

Hi @majetideepak, do you mean TestPrestoNativeIcebergTpcdsQueriesOrcUsingThrift.java and TestPrestoNativeIcebergTpcdsQueriesParquetUsingThrift.java? If so, it is because Velox currently does not support pushing the scanSpec_ filter down to the SelectiveDecimalColumnReader for ORC, which has a VELOX_CHECK(!scanSpec_->filter()); check. This causes some TPCDS queries to fail, so TestPrestoNativeIcebergTpcdsQueriesOrcUsingThrift#runAllQueries was rewritten separately.

majetideepak

@wypb, @aditi-pandit I now see that doDeletes got refactored and is not new.
Makes sense to cover all the existing testing for ORC.
I just have one comment.

majetideepak · 2025-02-13T13:36:10Z

...java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergTpcdsQueriesOrcUsingThrift.java

+        testTpcdsQ18();
+        testTpcdsQ19();
+        testTpcdsQ20();
+        // testTpcdsQ21();


Do we need to comment this out here, given we have a check inside AbstractTestNativeTpcdsQueries?

+1. Nice catch. Yeah these can be avoided.

@wypb : Seems like there was a misunderstanding about this review comment.

Since testTpcdsQ21 has the following condition

if (!storageFormat.equals("ORC")) { assertQuery(session, getTpcdsQuery("33")); }

then it should pass in runAllQueries. We don't need to comment it out in this function. We can uncomment the test call at this point.

Got it, I will refactor the code

@majetideepak @aditi-pandit I have moved runAllQueries() from TestPrestoNativeIcebergTpcdsQueriesOrcUsingThrift.java and TestPrestoNativeIcebergTpcdsQueriesParquetUsingThrift.java to AbstractTestNativeTpcdsQueries.java.

aditi-pandit · 2025-02-14T07:13:10Z

@wypb : Thanks for your quick turnaround. Seems like there was a misunderstanding about Deepak's comment. Please fix it. Else this PR is looking good for approval.

majetideepak

Thanks, @wypb

majetideepak · 2025-02-14T14:18:45Z

presto-docs/src/main/sphinx/presto-cpp.rst


 * Iceberg connector supports both V1 and V2 tables, including tables with delete files.

+* Supports reading and writing of DWRF and PARQUET file formats, ORC only supports reading.


nit: I think this should say
Supports reading and writing of DWRF and PARQUET file formats, supports reading ORC file format.
CC: @steveburnett

aditi-pandit

Thanks @wypb

presto-docs/src/main/sphinx/presto-cpp.rst

steveburnett

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

majetideepak

thanks, @wypb

majetideepak · 2025-02-18T14:46:39Z

@aditi-pandit do you have any other comments?

aditi-pandit

Thanks @wypb

wypb requested a review from a team as a code owner June 20, 2024 08:26

wypb force-pushed the orc_reader branch 3 times, most recently from 55a8d5b to 7325337 Compare June 21, 2024 01:52

tdcmeehan self-assigned this Jun 23, 2024

wypb force-pushed the orc_reader branch from e4e7c5d to 5e91bc9 Compare June 25, 2024 10:56

wypb changed the title ~~[native] Add support for ORC reader and add orc native tests~~ [native] Add support for ORC reader Jun 25, 2024

wypb force-pushed the orc_reader branch from 99f309f to 39750f1 Compare June 25, 2024 10:58

wypb force-pushed the orc_reader branch from d2aacde to d8d1c28 Compare June 26, 2024 11:53

wypb force-pushed the orc_reader branch 2 times, most recently from 0d3570c to 9615017 Compare June 28, 2024 09:32

wypb force-pushed the orc_reader branch from dd5bf8f to 78fdffd Compare June 28, 2024 09:38

aditi-pandit reviewed Jul 2, 2024

wypb force-pushed the orc_reader branch from 3c652cf to e3f1d8d Compare August 30, 2024 08:59

tdcmeehan requested a review from aditi-pandit August 30, 2024 16:35

aditi-pandit reviewed Sep 3, 2024

majetideepak reviewed Feb 13, 2025

wypb force-pushed the orc_reader branch 2 times, most recently from 684cdfc to 99f7c7b Compare February 14, 2025 04:55

wypb force-pushed the orc_reader branch from 8c21f2e to 7523bc7 Compare February 14, 2025 07:41

majetideepak previously approved these changes Feb 14, 2025

majetideepak reviewed Feb 14, 2025

aditi-pandit previously approved these changes Feb 14, 2025

steveburnett requested changes Feb 17, 2025

wypb dismissed stale reviews from aditi-pandit and majetideepak via 1dfdde8 February 18, 2025 01:37

wypb force-pushed the orc_reader branch 2 times, most recently from 4588dd0 to b6c744d Compare February 18, 2025 03:04

[native] Add support for ORC reader and add orc native tests

36fda46

wypb force-pushed the orc_reader branch from b6c744d to 36fda46 Compare February 18, 2025 03:08

steveburnett approved these changes Feb 18, 2025

majetideepak approved these changes Feb 18, 2025

aditi-pandit approved these changes Feb 18, 2025

aditi-pandit merged commit 8accda9 into prestodb:master Feb 18, 2025
61 checks passed

wypb deleted the orc_reader branch February 18, 2025 23:30

ethanyzhang mentioned this pull request Feb 22, 2025

Add support for ORC reader facebookincubator/velox#858

Closed

This was referenced Mar 10, 2025

Add release notes for 0.292 unix280/presto#5

Closed

Add release notes for 0.292 unix280/presto#6

Closed

prestodb-ci mentioned this pull request Mar 28, 2025

Add release notes for 0.292 #24825

Merged

30 tasks

unidevel mentioned this pull request Apr 25, 2025

Add release notes for 0.292 unix280/presto#23

Closed

30 tasks

This was referenced May 6, 2025

Add release notes for 0.292 unix280/presto#28

Closed

Add release notes for 0.292 unix280/presto#29

Closed

