Skip to content

Add support OrcReader based on DwrfReader#10194

Closed
wypb wants to merge 1 commit intofacebookincubator:mainfrom
wypb:orc_reader
Closed

Add support OrcReader based on DwrfReader#10194
wypb wants to merge 1 commit intofacebookincubator:mainfrom
wypb:orc_reader

Conversation

@wypb
Copy link
Contributor

@wypb wypb commented Jun 14, 2024

The ORC file format is used by many companies. Currently, the DWRF Reader in Velox can be used to read ORC files. This PR implements OrcReaderFactory based on DwrfReader and registers it in HiveConnectorFactory#initialize(). In this way, we can get a Reader that can read the ORC file format through dwio::common::getReaderFactory(FileFormat::ORC)->createReader(..).

CC: @Yuhta

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 14, 2024
@netlify
Copy link

netlify bot commented Jun 14, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 05f6027
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66723655850bb00008243a56

@wypb wypb force-pushed the orc_reader branch 9 times, most recently from 78ab47f to d5906fc Compare June 18, 2024 03:42
@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Jun 18, 2024
@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori
Copy link
Contributor

Hi @wypb, there are a few unit test failing internally. An example failure is

[ RUN      ] OrcReaderTest.testOrcReadAllType

E0618 14:16:53.845489 315752 Exceptions.h:67] Line: fbcode/velox/common/file/File.cpp:108, Function:LocalReadFile, Expression:  No such file or directory: /data/sandcastle/boxes/eden-trunk-hg-full-fbsource/fbcode/velox/dwio/dwrf/test/examples/orc_all_type.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: FILE_NOT_FOUND
Reason: No such file or directory: /data/sandcastle/boxes/eden-trunk-hg-full-fbsource/fbcode/velox/dwio/dwrf/test/examples/orc_all_type.orc
Retriable: False
Function: LocalReadFile
File: fbcode/velox/common/file/File.cpp
Line: 108

I wonder whether we need to update the getExampleFilePath() method used in the unit tests. Also wondering why these unit tests do not fail in CI here. Could you take a look?

@wypb wypb force-pushed the orc_reader branch 2 times, most recently from 7372895 to d0c48b2 Compare June 19, 2024 01:35
@wypb
Copy link
Contributor Author

wypb commented Jun 19, 2024

Hi @kagamiori Thank you for you review.
I checked the reason. CI here will not fail because current_path does not end with fbcode. Meta internally fails because current_path ends with fbcode, so velox/dwio/dwrf/test is added to the file path.

std::string getDataFilePath(
    const std::string& baseDir,
    const std::string& filePath) {
  std::string current_path = fs::current_path().c_str();
  if (boost::algorithm::ends_with(current_path, "fbcode")) {
    return current_path + "/" + baseDir + "/" + filePath;
  }
  return current_path + "/" + filePath;
}

The getExampleFilePath method should be rewritten in orc test. The baseDir parameter of the getDataFilePath method needs to be set to velox/dwio/orc/test.

@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kagamiori merged this pull request in 1a50a8a.

@kagamiori
Copy link
Contributor

Hi @kagamiori Thank you for you review. I checked the reason. CI here will not fail because current_path does not end with fbcode. Meta internally fails because current_path ends with fbcode, so velox/dwio/dwrf/test is added to the file path.

std::string getDataFilePath(
    const std::string& baseDir,
    const std::string& filePath) {
  std::string current_path = fs::current_path().c_str();
  if (boost::algorithm::ends_with(current_path, "fbcode")) {
    return current_path + "/" + baseDir + "/" + filePath;
  }
  return current_path + "/" + filePath;
}

The getExampleFilePath method should be rewritten in orc test. The baseDir parameter of the getDataFilePath method needs to be set to velox/dwio/orc/test.

Wow, I didn't even know we hard-code "fbcode" in our code...Thank you for looking into it and fix the unit test!

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 1a50a8af.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot pushed a commit that referenced this pull request Jul 30, 2024
Summary:
This is a follow-up PR of #10194 to address test failures in the out-of-source build:

```
E20240728 Exceptions.h:67] Line: velox/velox/common/file/File.cpp:112, Function:LocalReadFile, Expression:  No such file or directory: cmake-build-debug/velox/velox/dwio/orc/test/examples/TestStringDictionary.testRowIndex.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
E20240728 Exceptions.h:67] Line: velox/velox/common/file/File.cpp:112, Function:LocalReadFile, Expression:  No such file or directory: cmake-build-debug/velox/velox/dwio/orc/test/examples/complextypes_iceberg.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
E20240728 Exceptions.h:67] Line: velox/velox/common/file/File.cpp:112, Function:LocalReadFile, Expression:  No such file or directory: cmake-build-debug/velox/velox/dwio/orc/test/examples/orc_index_int_string.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
E20240728 Exceptions.h:67] Line: velox/velox/common/file/File.cpp:112, Function:LocalReadFile, Expression:  No such file or directory: cmake-build-debug/velox/velox/dwio/orc/test/examples/TestOrcFile.testDate1900.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
E20240728 Exceptions.h:67] Line: velox/velox/common/file/File.cpp:112, Function:LocalReadFile, Expression:  No such file or directory: cmake-build-debug/velox/velox/dwio/orc/test/examples/orc_all_type.orc, Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
```

Pull Request resolved: #10588

Reviewed By: pedroerp

Differential Revision: D60452930

Pulled By: Yuhta

fbshipit-source-id: adeb972bd89ed0696639ab8b3379549f3138abe3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants