@Kontinuation

Which issue does this PR close?

This is a minor, non-user-facing fix, so I have not created a GitHub issue. A detailed explanation is provided in the next section.

Rationale for this change

I have had this small patch in my local workspace for a while, and I think it is useful for other comet developers as well. The changes are as follows:

  1. Fix the library name used when the native library is not loaded from the JAR bundle. The documentation of System.loadLibrary states that "The libname argument must not contain any platform specific prefix, file extension or path", so the library name should be comet rather than libcomet.so or libcomet.dylib. This fix makes it easier to load a newly built comet library from the native/target directory when running Java tests.
  2. Properly handle exceptions without error messages in native code. When an NPE occurs in Scala code, the exception raised by the JVM may not carry an error message. The native code will then panic with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.200.154 executor driver): org.apache.comet.CometNativeException: General execution error with reason: Null pointer in get_object_class.
	at org.apache.comet.Native.executePlan(Native Method)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:107)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1$adapted(CometExecIterator.scala:106)
	at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:106)
	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:118)
	at org.apache.comet.CometExecIterator.next(CometExecIterator.scala:135)

It does not point to the source of the NPE, which makes troubleshooting difficult. This patch checks if the error message retrieved by the .getMessage JNI call is null and handles it specially. The error message becomes:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.200.154 executor driver): java.lang.NullPointerException
	at org.apache.comet.CometNativeSuite$$anon$1.next(CometNativeSuite.scala:37)
	at org.apache.comet.CometNativeSuite$$anon$1.next(CometNativeSuite.scala:35)
	at org.apache.comet.CometBatchIterator.next(CometBatchIterator.java:56)
	at org.apache.comet.Native.executePlan(Native Method)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:107)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1$adapted(CometExecIterator.scala:106)
	at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:106)
	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:118)
	at org.apache.comet.CometExecIterator.next(CometExecIterator.scala:135)
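
To illustrate the null-message case, here is a minimal Java-side sketch. It is not the native code from this patch: `ThrowableText` and `describe` are hypothetical names used only for illustration. It shows that an explicitly constructed `NullPointerException` returns `null` from `getMessage()`, and one way to fall back to the exception class name when that happens.

```java
// Illustrative sketch only: ThrowableText and describe are hypothetical
// names and are not part of Comet.
final class ThrowableText {
    // Returns "ClassName: message", or just the class name when the
    // exception carries no message (getMessage() returns null).
    static String describe(Throwable t) {
        String className = t.getClass().getName();
        String message = t.getMessage(); // null for `new NullPointerException()`
        return message == null ? className : className + ": " + message;
    }

    public static void main(String[] args) {
        System.out.println(describe(new NullPointerException()));
        System.out.println(describe(new IllegalStateException("bad state")));
    }
}
```

Any code that reads the message of a Java exception, whether in Java or through JNI, needs a branch like this one, because a `null` message is a valid result rather than an error.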

What changes are included in this PR?

This PR includes fixes for the minor problems mentioned above.
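
For the library name fix, the following sketch shows why the bare name matters. It assumes nothing about Comet's loader: `LibraryNames` and `platformFileName` are hypothetical names, and the only real API used is `System.mapLibraryName`, which applies the platform-specific prefix and extension to a bare library name.

```java
// Illustrative sketch only: LibraryNames and platformFileName are
// hypothetical names and are not part of Comet.
final class LibraryNames {
    // Maps a bare library name to the platform-specific file name,
    // adding the platform's prefix and extension.
    static String platformFileName(String bareName) {
        return System.mapLibraryName(bareName);
    }

    public static void main(String[] args) {
        // The bare name is decorated with the platform prefix and extension.
        System.out.println(platformFileName("comet"));
        // A name that already has a prefix and extension gets decorated
        // again, so it no longer matches the file on disk.
        System.out.println(platformFileName("libcomet.so"));
    }
}
```

The second call shows the failure mode: a name that already contains the prefix and extension is decorated a second time, so the lookup on java.library.path cannot find the built library.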

How are these changes tested?

Added a unit test for problem 2.

@Kontinuation Kontinuation changed the title from "Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path" to "fix: Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path" Sep 30, 2024
@Kontinuation Kontinuation marked this pull request as ready for review September 30, 2024 11:26
@viirya

viirya commented Sep 30, 2024

Thanks @Kontinuation

@andygrove andygrove merged commit afd28b9 into apache:main Oct 2, 2024
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
## Which issue does this PR close?


Closes #.

## Rationale for this change


## What changes are included in this PR?


```
cb3e977 perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin (apache#1007)
3df9d5c fix: Make comet-git-info.properties optional (apache#1027)
4033687 chore: Reserve memory for native shuffle writer per partition (apache#1022)
bd541d6 (public/main) remove hard-coded version number from Dockerfile (apache#1025)
e3ac6cf feat: Implement bloom_filter_agg (apache#987)
8d097d5 (origin/main) chore: Revert "chore: Reserve memory for native shuffle writer per partition (apache#988)" (apache#1020)
591f45a chore: Bump arrow-rs to 53.1.0 and datafusion (apache#1001)
e146cfa chore: Reserve memory for native shuffle writer per partition (apache#988)
abd9f85 fix: Fallback to Spark if named_struct contains duplicate field names (apache#1016)
22613e9 remove legacy comet-spark-shell (apache#1013)
d40c802 clarify that Maven central only has jars for Linux (apache#1009)
837c256 docs: Various documentation improvements (apache#1005)
0667c60 chore: Make parquet reader options Comet options instead of Hadoop options (apache#968)
0028f1e fix: Fallback to Spark if scan has meta columns (apache#997)
b131cc3 feat: Support `GetArrayStructFields` expression (apache#993)
3413397 docs: Update tuning guide (apache#995)
afd28b9 Quality of life fixes for easier hacking (apache#982)
18150fb chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled (apache#991)
a1599e2 chore: Update for 0.3.0 release, prepare for 0.4.0 development (apache#970)
```

## How are these changes tested?
