[SPARK-27416][SQL] UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size #24349
Conversation
… when two machines have different Oops size
… for better control
"Can one of the admins verify this patch?"

"It would be for performance, for the same reason all the basic Spark classes are registered; otherwise Kryo serializes the class name with each instance. Why wouldn't it matter? Maybe I'm missing something."
## What changes were proposed in this pull request?
This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing join strategy hints for the rest of Spark's existing join strategies: shuffle-hash, sort-merge, and cartesian-product. The hint names SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL deliberately differ from the internal code names in order to make them clearer to users and to better reflect the actual algorithms.
The hinted strategy will be used for the join with which it is associated if it is applicable/doable.
Conflict resolution rules in the case of multiple hints:
1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in a Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolution strategy, not specific to join strategy hints.
2. Conflicts between the two sides of the join:
a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```.
b) In case of the same strategy hint but a conflict in the build side, choose the build side based on the join type and size.
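For illustration, a minimal sketch of how these hints can be specified once this change is in place; `spark`, `t1`, `t2`, `df1`, and `df2` are placeholder names, not part of the patch:

```scala
// Illustrative only: the same kind of hint via SQL and via the Dataset API.
spark.sql(
  "SELECT /*+ SHUFFLE_MERGE(t1) */ t1.k1, t2.v2 FROM t1 JOIN t2 ON t1.k1 = t2.k2")

// Dataset API equivalent; hint names are case-insensitive.
df1.hint("shuffle_hash").join(df2, df1("k1") === df2("k2"))
```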
## How was this patch tested?
Added new UTs.
Closes apache#24164 from maryannxue/join-hints.
Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"The Kryo and Java serialization paths use the same method, which is why I think it would make no difference. Thanks."
"I'm thinking of what `KryoSerializer.newKryo` does; I'm guessing we should register all Unsafe Kryo-serializable classes by default."
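For context, a hedged sketch of what such default registration could look like; this is illustrative, not the actual `KryoSerializer.newKryo` code:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeMapData, UnsafeRow}

// Registering classes up front lets Kryo write a small integer ID instead of
// the fully qualified class name with every serialized instance.
def registerUnsafeClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[UnsafeRow])
  kryo.register(classOf[UnsafeArrayData])
  kryo.register(classOf[UnsafeMapData])
}
```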
… postfixOps edition

## What changes were proposed in this pull request?
Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds", which I'd like to simplify to things like "2.minutes" anyway.

## How was this patch tested?
Existing tests.

Closes apache#24314 from srowen/SPARK-27404.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
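As a hedged illustration of the kind of change described (not the exact patch), the postfix form warns without the `scala.language.postfixOps` import, while the dotted form does not:

```scala
import scala.concurrent.duration._

// Warns without `import scala.language.postfixOps`:
//   val timeout = 120000 milliseconds
// Equivalent forms that avoid the postfix warning:
val timeoutExplicit = 120000.milliseconds
val timeoutReadable = 2.minutes
```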
…ding with `.sql`

## What changes were proposed in this pull request?
While using vi or vim to edit the test files, `.swp` or `.swo` files are created, and attempting to run the test suite in the presence of these files causes errors like below:
```
[info] - subquery/exists-subquery/.exists-basic.sql.swp *** FAILED *** (117 milliseconds)
[info] java.io.FileNotFoundException: /Users/dbiswal/mygit/apache/spark/sql/core/target/scala-2.12/test-classes/sql-tests/results/subquery/exists-subquery/.exists-basic.sql.swp.out (No such file or directory)
[info] at java.io.FileInputStream.open0(Native Method)
[info] at java.io.FileInputStream.open(FileInputStream.java:195)
[info] at java.io.FileInputStream.<init>(FileInputStream.java:138)
[info] at org.apache.spark.sql.catalyst.util.package$.fileToString(package.scala:49)
[info] at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:247)
[info] at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$11(SQLQueryTestSuite.scala:192)
```
~~This minor PR adds these temp files to the ignore list.~~ While computing the list of test files to process, only consider files with the `.sql` extension. This makes sure the unwanted temp files created by various editors are ignored.

## How was this patch tested?
Verified manually.

Closes apache#24333 from dilipbiswal/dkb_sqlquerytest.
Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
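A minimal sketch of the filtering approach described above; the names are illustrative, not the exact `SQLQueryTestSuite` code:

```scala
import java.io.File

// Only files ending in ".sql" are treated as test inputs, so editor temp
// files such as ".exists-basic.sql.swp" are skipped.
def listSqlTestFiles(dir: File): Seq[File] =
  dir.listFiles().toSeq.filter(f => f.isFile && f.getName.endsWith(".sql"))
```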
…lity

## What changes were proposed in this pull request?
`Krb5LoginModule` supports a debug parameter that is not yet supported on the Spark side. This configuration makes it easier to debug authentication issues against Kafka. In this PR the `Krb5LoginModule` debug flag is controlled by either `sun.security.krb5.debug` or `com.ibm.security.krb5.Krb5Debug`. Additionally, some hardcoded values such as `ssl.truststore.location` were found, which could be error-prone if Kafka changes them, so the Kafka-defined constants are used in such cases.

## How was this patch tested?
Existing + additional unit tests + on cluster.

Closes apache#24204 from gaborgsomogyi/SPARK-27270.
Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
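A hedged sketch of the gating logic described (illustrative, not the exact Spark code): the debug flag keys off the two JVM Kerberos debug properties named above, whose accepted values are JDK-specific:

```scala
// Assumed semantics: a simple truthiness/presence check on the properties
// mentioned in the description; the real code may interpret values differently.
def isKrb5DebugEnabled(): Boolean =
  sys.props.get("sun.security.krb5.debug").exists(_.equalsIgnoreCase("true")) ||
    sys.props.contains("com.ibm.security.krb5.Krb5Debug")
```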
… and FromUnixTime
## What changes were proposed in this pull request?
SPARK-27199 introduced the use of `ZoneId` instead of `TimeZone` in a few date/time expressions.
There were three occurrences of `ctx.addReferenceObj("zoneId", zoneId)` in that PR, which had a bug: while the `java.time.ZoneId` base type is public, the actual concrete implementation classes are not, so using the 2-arg version of `CodegenContext.addReferenceObj` would incorrectly generate code that references non-public types (`java.time.ZoneRegion`, to be specific). The 3-arg version should be used instead, with the class name of the referenced object explicitly specified as the public base type.
One of such occurrences was caught in testing in the main PR of SPARK-27199 (apache#24141), for `DateFormatClass`. But the other 2 occurrences slipped through because there were no test cases that covered them.
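A simplified sketch of the fix, assuming a `CodegenContext` and a `ZoneId` are in scope as in an expression's `doGenCode`; the wrapper function name is hypothetical:

```scala
import java.time.ZoneId
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

def referenceZoneId(ctx: CodegenContext, zoneId: ZoneId): String = {
  // Without the explicit class name, codegen types the reference as
  // zoneId.getClass, which may be the non-public java.time.ZoneRegion.
  ctx.addReferenceObj("zoneId", zoneId, classOf[ZoneId].getName)
}
```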
Example of this bug in the current Apache Spark master, in a Spark Shell:
```
scala> Seq(("2016-04-08", "yyyy-MM-dd")).toDF("s", "f").repartition(1).selectExpr("to_unix_timestamp(s, f)").show
...
java.lang.IllegalAccessError: tried to access class java.time.ZoneRegion from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1
```
This PR fixes the codegen issues and adds the corresponding unit tests.
## How was this patch tested?
Enhanced tests in `DateExpressionsSuite` for `to_unix_timestamp` and `from_unixtime`.
Closes apache#24352 from rednaxelafx/fix-spark-27199.
Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"@srowen done. Let me know if you have any other advice."
## What changes were proposed in this pull request?
Finish the remaining work of #24317 and #9030:
a. Implement Kryo serialization for `UnsafeArrayData`.
b. Fix the `UnsafeMapData` Java/Kryo serialization issue when two machines have different Oops sizes.
c. Move the duplicated `getBytes()` code to `Utils`.

## How was this patch tested?
Corresponding unit tests have been added and tested.
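A hedged sketch of the general approach, not the exact patch: serialize a portable copy of the backing bytes rather than the raw memory layout, so the data can be rebuilt on a JVM whose Oops (ordinary object pointer) size, and hence object base offsets, differ. The class name is a stand-in for the Unsafe structures:

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Illustrative stand-in for a structure backed by an internal byte buffer.
class PortableBytes(private var bytes: Array[Byte]) extends KryoSerializable {
  def this() = this(Array.emptyByteArray) // Kryo needs a no-arg constructor

  override def write(kryo: Kryo, out: Output): Unit = {
    out.writeInt(bytes.length)
    out.writeBytes(bytes) // a plain byte copy, independent of memory layout
  }

  override def read(kryo: Kryo, in: Input): Unit = {
    val len = in.readInt()
    bytes = in.readBytes(len) // rebuild against the local JVM's layout
  }
}
```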