[SPARK-27416][SQL] UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size #24349
Conversation
… when two machines have different Oops size
… for better control
"Can one of the admins verify this patch?"

"It would be for performance, for the same reason all the basic Spark classes are registered; otherwise Kryo serializes the class name with each instance. Why wouldn't it matter? Maybe I'm missing something."
## What changes were proposed in this pull request?
This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing join strategy hints for the rest of Spark's existing join strategies: shuffle-hash, sort-merge, and cartesian-product. The hint names SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL deliberately differ from the internal code names in order to make them clearer to users and to better reflect the actual algorithms.
The hinted strategy will be used for the join with which it is associated if it is applicable/doable.
Conflict resolution rules in the case of multiple hints:
1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in a Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolution strategy, not specific to join strategy hints.
2. Conflicts between the two sides of the join:
a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```.
b) In case of the same strategy hint but a conflict in the build side, choose the build side based on the join type and size.
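For illustration, a minimal sketch of how these hints can be specified once this change is in place; `spark`, `t1`, `t2`, `df1`, and `df2` are placeholder names, not part of the patch:

```scala
// Illustrative only: the same kind of hint via SQL and via the Dataset API.
spark.sql(
  "SELECT /*+ SHUFFLE_MERGE(t1) */ t1.k1, t2.v2 FROM t1 JOIN t2 ON t1.k1 = t2.k2")

// Dataset API equivalent; hint names are case-insensitive.
df1.hint("shuffle_hash").join(df2, df1("k1") === df2("k2"))
```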
## How was this patch tested?
Added new UTs.
Closes apache#24164 from maryannxue/join-hints.
Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"The Kryo and Java serialization paths use the same method, which is why I think it would make no difference. Thanks."
"I'm thinking of what `KryoSerializer.newKryo` does; I'm guessing we should register all Unsafe Kryo-serializable classes by default."
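For context, a hedged sketch of what such default registration could look like; this is illustrative, not the actual `KryoSerializer.newKryo` code:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeMapData, UnsafeRow}

// Registering classes up front lets Kryo write a small integer ID instead of
// the fully qualified class name with every serialized instance.
def registerUnsafeClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[UnsafeRow])
  kryo.register(classOf[UnsafeArrayData])
  kryo.register(classOf[UnsafeMapData])
}
```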
… postfixOps edition

## What changes were proposed in this pull request?
Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds", which I'd like to simplify to things like "2.minutes" anyway.

## How was this patch tested?
Existing tests.

Closes apache#24314 from srowen/SPARK-27404.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
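As a hedged illustration of the kind of change described (not the exact patch), the postfix form warns without the `scala.language.postfixOps` import, while the dotted form does not:

```scala
import scala.concurrent.duration._

// Warns without `import scala.language.postfixOps`:
//   val timeout = 120000 milliseconds
// Equivalent forms that avoid the postfix warning:
val timeoutExplicit = 120000.milliseconds
val timeoutReadable = 2.minutes
```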
…ding with `.sql`

## What changes were proposed in this pull request?
While using vi or vim to edit the test files, `.swp` or `.swo` files are created, and attempting to run the test suite in the presence of these files causes errors like below:
```
[info] - subquery/exists-subquery/.exists-basic.sql.swp *** FAILED *** (117 milliseconds)
[info] java.io.FileNotFoundException: /Users/dbiswal/mygit/apache/spark/sql/core/target/scala-2.12/test-classes/sql-tests/results/subquery/exists-subquery/.exists-basic.sql.swp.out (No such file or directory)
[info] at java.io.FileInputStream.open0(Native Method)
[info] at java.io.FileInputStream.open(FileInputStream.java:195)
[info] at java.io.FileInputStream.<init>(FileInputStream.java:138)
[info] at org.apache.spark.sql.catalyst.util.package$.fileToString(package.scala:49)
[info] at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:247)
[info] at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$11(SQLQueryTestSuite.scala:192)
```
~~This minor PR adds these temp files to the ignore list.~~ While computing the list of test files to process, only consider files with the `.sql` extension. This makes sure the unwanted temp files created by various editors are ignored.

## How was this patch tested?
Verified manually.

Closes apache#24333 from dilipbiswal/dkb_sqlquerytest.
Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
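A minimal sketch of the filtering approach described above; the names are illustrative, not the exact `SQLQueryTestSuite` code:

```scala
import java.io.File

// Only files ending in ".sql" are treated as test inputs, so editor temp
// files such as ".exists-basic.sql.swp" are skipped.
def listSqlTestFiles(dir: File): Seq[File] =
  dir.listFiles().toSeq.filter(f => f.isFile && f.getName.endsWith(".sql"))
```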
…lity

## What changes were proposed in this pull request?
`Krb5LoginModule` supports a debug parameter that is not yet supported on the Spark side. This configuration makes it easier to debug authentication issues against Kafka. In this PR the `Krb5LoginModule` debug flag is controlled by either `sun.security.krb5.debug` or `com.ibm.security.krb5.Krb5Debug`. Additionally, some hardcoded values such as `ssl.truststore.location` were found, which could be error-prone if Kafka changes them, so the Kafka-defined constants are used in such cases.

## How was this patch tested?
Existing + additional unit tests + on cluster.

Closes apache#24204 from gaborgsomogyi/SPARK-27270.
Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
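A hedged sketch of the gating logic described (illustrative, not the exact Spark code): the debug flag keys off the two JVM Kerberos debug properties named above, whose accepted values are JDK-specific:

```scala
// Assumed semantics: a simple truthiness/presence check on the properties
// mentioned in the description; the real code may interpret values differently.
def isKrb5DebugEnabled(): Boolean =
  sys.props.get("sun.security.krb5.debug").exists(_.equalsIgnoreCase("true")) ||
    sys.props.contains("com.ibm.security.krb5.Krb5Debug")
```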
… and FromUnixTime
## What changes were proposed in this pull request?
SPARK-27199 introduced the use of `ZoneId` instead of `TimeZone` in a few date/time expressions.
There were three occurrences of `ctx.addReferenceObj("zoneId", zoneId)` in that PR, which had a bug: while the `java.time.ZoneId` base type is public, the actual concrete implementation classes are not, so using the 2-arg version of `CodegenContext.addReferenceObj` would incorrectly generate code that references non-public types (`java.time.ZoneRegion`, to be specific). The 3-arg version should be used instead, with the class name of the referenced object explicitly specified as the public base type.
One of such occurrences was caught in testing in the main PR of SPARK-27199 (apache#24141), for `DateFormatClass`. But the other 2 occurrences slipped through because there were no test cases that covered them.
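A simplified sketch of the fix, assuming a `CodegenContext` and a `ZoneId` are in scope as in an expression's `doGenCode`; the wrapper function name is hypothetical:

```scala
import java.time.ZoneId
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

def referenceZoneId(ctx: CodegenContext, zoneId: ZoneId): String = {
  // Without the explicit class name, codegen types the reference as
  // zoneId.getClass, which may be the non-public java.time.ZoneRegion.
  ctx.addReferenceObj("zoneId", zoneId, classOf[ZoneId].getName)
}
```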
Example of this bug in the current Apache Spark master, in a Spark Shell:
```
scala> Seq(("2016-04-08", "yyyy-MM-dd")).toDF("s", "f").repartition(1).selectExpr("to_unix_timestamp(s, f)").show
...
java.lang.IllegalAccessError: tried to access class java.time.ZoneRegion from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1
```
This PR fixes the codegen issues and adds the corresponding unit tests.
## How was this patch tested?
Enhanced tests in `DateExpressionsSuite` for `to_unix_timestamp` and `from_unixtime`.
Closes apache#24352 from rednaxelafx/fix-spark-27199.
Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"@srowen done. Let me know if you have any other advice."
## What changes were proposed in this pull request?
Finish the remaining work of #24317 and #9030:
a. Implement Kryo serialization for `UnsafeArrayData`.
b. Fix the `UnsafeMapData` Java/Kryo serialization issue when two machines have different Oops sizes.
c. Move the duplicated `getBytes()` code to `Utils`.

## How was this patch tested?
Corresponding unit tests have been added and tested.
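A hedged sketch of the general approach, not the exact patch: serialize a portable copy of the backing bytes rather than the raw memory layout, so the data can be rebuilt on a JVM whose Oops (ordinary object pointer) size, and hence object base offsets, differ. The class name is a stand-in for the Unsafe structures:

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Illustrative stand-in for a structure backed by an internal byte buffer.
class PortableBytes(private var bytes: Array[Byte]) extends KryoSerializable {
  def this() = this(Array.emptyByteArray) // Kryo needs a no-arg constructor

  override def write(kryo: Kryo, out: Output): Unit = {
    out.writeInt(bytes.length)
    out.writeBytes(bytes) // a plain byte copy, independent of memory layout
  }

  override def read(kryo: Kryo, in: Input): Unit = {
    val len = in.readInt()
    bytes = in.readBytes(len) // rebuild against the local JVM's layout
  }
}
```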