
Conversation

@pengbo (Contributor) commented Apr 11, 2019

What changes were proposed in this pull request?

Finish the remaining work of #24317, #9030:
a. Implement Kryo serialization for UnsafeArrayData (see the sketch after this list).
b. Fix the UnsafeMapData Java/Kryo serialization issue when two machines have different Oops sizes.
c. Move the duplicated "getBytes()" code to Utils.
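
As a rough sketch of the approach in (a) and (c), this is the KryoSerializable pattern that `UnsafeRow` already follows, with the raw bytes as the serialized form (class and field names here are illustrative, not the actual patch):

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical stand-in for an Unsafe* class backed by a byte region.
class BytesBackedData(var bytes: Array[Byte]) extends KryoSerializable {
  def this() = this(Array.emptyByteArray) // Kryo needs a no-arg constructor

  // Serialize by writing out the underlying bytes (what getBytes() returns).
  override def write(kryo: Kryo, out: Output): Unit = {
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  // Deserialize by reading the byte region back in.
  override def read(kryo: Kryo, in: Input): Unit = {
    val len = in.readInt()
    bytes = in.readBytes(len)
  }
}
```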

How was this patch tested?

Corresponding unit tests have been added and run.

@pengbo (Contributor, Author) commented Apr 11, 2019

@srowen Sorry for accidentally closing #24340.

To continue our discussion:

> OK do we need to add these classes to the list that is automatically registered with Kryo by default too?

I am not quite sure what the main advantage is. If it's for performance, I think in this case it makes no difference.

@AmplabJenkins commented

Can one of the admins verify this patch?

@srowen (Member) commented Apr 11, 2019

It would be for performance, for the same reason all the basic Spark classes are registered; otherwise Kryo serializes the class name with each instance. Why wouldn't it matter? Maybe I'm missing something.

## What changes were proposed in this pull request?

This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names, SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL, are partly different from the internal code names in order to make them clearer to users and to better reflect the actual algorithms.

The hinted strategy will be used for the join with which it is associated if it is applicable; a usage sketch follows the conflict rules below.

Conflict resolution rules in case of multiple hints:
1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolution strategy, not specific to join strategy hints.
2. Conflicts between two sides of the join:
  a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```.
  b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size.
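
For illustration, a usage sketch of the hints (assuming `df1`, `df2`, and `spark` already exist; hint names per the description above):

```scala
// Dataset API: per rule 1, the top hint node wins within one side,
// so this requests a shuffle-hash join.
val joined = df1.hint("merge").hint("shuffle_hash")
  .join(df2, df1("k1") === df2("k2"))

// SQL: request a shuffle-hash join built on t1.
spark.sql(
  "SELECT /*+ SHUFFLE_HASH(t1) */ t1.k1, t2.v2 FROM t1 JOIN t2 ON t1.k1 = t2.k2")
```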

## How was this patch tested?

Added new UTs.

Closes apache#24164 from maryannxue/join-hints.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@pengbo (Contributor, Author) commented Apr 11, 2019

> It would be for performance, for the same reason all the basic Spark classes are registered; otherwise Kryo serializes the class name with each instance. Why wouldn't it matter? Maybe I'm missing something.

Kryo and Java serialization use the same method; that's why I think it would make no difference.
I am not very familiar with how Kryo is used in Spark yet; let me know what I'm missing and what I should do next.

Thanks

@srowen (Member) commented Apr 11, 2019

I'm thinking of what KryoSerializer.newKryo does; I'm guessing we should register all Unsafe Kryo-serializable classes by default.
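
For reference, a minimal sketch of why registration helps (illustrative; not the actual `KryoSerializer.newKryo` code):

```scala
import com.esotericsoftware.kryo.Kryo

val kryo = new Kryo()
// Unregistered classes pay for the fully-qualified class name on every
// serialized instance; registered classes are written as a small integer ID.
kryo.register(classOf[Array[Byte]])
kryo.register(Class.forName("org.apache.spark.sql.catalyst.expressions.UnsafeRow"))
```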

srowen and others added 5 commits April 11, 2019 13:43
… postfixOps edition

## What changes were proposed in this pull request?

Fix build warnings -- see some details below.

But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds", which I'd like to simplify to things like "2.minutes" anyway.
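
For example, a before/after sketch of the duration change (illustrative, not a literal diff from the patch):

```scala
import scala.concurrent.duration._

// Before: postfix syntax, warns unless scala.language.postfixOps is imported.
// val timeout = 120000 milliseconds

// After: dotted syntax, no warning and easier to read.
val timeout = 2.minutes
```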

## How was this patch tested?

Existing tests.

Closes apache#24314 from srowen/SPARK-27404.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ding with `.sql`

## What changes were proposed in this pull request?
While using vi or vim to edit the test files, .swp or .swo files are created, and attempting to run the test suite in the presence of these files causes errors like the one below:
```
[info] - subquery/exists-subquery/.exists-basic.sql.swp *** FAILED *** (117 milliseconds)
[info]   java.io.FileNotFoundException: /Users/dbiswal/mygit/apache/spark/sql/core/target/scala-2.12/test-classes/sql-tests/results/subquery/exists-subquery/.exists-basic.sql.swp.out (No such file or directory)
[info]   at java.io.FileInputStream.open0(Native Method)
[info]   at java.io.FileInputStream.open(FileInputStream.java:195)
[info]   at java.io.FileInputStream.<init>(FileInputStream.java:138)
[info]   at org.apache.spark.sql.catalyst.util.package$.fileToString(package.scala:49)
[info]   at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:247)
[info]   at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$11(SQLQueryTestSuite.scala:192)
```
~~This minor PR adds these temp files to the ignore list.~~
While computing the list of test files to process, only files with the `.sql` extension are considered. This makes sure the unwanted temp files created by various editors are ignored (see the sketch below).
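
A minimal sketch of the filtering idea (names are illustrative, not the exact `SQLQueryTestSuite` code):

```scala
import java.io.File

// Only pick up real test inputs; editor temp files such as .swp/.swo
// never match the .sql extension and are skipped.
def listSqlTestFiles(dir: File): Seq[File] =
  dir.listFiles().toSeq.filter(_.getName.endsWith(".sql"))
```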
## How was this patch tested?
Verified manually.

Closes apache#24333 from dilipbiswal/dkb_sqlquerytest.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…lity

## What changes were proposed in this pull request?

`Krb5LoginModule` supports a debug parameter, which is not yet exposed on the Spark side. This configuration makes it easier to debug authentication issues against Kafka.

In this PR, the `Krb5LoginModule` debug flag is controlled by either `sun.security.krb5.debug` or `com.ibm.security.krb5.Krb5Debug`.

Additionally, some hardcoded values like `ssl.truststore.location` were found, which could be error-prone if Kafka ever changes them, so the Kafka-defined constants are used in such cases.
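
A minimal sketch of deriving the flag (assumed property semantics; the actual Spark code may read these differently):

```scala
// Kerberos debug is enabled when either JVM property is set to "true".
val krbDebug: Boolean =
  java.lang.Boolean.getBoolean("sun.security.krb5.debug") ||
  java.lang.Boolean.getBoolean("com.ibm.security.krb5.Krb5Debug")

// The flag would then flow into the generated JAAS entry, e.g.
// "... Krb5LoginModule required debug=true ...;"
```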

## How was this patch tested?

Existing + additional unit tests + on cluster.

Closes apache#24204 from gaborgsomogyi/SPARK-27270.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
… and FromUnixTime

## What changes were proposed in this pull request?

SPARK-27199 introduced the use of `ZoneId` instead of `TimeZone` in a few date/time expressions.
There were 3 occurrences of `ctx.addReferenceObj("zoneId", zoneId)` in that PR, which had a bug: while the `java.time.ZoneId` base type is public, the actual concrete implementation classes are not, so using the 2-arg version of `CodegenContext.addReferenceObj` would incorrectly generate code that references non-public types (`java.time.ZoneRegion`, to be specific). The 3-arg version should be used, with the class name of the referenced object explicitly specified as the public base type.

One of such occurrences was caught in testing in the main PR of SPARK-27199 (apache#24141), for `DateFormatClass`. But the other 2 occurrences slipped through because there were no test cases that covered them.
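
The root cause is easy to demonstrate standalone (sketch only; the actual fix is the 3-arg `addReferenceObj` call described above):

```scala
import java.time.ZoneId

val zoneId = ZoneId.of("America/Los_Angeles")
// The concrete class is the package-private java.time.ZoneRegion, so
// generated code that spells out this type fails with IllegalAccessError.
println(zoneId.getClass.getName) // java.time.ZoneRegion
// Generated code should reference the public base type instead.
println(classOf[ZoneId].getName) // java.time.ZoneId
```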

Example of this bug in the current Apache Spark master, in a Spark Shell:
```
scala> Seq(("2016-04-08", "yyyy-MM-dd")).toDF("s", "f").repartition(1).selectExpr("to_unix_timestamp(s, f)").show
...
java.lang.IllegalAccessError: tried to access class java.time.ZoneRegion from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1
```

This PR fixes the codegen issues and adds the corresponding unit tests.

## How was this patch tested?

Enhanced tests in `DateExpressionsSuite` for `to_unix_timestamp` and `from_unixtime`.

Closes apache#24352 from rednaxelafx/fix-spark-27199.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@pengbo (Contributor, Author) commented Apr 12, 2019

@srowen done. Let me know if you have any other advice.

@pengbo closed this Apr 12, 2019
@pengbo reopened this Apr 12, 2019
@pengbo closed this Apr 12, 2019