Conversation

@hvanhovell
Contributor

@hvanhovell hvanhovell commented Jun 29, 2017

What changes were proposed in this pull request?

WindowExec currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference to the actual input data in the buffer used by GeneratedMutableProjections. Things go wrong when the input object (or its backing bytes) is reused for something else. This can happen in window functions when the operator starts spilling to disk: when reading back the spill files, the UnsafeSorterSpillReader reuses the buffer to which the UnsafeRow points, leading to strange corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example FIRST, LAST, MIN & MAX.
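
To make the failure mode concrete, here is a minimal sketch (not code from this PR) built on Spark's catalyst internals: an `UnsafeProjection` reuses its output buffer between calls, so a retained `UnsafeRow` reference silently changes unless it is copied, which is the same effect the spill reader's buffer reuse has on the aggregation buffer.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{IntegerType, StructType}

object UnsafeReuseSketch {
  def main(args: Array[String]): Unit = {
    val schema  = new StructType().add("v", IntegerType)
    val project = UnsafeProjection.create(schema)

    // The projection writes every result into the same backing buffer.
    val kept   = project(InternalRow(1))         // reference only, no copy
    val copied = project(InternalRow(1)).copy()  // owns its own bytes

    project(InternalRow(42))                     // overwrites the shared buffer

    println(kept.getInt(0))    // prints 42 -- the retained reference was "corrupted"
    println(copied.getInt(0))  // prints 1  -- the copy is unaffected
  }
}
```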

This was not seen before because the old spilling logic rarely performed actual spills and instead used an in-memory page. That page was not cleaned up during window processing, which ensured that unsafe objects pointed to their own dedicated memory locations. This changed with #16909; since that PR, Spark spills more eagerly.

This PR provides a surgical fix because we are close to releasing Spark 2.2. It simply makes sure that there can be no object reuse, at the expense of a little bit of performance. We will follow up with a more refined solution at a later point.
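
The general shape of such a fix is a defensive copy of every Unsafe-backed value before it is stored in a long-lived buffer. A hedged sketch of that idea follows (the helper and its name are illustrative, not necessarily the exact code in the patch):

```scala
import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeMapData, UnsafeRow}
import org.apache.spark.unsafe.types.UTF8String

object DefensiveCopies {
  // Copy any value whose bytes may live in a shared, reusable buffer, so that later
  // buffer reuse (e.g. while reading back spill files) cannot corrupt the stored value.
  def copyUnsafeBacked(value: Any): Any = value match {
    case r: UnsafeRow       => r.copy()
    case a: UnsafeArrayData => a.copy()
    case m: UnsafeMapData   => m.copy()
    case s: UTF8String      => UTF8String.fromBytes(s.getBytes.clone()) // fresh byte[]
    case other              => other // primitives and immutable values are safe as-is
  }
}
```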

How was this patch tested?

Added a regression test to DataFrameWindowFunctionsSuite.
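
For illustration, here is a hedged sketch of the kind of check such a test can make (not the exact test added here). It assumes a `spark` session in scope and the `spark.sql.windowExec.buffer.spill.threshold` setting introduced by #16909 to force spilling almost immediately:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`

// Spill after buffering a single row so the spill path is exercised.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 1L)

val df = spark.range(0, 1000).select(
  ($"id" % 10).as("key"),
  $"id",
  struct($"id", lit("payload")).as("s"))  // complex, Unsafe-backed value

val w = Window.partitionBy($"key").orderBy($"id")
val result = df.select($"key", $"id", first($"s").over(w).as("first_s")).collect()
// Before the fix, spilled partitions could surface corrupted structs in first_s;
// after it, first_s always matches the struct built from the partition's smallest id.
```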

@gatorsmile
Member

LGTM @liancheng @cloud-fan

@liancheng
Contributor

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jun 29, 2017

Test build #78932 has finished for PR 18470 at commit a41335f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2017

Test build #78933 has finished for PR 18470 at commit 7e30ae5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Jun 30, 2017

Test build #78941 has finished for PR 18470 at commit 7e30ae5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 30, 2017
…lling

## What changes were proposed in this pull request?
`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference to the actual input data in the buffer used by `GeneratedMutableProjections`. Things go wrong when the input object (or its backing bytes) is reused for something else. This can happen in window functions when the operator starts spilling to disk: when reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to strange corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.

This was not seen before because the old spilling logic rarely performed actual spills and instead used an in-memory page. That page was not cleaned up during window processing, which ensured that unsafe objects pointed to their own dedicated memory locations. This changed with #16909; since that PR, Spark spills more eagerly.

This PR provides a surgical fix because we are close to releasing Spark 2.2. It simply makes sure that there can be no object reuse, at the expense of a little bit of performance. We will follow up with a more refined solution at a later point.

## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.

Author: Herman van Hovell <[email protected]>

Closes #18470 from hvanhovell/SPARK-21258.

(cherry picked from commit e2f32ee)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.2/2.1!

@asfgit asfgit closed this in e2f32ee Jun 30, 2017
asfgit pushed a commit that referenced this pull request Jun 30, 2017
@zzcclp
Contributor

zzcclp commented Jun 30, 2017

Hi, @cloud-fan, @hvanhovell, after merging this PR into branch-2.1, there are some compile errors:

1. `value WINDOW_EXEC_BUFFER_SPILL_THRESHOLD is not a member of object org.apache.spark.sql.internal.SQLConf`
2. `overloaded method value json with alternatives: (jsonRDD: org.apache.spark.rdd.RDD[String])org.apache.spark.sql.DataFrame <and> (jsonRDD: org.apache.spark.api.java.JavaRDD[String])org.apache.spark.sql.DataFrame <and> (paths: String*)org.apache.spark.sql.DataFrame <and> (path: String)org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.sql.Dataset[String])` (a 2.1-compatible call is sketched below)
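
The second error just means the backported test uses the `Dataset[String]` overload of `DataFrameReader.json`, which (per the alternatives listed above) does not exist in 2.1; there the call would have to go through an `RDD[String]`, roughly as follows (hypothetical example data, assuming a `spark` session in scope):

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._  // assumes a SparkSession named `spark`

// Spark 2.1's DataFrameReader only offers json(RDD[String]), json(JavaRDD[String])
// and the path-based overloads, so a Dataset[String] has to be converted first.
val jsonDS: Dataset[String] = Seq("""{"a": 1}""", """{"a": 2}""").toDS()
val df = spark.read.json(jsonDS.rdd)  // 2.1-compatible; json(jsonDS) needs 2.2+
```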

@cloud-fan
Contributor

ah the spilling logic is not in 2.1, let me revert it, sorry for the trouble.
