Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

In this test case, we are verifying that the result of an UDF is cached when the underlying data frame is cached and that the udf is not evaluated again when the cached data frame is used.

To reduce the runtime we do :

  1. Use a single partition dataframe, so the total execution time of UDF is more deterministic.
  2. Cut down the size of the dataframe from 10 to 2.
  3. Reduce the sleep time in the UDF from 5secs to 2secs.
  4. Reduce the failafter condition from 3 to 2.

With the above change, it takes about 4 secs to cache the first dataframe. And subsequent check takes a few hundred milliseconds.
The new runtime for 5 consecutive runs of this test is as follows :

[info] - cache UDF result correctly (4 seconds, 906 milliseconds)
[info] - cache UDF result correctly (4 seconds, 281 milliseconds)
[info] - cache UDF result correctly (4 seconds, 288 milliseconds)
[info] - cache UDF result correctly (4 seconds, 355 milliseconds)
[info] - cache UDF result correctly (4 seconds, 280 milliseconds)

How was this patch tested?

This is s test fix.

@dilipbiswal
Copy link
Contributor Author

cc @gatorsmile

test("cache UDF result correctly") {
val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh...since we fail after 2 seconds we may pass this even in case it doesn't work. Shall we put it to 3? or 2500 at least?

Copy link
Contributor Author

@dilipbiswal dilipbiswal Oct 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 OK, please correct me on this one. So we insert 2 rows .. i.e two invocation of the UDF amounting to 2 * 2sec = 4 secs of execution. So wouldn't a 2 sec fail time be ok ? Also marco, i did run this 10 times back to back without a problem - fyi.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I do think this will pass 100% times, my concern was that in case of a regression we might fail detecting it. But yes, with the repartition to 1 you're right, I haven't considered it, otherwise they may have run in parallel. So this seems enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 Thanks marco.

@SparkQA
Copy link

SparkQA commented Oct 5, 2018

Test build #96980 has finished for PR 22638 at commit 97d18fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 2c6f4d6 Oct 6, 2018
@dilipbiswal
Copy link
Contributor Author

Thanks a lot @gatorsmile @mgaido91

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… cache UDF result correctly

## What changes were proposed in this pull request?
In this test case, we are verifying that the result of an UDF  is cached when the underlying data frame is cached and that the udf is not evaluated again when the cached data frame is used.

To reduce the runtime we do :
1) Use a single partition dataframe, so the total execution time of UDF is more deterministic.
2) Cut down the size of the dataframe from 10 to 2.
3) Reduce the sleep time in the UDF from 5secs to 2secs.
4) Reduce the failafter condition from 3 to 2.

With the above change, it takes about 4 secs to cache the first dataframe. And subsequent check takes a few hundred milliseconds.
The new runtime for 5 consecutive runs of this test is as follows :
```
[info] - cache UDF result correctly (4 seconds, 906 milliseconds)
[info] - cache UDF result correctly (4 seconds, 281 milliseconds)
[info] - cache UDF result correctly (4 seconds, 288 milliseconds)
[info] - cache UDF result correctly (4 seconds, 355 milliseconds)
[info] - cache UDF result correctly (4 seconds, 280 milliseconds)
```
## How was this patch tested?
This is s test fix.

Closes apache#22638 from dilipbiswal/SPARK-25610.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants