Conversation

@sethah (Contributor) commented Feb 27, 2017

What changes were proposed in this pull request?

JIRA: SPARK-19745

Reorganize SVCAggregator to avoid serializing the coefficients. This patch also makes the gradient array a lazy val, which avoids materializing a large array on the driver before shipping the class to the executors. This improvement stems from #16037. In fact, probably all ML aggregators can benefit from this.

We can either:

a) separate the gradient improvement into another patch,
b) keep what's here plus add the lazy evaluation to all other aggregators in this patch, or
c) keep it as is.
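For context, the core of the change is roughly the following pattern (a minimal sketch with illustrative names, not the actual SVCAggregator code):

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: the aggregator is shipped driver -> executor inside the task
// closure. Making the gradient array a lazy val means the driver never
// materializes or serializes the large zero array; each executor
// allocates it on first use.
class ExampleAggregator(numFeatures: Int) extends Serializable {
  private var weightSum = 0.0
  private var lossSum = 0.0

  // Allocated lazily on the executor, not eagerly on the driver.
  private lazy val gradientSumArray = Array.ofDim[Double](numFeatures)

  def add(features: Vector, label: Double, weight: Double): this.type = {
    // ... accumulate this instance's gradient into gradientSumArray ...
    weightSum += weight
    this
  }
}
```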

How was this patch tested?

This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, comparing the size to what we know it should be: numCoefficients * 8 bytes. I'm not sure there is a good way to do that right now. We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.

@sethah (Contributor, Author) commented Feb 27, 2017

ping @yanboliang @AnthonyTruchet

@sethah (Contributor, Author) commented on the diff

Note this eliminates the check for sparse coefficients, but I'm not sure it was ever necessary. We don't do it in other aggregators, and since this is a private class we don't need to worry about people misusing it. I'd appreciate other thoughts.

A contributor commented on the diff

I think we used to check it? At some point there was a BLAS operation that only worked for dense vectors. I believe that check was eliminated during the linear model refactoring for 2.0/2.1.

@sethah (Contributor, Author) commented on the diff

I changed this up a bit. I think using the check but marking the array as @transient lazy val is slightly more explicit about what we are trying to avoid here.

@AnthonyTruchet commented on the diff, Feb 27, 2017

Race condition with my review below...
Thanks for anticipating it; please ignore it then :-)

@sethah (Contributor, Author) commented Feb 27, 2017

Jenkins test this please.

@AnthonyTruchet left a review

OK for me, just one request for a check below.

@AnthonyTruchet commented on the diff

I'm not very confident in my understanding of the @transient annotation when combined with lazy, but I would expect it here. Can someone more versed confirm or refute?

http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/
http://stackoverflow.com/questions/34769220/difference-when-serializing-a-lazy-val-with-or-without-transient (unanswered)
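For reference, the pattern those links describe is roughly this (a generic sketch, not PR code):

```scala
class Model(val data: Seq[Double]) extends Serializable {
  // Dropped from the serialized form; recomputed on first access after
  // deserialization, since a lazy val keeps its initializer around.
  @transient lazy val summary: Double = data.sum
}
```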

@sethah (Contributor, Author) commented on the diff

Actually this question is slightly different from what I was referring to above. We don't use @transient here because we do need to serialize this when we send the gradient updates back to the driver. The reason for making it lazy is that we don't need to serialize the array of zeros; we can just initialize it on the workers and avoid the communication cost.
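In other words (a toy sketch of the trade-off, not PR code), the two annotations behave differently at each end of the shuffle:

```scala
class Agg(n: Int) extends Serializable {
  // Wrong here: a @transient lazy val would drop the field when the executor's
  // partial result is serialized back, losing the accumulated gradients.
  // @transient lazy val gradientSumArray = Array.ofDim[Double](n)

  // Right here: a plain lazy val. Never forced on the driver, so nothing big
  // ships out; forced and filled on the executor, so the filled array does
  // serialize back with the partial result, which is what the reduce needs.
  lazy val gradientSumArray = Array.ofDim[Double](n)
}
```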

@AnthonyTruchet commented on the diff

Thanks for taking the time to explain :-)

@SparkQA commented Feb 27, 2017

Test build #73505 has finished for PR 17076 at commit 7f22675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor) commented Feb 28, 2017

@sethah a quick glance at the screenshots seems to indicate the processing time went up? Which seems a bit odd. Of course, it's a small test, so maybe it's just noise.

@sethah (Contributor, Author) commented Feb 28, 2017

Those screenshots were only meant to show the shuffle write data. I'm not even sure the data sizes (number of rows) were the same; only the number of columns was, and that is what determines the shuffle write. I can do some performance testing if we think it's necessary.

@MLnick (Contributor) commented Mar 1, 2017

Ah ok, makes sense.

A contributor commented on the diff

So we are moving the dimension check here, yes? That seems better than at L446, in fact.

@MLnick (Contributor) commented Mar 1, 2017

Looks fine to me. @sethah, does the lazy val for the gradient array impact the LoR and linear regression aggregators too? Any others? Since this is a fairly compact change, I think we can maybe go with your option (b) and do all the aggregators here in this PR.

@imatiach-msft (Contributor) commented

@sethah This is a nice optimization! Small suggestion: for the shuffle write performance improvement, would it be possible to test this on a larger dataset to highlight the improvements, and also to verify that other performance characteristics did not change (e.g. the duration, which @MLnick mentioned went up very slightly, though it seems due to a benign cause: the number of rows was different)?

A contributor commented on the diff

Should the extra require be added:

```scala
require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
```

similar to LogisticRegression.scala L1587 (in the add method), since this is now more similar to the LogisticAggregator?
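i.e., something along these lines at the top of add (a sketch; the exact signature is assumed):

```scala
// Assumes org.apache.spark.ml.feature.Instance(label, weight, features).
def add(instance: Instance): this.type = {
  val Instance(label, weight, features) = instance
  require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
  if (weight == 0.0) return this
  // ... accumulate this instance's loss and gradient ...
  this
}
```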

A contributor commented on the diff

Yes good catch - LoR and LinR both have this check.

@imatiach-msft (Contributor) commented

@MLnick I checked LinearRegression.scala, LogisticRegression.scala, and AFTSurvivalRegression.scala, and all of them already do this. Are there any other aggregators with a similar code structure?

@yanboliang (Contributor) commented

@sethah Thanks for the good catch. I verified this optimization and found it indeed reduced the size of the shuffle data. This looks good to me. BTW, like @MLnick's suggestion, could you add the lazy evaluation for the gradient array to all other aggregators in this PR? Since it's a small change, I'd prefer to do it here. Thanks.

@MLnick (Contributor) commented Mar 2, 2017

@imatiach-msft LinearRegression, LogisticRegression, and AFTSurvivalRegression do not have the lazy; they only do private val gradientSumArray ..., so they would need to be updated.
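So for each of those aggregators the change would essentially be (a sketch, with `dim` standing in for the actual dimension variable):

```scala
// Before: allocated eagerly on the driver and serialized with the task closure.
private val gradientSumArray = Array.ofDim[Double](dim)

// After: allocated on first use on each executor; the driver ships no array.
private lazy val gradientSumArray = Array.ofDim[Double](dim)
```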

@yanboliang (Contributor) commented

+1 @MLnick. It's a three-line change; shall we update it here?

@MLnick (Contributor) commented Mar 2, 2017

@yanboliang yeah I agree we can do it in this PR.

@imatiach-msft (Contributor) commented

@MLnick sorry, you're right. I somehow missed that the "lazy" was missing on gradientSumArray. It would be great to update them as well.

@sethah (Contributor, Author) commented Mar 2, 2017

Thanks all for the review. I'll make the updates today.

@sethah (Contributor, Author) commented Mar 3, 2017

Updated. I was able to verify that adding the lazy val does make the task binary broadcast variable smaller, i.e. it no longer varies with the size of the features. We could potentially test this by serializing the closure and checking its size, but that might be for another JIRA, or we can just leave it. We have not done things like this in the past (then again, we keep running into this issue).
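If we ever wanted such a test, the crude version would be to measure the serialized object directly (a sketch using plain Java serialization; the helper name is made up):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializationTestUtil {
  // Hypothetical helper: serialized size in bytes of any Serializable object.
  def serializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    buffer.size()
  }
}

// In a test, a fresh (never-forced) aggregator's serialized size should stay
// roughly flat as numFeatures grows, instead of growing by ~numFeatures * 8 bytes.
```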

We could also follow the pattern used in WeightedLeastSquares.Aggregator, which initializes every aggregation variable to its default value and then explicitly calls an init method. This is a bit clearer, but only a bit. Maybe we can discuss it further in #17094, which is partly rethinking this interface anyway.
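For comparison, the init-style pattern looks roughly like this (shape only; not the actual WeightedLeastSquares.Aggregator code):

```scala
class InitStyleAggregator extends Serializable {
  private var initialized = false
  private var gradientSumArray: Array[Double] = null // cheap default on the driver

  // Explicitly called on the executor before the first add().
  def init(numFeatures: Int): Unit = {
    gradientSumArray = Array.ofDim[Double](numFeatures)
    initialized = true
  }
}
```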

@imatiach-msft (Contributor) commented

I took another brief look; the updates look great. Thank you!

@SparkQA commented Mar 3, 2017

Test build #73794 has finished for PR 17076 at commit 19be223.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2017

Test build #73792 has finished for PR 17076 at commit 207669f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) commented Mar 3, 2017

The failure is spurious (the latest build passed); I'll merge this into master. Thanks, all.

@asfgit closed this in 93ae176 on Mar 3, 2017