Conversation

@datumbox
Contributor

What changes were proposed in this pull request?

The original ALS was performing unnecessary casting on the user and item ids because the protected checkedCast() method required a Double. I removed the casts and refactored the method to receive Any and handle all permitted numeric values efficiently.

How was this patch tested?

I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.

    case v: Int => v // Avoid unnecessary casting
    case v: Number =>
      val intV = v.intValue()
      if (v == intV) { // True for Byte/Short, Long within the Int range and Double/Float with no fractional part.
Contributor Author

One could write this differently and explicitly handle all the permitted types, but that would lead to duplicate code. Instead, what I do here is convert the number to an Int and compare it with the original Number (see the short sketch after this list). If the two values are identical, one of the following is true:

  • The value is Byte or Short.
  • The value is Long but within the Integer range.
  • The value is Double or Float but without any fractional part.
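A minimal sketch of that reasoning in plain Scala (my own illustration with a hypothetical fitsInt helper, using the doubleValue comparison the discussion settles on below):

    // Compare the boxed number against its own Int conversion; the equality only
    // holds when the value fits in Int and has no fractional part.
    def fitsInt(v: java.lang.Number): Boolean = v.doubleValue == v.intValue

    fitsInt(123L)           // true: Long within Int range
    fitsInt(3000000000L)    // false: intValue overflows, so the comparison fails
    fitsInt(123.0)          // true: integral Double within Int range
    fitsInt(123.5)          // false: the fractional part survives in doubleValue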

@datumbox datumbox changed the title Removed unnecessary castings and refactored checked casts in ALS. [SPARK-19733][ML]Removed unnecessary castings and refactored checked casts in ALS. Feb 25, 2017
@srowen
Member

srowen commented Feb 25, 2017

I get it, but is that cast actually slowing things down measurably? Because this change also adds some overhead in the checks it does. I think it's better to let the cast to double deal with the source nature of the number consistently.

@SparkQA

SparkQA commented Feb 25, 2017

Test build #3582 has finished for PR 17059 at commit 98d6481.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@datumbox
Contributor Author

@srowen I believe that this needs to be fixed for 2 reasons:

  1. Casting the ids to double just to convert them back to integer is not an elegant solution and is rather confusing.
  2. The double casting puts more strain on the garbage collector; I've personally measured this in an earlier version, with and without the hack.

Finally, I do not believe the proposed fix slows things down, as it does a similar number of comparisons to the original code. If you still believe this is not worth it, feel free to close the PR.

@datumbox
Contributor Author

Concerning the failed Scala style test: it is caused by the long inline comment that I added to explain the if statement. If you decide to approve the PR, I can remove it. Cheers! :)

@srowen
Member

srowen commented Feb 25, 2017

I get it, though the drawback is mostly that you've duplicated, in a way, the machinery that would construe any number as something comparable to the min/max int value. (Does this not miss the case of a Scala Long?) I like that it catches the case of Double values that aren't integral, but that's quite a corner case anyway.

@datumbox
Contributor Author

Could you explain what you mean by "duplicating"? It is safe with Scala Long; I did lots of tests to ensure that it works well. If you require any changes, I'm happy to update the PR.

@srowen
Member

srowen commented Feb 25, 2017

You are effectively handling the cases that casting handles. Doesn't a Scala long get boxed another time now? I bet it works, just wondering why that's not handled like Int. I'm neutral on this and wouldn't merge it myself but others are welcome to.

@datumbox
Contributor Author

jenkins retest this please

@datumbox
Contributor Author

@srowen Yes, the behaviour of the method remains the same. This patch helped me get a measurable improvement in GC overhead on Spark 2.0, so I thought it would be beneficial for others. Anyway, thanks for the comments. :)

@datumbox
Contributor Author

datumbox commented Feb 26, 2017

I decided to provide a few more benchmarks in order to alleviate some of the concerns raised by @srowen.

To reproduce the results, add the following snippet to the ALSSuite class:

  test("Speed difference") {
    val (training, test) =
      genExplicitTestData(numUsers = 200, numItems = 400, rank = 2, noiseStd = 0.01)

    val runs = 1000
    var totalTime = 0.0
    println("Performing "+runs+" runs")
    println("Run Id,Time (secs)")
    for(i <- 0 until runs) {
      val t0 = System.currentTimeMillis
      testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01, targetRMSE = 0.1)
      val secs = (System.currentTimeMillis - t0)/1000.0
      println(i+","+secs)
      totalTime += secs
    }
    println("AVG Execution Time: "+(totalTime/runs)+" secs")
  }

To test both solutions, I collected 1000 samples for each (took ~1 hour for each). Here you can see the detailed output for the original and the proposed code.

| Code | Mean Execution Time (secs) | Std |
| --- | --- | --- |
| Original | 4.75521 | 0.81237 |
| Proposed | 4.56276 | 0.72790 |

Using an unpaired t-test to compare the two means we find that the proposed code is faster and the result is statistically significant (p-value < .0001).
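As a rough sanity check on that claim (my own back-of-the-envelope computation, not part of the PR), the Welch t statistic for two samples of 1000 runs each works out to roughly 5.6, comfortably beyond the p < .0001 threshold:

    // Welch's t for two independent samples of n = 1000 runs each,
    // using the means and standard deviations reported above.
    val (meanOrig, sdOrig) = (4.75521, 0.81237)
    val (meanProp, sdProp) = (4.56276, 0.72790)
    val n = 1000.0
    val t = (meanOrig - meanProp) / math.sqrt(sdOrig * sdOrig / n + sdProp * sdProp / n)
    println(f"t = $t%.2f")  // ~5.58, with roughly 2000 degrees of freedom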

Below I summarize why I believe the original code needs to change:

  1. Casting user and item ids to double and then to integer is a hacky and indirect way to validate that the ids are numeric and within integer range. The proposed code covers all the corner cases in a clear and direct way. As an added bonus, it catches Doubles and Floats with a fractional part.
  2. Given that the ALS implementation requires ids with int values, it is expected that the majority of users encode their ids as Integer. The proposed solution avoids any casting in that case while reducing the casting in all the other cases. This avoids putting unnecessary strain on the garbage collector, something you can observe if you profile the execution on a large dataset.
  3. The proposed solution is not slower than the original; if anything, it is slightly faster.

Could someone from the admins help re-test the code?

@srowen
Member

srowen commented Feb 27, 2017

That's compelling regarding performance. It's not big, but it's not trivial either.
My remaining concern is whether you're handling all the cases the original did. Number covers a lot, but does it include all the SQL decimal types? I think the common casting mechanism would cover those. Granted, it's odd if someone tries to use those values, but ideally it would not work differently after this change.

I am still not sure about the Long case -- don't you want to handle Scala Long like Int here for efficiency?

@datumbox
Contributor Author

@srowen: Thanks for the comments. We are getting there. :)

I will handle the Long case as you suggest.

If you think people use SQL decimal types, I can include them at the end of the pattern matching. This will lead to some duplicate code though, because I need to write the same if statement twice. Any thoughts?

@datumbox
Contributor Author

Ignore my comment about duplicate code. It can be written to avoid it. I will investigate handling the SQL decimal types as you recommended and I will update the code tonight.

@datumbox
Contributor Author

@srowen The following snippet handles Longs explicitly. It could be rewritten to remove the duplicate code by introducing booleans for overflow detection, but I don't think it is worth it. In theory you could also handle other types such as Byte and Short explicitly, but I think that's overkill.

As far as I saw, all SQL numeric types inherit from Number, so comparing their doubleValue with their intValue is enough to check whether they are within integer range.

    val u = udf { (n: Any) =>
      n match {
        case v: Int => v
        case v: Long =>
          val intV = v.intValue
          if (v == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        // case v: Byte => v.toInt
        // case v: Short => v.toInt
        case v: Number =>
          val intV = v.intValue
          if (v.doubleValue == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        case _ => throw new IllegalArgumentException("invalid type")
      }
    }

Personally, I would remove the explicit Long case, as it introduces duplicate code and does not help much. The remaining snippet avoids doing any casting if the id is an Int (which should be the majority of cases and yields the biggest memory/speed gains) or non-numeric, and handles all the corner cases (all Scala/Java numeric types plus SQL numerics). Agree?

@srowen
Member

srowen commented Feb 27, 2017

This one would be good for @MLnick to look at.

A Scala Long will match java.lang.Number, right? If so it works, even if there's another conversion there, but maybe that one is trivial, and no more common a case to optimize for than others.

@datumbox
Contributor Author

Yeah, Scala Long matches. Here is the "stand-alone" script that I used to confirm that everything works ok (tested on Spark 2.1):

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val u = udf { (n: Any) =>
  n match {
    case v: Int => v
    case v: Number =>
      val intV = v.intValue
      if (v.doubleValue == intV) {
        intV
      } else {
        throw new IllegalArgumentException("out of range")
      }
    case _ => throw new IllegalArgumentException("invalid type")
  }
}

val df = sqlContext.range(10)
  .withColumn("int_success", lit(123))
  .withColumn("long_success", lit(1231L))
  .withColumn("long_fail", lit(1231000000000L))
  .withColumn("decimal_success", lit(123).cast(DecimalType(5, 2)))
  .withColumn("decimal_fail", lit(123.1).cast(DecimalType(5, 2)))
  .withColumn("double_success", lit(123.0))
  .withColumn("double_fail", lit(123.1))
  .withColumn("double_fail2", lit(1231000000000.0))
  .withColumn("string_fail", lit("123.1"))

// these work fine
df.select(u(df.col("int_success"))).show
df.select(u(df.col("long_success"))).show
df.select(u(df.col("decimal_success"))).show
df.select(u(df.col("double_success"))).show

// these fail with out of int range exception
df.select(u(df.col("long_fail"))).show
df.select(u(df.col("decimal_fail"))).show
df.select(u(df.col("double_fail"))).show
df.select(u(df.col("double_fail2"))).show

// this fails with invalid type exception
df.select(u(df.col("string_fail"))).show

Cool, I'll commit the changes tonight so that @MLnick can check the final code.

@srowen I really appreciate your input; you did raise good points and especially for handling the SQL datatypes. Thank you.

@datumbox
Contributor Author

@srowen @MLnick I updated the PR based on what was discussed above and tested it again on Spark 2.1. I also updated the coding style and the exception message.

The changes requested by Sean should not affect the speed of the proposed solution but rather help us catch all the corner cases. I tested this by generating a fresh batch of 1000 samples using the latest implementation (details here). The mean of the sample is 4.53837, which is no different from the mean of the code I committed yesterday (p-value 0.4265 - the H0 of equal means can't be rejected). So we are good.

Could any of the admins help re-test this because the "Jenkins, retest this please" did not work for me earlier?

    case v: Number =>
      val intV = v.intValue()
      // Checks if number within Int range and has no fractional part.
      if (v.doubleValue == intV) {
Contributor

Sorry, I'm not sure if this is a good idea due to floating point precision... the code above doesn't seem to do this check, it just calls toInt. However, if this is absolutely necessary, I would hope that we could give the user some way to specify the Int range or precision. Also, if we are going to go ahead with this change, then we should add some tests to verify the case where the exception is thrown, but without some ability to specify the precision I'm not sure this is the correct thing to do (?).

Contributor Author

@datumbox datumbox Feb 28, 2017

@imatiach-msft: In this snippet we deal with user ids and item ids. Those should not have fractional parts. What I do here is convert the number to an Int and compare it with its double value. If the values are identical, one of the following is true:

  • The value is Byte or Short.
  • The value is Long but within the Integer range.
  • The value is Double or Float within the Integer range and without any fractional part.

I think this snippet is fine.

Contributor

Yeah the reality is we mostly expect Int or Long ids, but want to support any numeric ID as long as it falls within Int range - so passing in fractional float/double just doesn't make sense and we'll throw an error.

Member

I think the test does work, but it may be slightly more direct to check whether v.doubleValue % 1.0 == 0.0? This also works.

Contributor Author

The way it is written, it checks two things at the same time: 1) that the value is within integer range (so if it overflows, the equality will not hold) and 2) that it has no fractional part. The modulo check only covers you for point #2, not #1 (see the small example below).
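A quick illustration of that gap, with a value I picked for the example (not from the thread):

    // 3e9 has no fractional part, so the modulo test alone would accept it,
    // but it is outside Int range; the doubleValue == intValue comparison rejects it.
    val v: java.lang.Number = 3000000000.0
    println(v.doubleValue % 1.0 == 0.0)   // true  -> the modulo check passes
    println(v.doubleValue == v.intValue)  // false -> intValue saturates at Int.MaxValue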

Contributor

Also, if for some reason we do decide to go with the check against the fractional part, which I would not recommend, we should add a test to verify that the error message is thrown, and possibly add this change in accepted input format to the documentation, since it may break some users (they may have to do an additional conversion to int for their values prior to calling this API if they run into such floating point precision issues).

Contributor Author

@datumbox datumbox Feb 28, 2017

@imatiach-msft I am aware of the implications of floating point precision and I understand your concerns.

Having said that, even allowing user and item ids to be double/float is not a good idea; we just keep it for backwards compatibility, I guess. Also note that the current implementation of Spark 2.1 will actually take your 0.9999999999999996 value and silently cast it to Int (so it becomes 0)! In my view the only permitted types should have been Integer, Long and BigInteger.

I don't have strong opinions about refactoring anything in the Number case, as performance-wise it simply does not matter. The point of this PR is to optimize the common case where the id is an Int, because the casting of the current approach generates twice as much data as the original dataset (of course it is GCed, but at a cost).

@MLnick @srowen It's your call.

Contributor

@datumbox I agree that calling toInt in the previous code was not the best decision either; if we really wanted to support the double type correctly we would probably have done the check with some precision. Maybe if we add to the documentation that double types for user/item ids should be avoided, I would be ok with it.
Also, could you add a test case to verify the exception is thrown for doubles with a fraction, similar to the test case I sent you (with a catch on the exception type and withClue on the message)? We should clean up the error message too, since it is thrown not only when the value is not in Integer range (Int.Max/Min) but also when there is a fractional part.

Contributor Author

@datumbox datumbox Feb 28, 2017

@imatiach-msft I already tested it; check the snippet posted here. I can add it to ALSSuite if necessary. The exception message is technically correct, since a number with a fractional part is not in Integer range (plus you get to see the actual value of the number). Returning a different exception would require having two separate if checks instead of one. Do we really want that?

Guys, to be honest this PR solves a very simple thing and I did not anticipate it would be so controversial. May I suggest we agree on the final set of changes and merge it? Perhaps we can tackle any other concerns in new pull requests?

Contributor

@imatiach-msft imatiach-msft Feb 28, 2017

@datumbox Sorry, the link doesn't seem to work for me. If you could add it to the ALSSuite, that would be great. Oh no, I don't mean to have a different check, just a clearer exception message - something like: throw new IllegalArgumentException(s"ALS only supports values in Integer range and without fractional part for columns ${$(userCol)} and ${$(itemCol)}. Value $n was either out of Integer range or contained a fractional part that could not be converted.")

As a general rule all checkins should have tests associated with them to verify functionality, and we should aim for 100% code coverage to validate all production code paths, even if it is not always realistically possible.
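A rough sketch of the kind of test being requested, in the style of the ALSSuite snippets reviewed below (the test name and message fragment are placeholders, and the usual ALSSuite imports are assumed):

    test("checkedCast rejects fractional ids") {
      val checkedCast = new ALS().checkedCast
      val df = spark.range(1)
      withClue("fractional Double ids should be rejected") {
        val e = intercept[SparkException] {
          df.select(checkedCast(lit(123.1))).collect()
        }
        assert(e.getMessage.contains("fractional part"))
      }
    }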

@imatiach-msft
Contributor

@datumbox I like the changes; I just had a minor concern about the code where we call v.intValue and then compare this to v.doubleValue -- due to precision issues, I'm not sure this is desirable, since the data could come from any source and be slightly modified outside the Int range, and the previous code does not do this check.

@datumbox
Contributor Author

@imatiach-msft This comparison is intentional and checks two things: that the number is within integer range and that the id does not have any non-zero digits after the decimal point. If the number is outside integer range, the intValue will overflow and the number will not match its double value. Moreover, if it has a fractional part it will also not match the integer.

Effectively we are making sure that the UserId and ItemId have values within integer range.

@MLnick
Contributor

MLnick commented Feb 28, 2017

@datumbox you mention there is GC & performance overhead, which makes some sense. Have you run into problems at very large scale (like millions of users & items & ratings)? I did regression tests here - admittedly vs 1.6.1 at that time.

@datumbox
Contributor Author

@MLnick Yeap! I have run into problems with the original implementation when dealing with several billion records. This is when removing the casting starts paying off. :)

@MLnick
Contributor

MLnick commented Feb 28, 2017

Ok, let me take a look at this. Thanks!

@MLnick
Contributor

MLnick commented Feb 28, 2017

ok to test

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73585 has finished for PR 17059 at commit fc3acfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imatiach-msft
Contributor

imatiach-msft commented Feb 28, 2017

@datumbox I added an additional comment -- since you believe we should keep the current code, we should add a test case and slightly modify the error message to let the user know which error condition they are hitting (to be more user-friendly). Please take a look, thanks!

@imatiach-msft
Contributor

@datumbox thanks for adding the tests, I've approved the review, assuming the other reviewers are ok with minimal double support. I haven't seen withClue used in the same way you've used it, but it seems fine to me (usually the withClue would contain the error message, and you wouldn't have the extra assert). Thank you for the nice performance improvement!

@datumbox
Contributor Author

@imatiach-msft
Contributor

@datumbox ah ok, thanks for the link. LGTM!

@datumbox
Contributor Author

datumbox commented Mar 1, 2017

I updated the exception message and added unit tests to cover the checkedCast method, as @imatiach-msft suggested. I have not corrected the value of regParam, but I had a look and it is safe to do so.

Thanks everyone for the comments and recommendations. Please consider merging this PR to master, because many people are working on ALS lately and conflicts will soon arise.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73620 has finished for PR 17059 at commit eb33189.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MLnick MLnick left a comment


Made a few very minor comments.

    val e: SparkException = intercept[SparkException] {
      df.select(checkedCast( lit(1231000000000L) )).collect()
    }
    assert(e.getMessage.contains("either out of Integer range or contained a fractional part"))
Contributor

May as well factor this string out to a val.

    val df = spark.sqlContext.range(1)

    withClue("Valid Integer Ids") {
      df.select(checkedCast( lit(123) )).collect()
Contributor

Could you remove the spaces around lit(123) here and throughout for code style?


  /**
   * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
   * out of integer range.
Contributor

Should probably add to comment "... out of integer range or contains a fractional part"

      // before:
      .select(checkedCast(col($(userCol)).cast(DoubleType)),
        checkedCast(col($(itemCol)).cast(DoubleType)), r)
      // after:
      .select(checkedCast(col($(userCol))),
        checkedCast(col($(itemCol))), r)
Contributor

Nit - but can this now go on one line if short enough?


test("CheckedCast") {
val checkedCast = new ALS().checkedCast
val df = spark.sqlContext.range(1)
Contributor

This can just be spark.range(1)?

    assert(intercept[SparkException] {
      als.fit(df.select(df("user_big").as("user"), df("item"), df("rating")))
    // before:
    }.getCause.getMessage.contains("was out of Integer range"))
    // after:
    }.getCause.getMessage.contains("out of Integer range or contained a fractional part"))
Contributor

May as well clean this message up into a val also while we're at it :)

@MLnick
Contributor

MLnick commented Mar 1, 2017

The discussion around numeric types was at #12762 (comment). The checked cast was introduced because it was previously implicitly cast to Int anyway, and would mangle Long ids without any warning. I did think about only supporting Int & Long there - but we decided to just support any numeric type within range. I guess it was around the time that we were porting most pipeline components to support all numeric input types.
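To make the mangling concrete (my own illustration; the values are made up), a plain narrowing cast silently wraps or truncates instead of failing:

    // A silent .toInt on an out-of-range Long wraps around with no warning,
    // and a fractional Double is silently truncated - exactly what the checked cast prevents.
    println(3000000000L.toInt)  // -1294967296
    println(123.9.toInt)        // 123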

In hindsight I would have just gone for restricting it to Int & Long.

In reality, in the case of ALS, users really should not be using Double for ids (e.g. an id of 123.4 is clearly wrong; 123.0 could occur, I guess, more by accident or poor schema, but in that case we support it here). So I think the likelihood of any real impact from being restrictive about fractionals is really low.

@MLnick
Contributor

MLnick commented Mar 1, 2017

@datumbox left a few minor comments to clean up. Overall looks good. I think it is fine for 2.2.

For the regParam even though it is a minor thing it would be best to just create a new JIRA & PR for it separately.

@MLnick
Contributor

MLnick commented Mar 1, 2017

@imatiach-msft thanks for reviewing and requesting the test case, I was going to ask for one :)

@datumbox
Contributor Author

datumbox commented Mar 1, 2017

@MLnick I accepted all proposed changes. Please retest. :)

Ok, I will create a new JIRA & PR for the regParam update but it will go down in history as the laziest commit ever. :p

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73693 has finished for PR 17059 at commit a1e32aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@datumbox
Contributor Author

datumbox commented Mar 1, 2017

@srowen @MLnick I had a typo in the committed version... Could you please retest it? I don't think I have the rights to do it myself.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73697 has finished for PR 17059 at commit 0b49a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 1, 2017
## What changes were proposed in this pull request?

In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1.

I changed the one in the train() method to 0.1, so now it matches the default value that is visible to Spark users. The method is marked with DeveloperApi, so it should not affect users. Whenever we use that particular method we provide all parameters, so the default does not matter. The only exception is the unit tests in ALSSuite, but the change does not break them.

Note: This PR should get the award for the laziest commit in Spark history. Originally I wanted to correct this in another PR, but MLnick [suggested](apache#17059 (comment)) creating a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close it. :)

## How was this patch tested?

Unit-tests

Author: Vasilis Vryniotis <[email protected]>

Closes apache#17121 from datumbox/als_regparam.
@MLnick
Contributor

MLnick commented Mar 1, 2017

@datumbox once a PR is OK'ed to test by a committer, it should automatically retest on each commit.

@datumbox
Contributor Author

datumbox commented Mar 1, 2017

@MLnick Cool, I did not know that. I just updated the commit messages. I think I'm good with this PR from my side.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73706 has finished for PR 17059 at commit 361d9cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73718 has finished for PR 17059 at commit 3050f6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MLnick MLnick left a comment

LGTM

@MLnick
Contributor

MLnick commented Mar 2, 2017

Merged to master. Thanks @datumbox, and everyone for reviews.

@asfgit asfgit closed this in 625cfe0 Mar 2, 2017
@datumbox
Contributor Author

datumbox commented Mar 2, 2017

@MLnick @srowen @imatiach-msft Thank you for your comments and for helping make this improvement robust.
