Closed
Changes from 13 commits
4 changes: 2 additions & 2 deletions R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1444,9 +1444,9 @@ test_that("column functions", {
df <- createDataFrame(list(list(a = as.Date("1997-02-28"),
b = as.Date("1996-10-30"))))
result1 <- collect(select(df, alias(months_between(df[[1]], df[[2]]), "month")))[[1]]
- expect_equal(result1, 3.93548387)
+ expect_equal(result1, 3.93429023)
result2 <- collect(select(df, alias(months_between(df[[1]], df[[2]], FALSE), "month")))[[1]]
- expect_equal(result2, 3.935483870967742)
+ expect_equal(result2, 3.934290231832276)

# Test array_contains(), array_max(), array_min(), array_position(), element_at() and reverse()
df <- createDataFrame(list(list(list(1L, 2L, 3L)), list(list(6L, 5L, 4L))))
4 changes: 2 additions & 2 deletions python/pyspark/sql/functions.py
@@ -1122,9 +1122,9 @@ def months_between(date1, date2, roundOff=True):

>>> df = spark.createDataFrame([('1997-02-28 10:30:00', '1996-10-30')], ['date1', 'date2'])
>>> df.select(months_between(df.date1, df.date2).alias('months')).collect()
Member

As an aside, I would have expected months_between to return an integer, i.e. just the difference in months ignoring the day, but that's not what other DBs do. Browsing some links like https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.commsql.doc/doc/r0053631.html and https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/MONTHS_BETWEEN.htm, I see that some implementations just assume all months, including Feb, have 31 days (!?).

I agree that this is more accurate, but is it less consistent with Hive or other DBs? Maybe it's already not consistent.
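
To make the two conventions concrete, here is a minimal Python sketch of the fractional-month calculation, assuming a simplified model: whole-month difference plus the remaining seconds divided by a month length. It is not Spark's implementation; the function name and constants are illustrative, time zones are ignored, and the special cases where the result is a whole number are omitted. Only the two divisors come from this PR.

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_31_DAY_MONTH = 31 * SECONDS_PER_DAY  # divisor before this PR
SECONDS_PER_MEAN_MONTH = 2629746                 # divisor introduced by this PR


def months_between(d1, d2, seconds_per_month, round_off=True):
    """Simplified model: month difference plus a fractional remainder."""
    month_diff = (d1.year * 12 + d1.month) - (d2.year * 12 + d2.month)
    seconds_in_day1 = d1.hour * 3600 + d1.minute * 60 + d1.second
    seconds_in_day2 = d2.hour * 3600 + d2.minute * 60 + d2.second
    seconds_diff = ((d1.day - d2.day) * SECONDS_PER_DAY
                    + seconds_in_day1 - seconds_in_day2)
    diff = month_diff + seconds_diff / seconds_per_month
    # Round to 8 digits, mirroring the roundOff flag in the diff.
    return round(diff * 1e8) / 1e8 if round_off else diff


date1 = datetime(1997, 2, 28, 10, 30, 0)
date2 = datetime(1996, 10, 30)
print(months_between(date1, date2, SECONDS_PER_31_DAY_MONTH))
print(months_between(date1, date2, SECONDS_PER_MEAN_MONTH))
```

For the docstring inputs this should reproduce the old value 3.94959677 with the 31-day divisor and the new value 3.94866424 with the mean-month divisor.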

Member Author

months_between is just one particular case, where the fix affects only the 3rd digit of the fractional part. I think it is more important to have a more precise year/month duration in the other cases, where an interval is converted to its duration in seconds (msec, usec). Using 372 days per year (12 * 31) leads to significant and visible errors in such conversions.
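
As a quick check of the alternative to 372 days per year, the following Python sketch derives the mean Gregorian month length in seconds using exact fractions. The variable names are illustrative; the leap-year rule and the resulting value 2629746 are the ones used for `SECONDS_PER_MONTH` in this PR.

```python
from fractions import Fraction

# Mean Gregorian year: 365 days, plus one leap day every 4 years,
# minus one every 100 years, plus one every 400 years.
MEAN_YEAR_DAYS = Fraction(365) + Fraction(1, 4) - Fraction(1, 100) + Fraction(1, 400)
SECONDS_PER_DAY = 60 * 60 * 24

seconds_per_year = MEAN_YEAR_DAYS * SECONDS_PER_DAY
seconds_per_month = seconds_per_year / 12

print(float(MEAN_YEAR_DAYS))  # 365.2425
print(seconds_per_year)       # 31556952
print(seconds_per_month)      # 2629746
```

Both results are whole numbers, so the month length can be stored as an integer constant without rounding.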

- [Row(months=3.94959677)]
+ [Row(months=3.94866424)]
>>> df.select(months_between(df.date1, df.date2, False).alias('months')).collect()
- [Row(months=3.9495967741935485)]
+ [Row(months=3.9486642436189654)]
"""
sc = SparkContext._active_spark_context
return Column(sc._jvm.functions.months_between(
@@ -1273,9 +1273,9 @@ case class AddMonths(startDate: Expression, numMonths: Expression)
examples = """
Examples:
> SELECT _FUNC_('1997-02-28 10:30:00', '1996-10-30');
- 3.94959677
+ 3.94866424
> SELECT _FUNC_('1997-02-28 10:30:00', '1996-10-30', false);
- 3.9495967741935485
+ 3.9486642436189654
""",
since = "1.5.0")
// scalastyle:on line.size.limit
@@ -17,9 +17,8 @@

package org.apache.spark.sql.catalyst.plans.logical

- import java.util.concurrent.TimeUnit
-
import org.apache.spark.sql.catalyst.expressions.Attribute
+ import org.apache.spark.sql.catalyst.util.DateTimeUtils.MILLIS_PER_MONTH
import org.apache.spark.sql.types.MetadataBuilder
import org.apache.spark.unsafe.types.CalendarInterval

@@ -28,9 +27,7 @@ object EventTimeWatermark {
val delayKey = "spark.watermarkDelayMs"

def getDelayMs(delay: CalendarInterval): Long = {
- // We define month as `31 days` to simplify calculation.
- val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- delay.milliseconds + delay.months * millisPerMonth
+ delay.milliseconds + delay.months * MILLIS_PER_MONTH
Contributor

What we need here is seconds per month, not seconds per year. I think we should still assume 31 days per month here.

Contributor

Can you point out where we need seconds/days per year in the codebase?

Member Author

@MaxGekk MaxGekk Oct 4, 2019

I believe any place, including this one, where we need a duration (in seconds or fractions of a second). The difference between months_between() and this place is that months_between uses the month length to calculate the fraction of a month, and 28 or 31 days per month doesn't really matter there because it affects only the 2nd or 3rd digit of the fraction. Here we operate on bigger numbers, where months add up to years, and it starts to matter how many days we use per year. Let's say we calculate the duration of 10 years, which is 120 months. If we use 31 days per month, this duration is 31 * 120 = 10 * 372 = 3720 days, but if one year is 365.2425 days, then 10 years is about 3652 days. The difference is 3720 - 3652 = 68 days, so the calculation error is more than 2 months. That matters, I believe.
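
The 10-year arithmetic above can be checked with a few lines of Python. This is only a sketch of the comparison; the variable names are illustrative, and the mean-month constant is the `SECONDS_PER_MONTH` value from this PR.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_MONTH = 2629746  # mean Gregorian month, as defined in this PR

months = 10 * 12  # a 10-year interval expressed in months

days_with_31_day_months = months * 31
days_with_mean_months = months * SECONDS_PER_MONTH / SECONDS_PER_DAY

print(days_with_31_day_months)                          # 3720
print(days_with_mean_months)                            # 3652.425
print(days_with_31_day_months - days_with_mean_months)  # about 67.575 days
```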

Member

months_between is sort of a special case because "31 days per month" is (it seems) actually how it is supposed to work, correctly.

It's rare that someone would specify "1 month" here, let alone "10 years", right? Or am I missing something? These are things like watermark intervals. Not that the semantics don't matter; it's just quite a corner case.

I therefore just don't feel strongly either way about it. We don't need to match months_between semantics. More precision is nice, but surely it almost never comes up anyway? I don't mind the change, as a result.
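
To gauge how much this matters for short intervals, here is a hedged Python sketch modelled on the `getDelayMs` change in this file. The function `get_delay_ms` and its parameters are hypothetical stand-ins for the `months` and `milliseconds` fields of `CalendarInterval`, not Spark's API.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000
MILLIS_PER_MONTH_OLD = 31 * MILLIS_PER_DAY  # "31 days per month" convention
MILLIS_PER_MONTH_NEW = 2629746 * 1000       # mean Gregorian month from this PR


def get_delay_ms(months, milliseconds, millis_per_month):
    """Convert an interval (months + milliseconds) to a delay in milliseconds."""
    return milliseconds + months * millis_per_month


# A "1 month" delay differs by roughly half a day between the two conventions.
old_delay = get_delay_ms(1, 0, MILLIS_PER_MONTH_OLD)
new_delay = get_delay_ms(1, 0, MILLIS_PER_MONTH_NEW)
print(old_delay, new_delay)
print((old_delay - new_delay) / (60 * 60 * 1000))  # difference in hours

# A delay with no month component, e.g. 10 seconds, is unaffected.
print(get_delay_ms(0, 10_000, MILLIS_PER_MONTH_OLD) ==
      get_delay_ms(0, 10_000, MILLIS_PER_MONTH_NEW))
```

Under these assumptions the change only affects intervals that actually carry a month component, which is consistent with the point that it rarely comes up for watermark delays.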

}
}

@@ -59,6 +59,15 @@ object DateTimeUtils {
final val MILLIS_PER_MINUTE: Long = 60 * MILLIS_PER_SECOND
final val MILLIS_PER_HOUR: Long = 60 * MILLIS_PER_MINUTE
final val MILLIS_PER_DAY: Long = SECONDS_PER_DAY * MILLIS_PER_SECOND
+ // The average year of the Gregorian calendar is 365.2425 days long, see
+ // https://en.wikipedia.org/wiki/Gregorian_calendar
+ // A leap year occurs every 4 years, except for years that are divisible by 100
+ // and not divisible by 400. So, the mean length of the Gregorian calendar year is:
+ // 1 mean year = (365 + 1/4 - 1/100 + 1/400) days = 365.2425 days
+ // The mean year length in seconds is:
+ // 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746
+ final val SECONDS_PER_MONTH: Int = 2629746
+ final val MILLIS_PER_MONTH: Long = SECONDS_PER_MONTH * MILLIS_PER_SECOND

// number of days between 1.1.1970 and 1.1.2001
final val to2001 = -11323
@@ -619,8 +628,7 @@ object DateTimeUtils {
val secondsInDay1 = MILLISECONDS.toSeconds(millis1 - daysToMillis(date1, timeZone))
val secondsInDay2 = MILLISECONDS.toSeconds(millis2 - daysToMillis(date2, timeZone))
val secondsDiff = (dayInMonth1 - dayInMonth2) * SECONDS_PER_DAY + secondsInDay1 - secondsInDay2
- val secondsInMonth = DAYS.toSeconds(31)
- val diff = monthDiff + secondsDiff / secondsInMonth.toDouble
+ val diff = monthDiff + secondsDiff / SECONDS_PER_MONTH.toDouble
if (roundOff) {
// rounding to 8 digits
math.round(diff * 1e8) / 1e8
@@ -488,13 +488,13 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
Literal(new Timestamp(sdf.parse("1997-02-28 10:30:00").getTime)),
Literal(new Timestamp(sdf.parse("1996-10-30 00:00:00").getTime)),
Literal.TrueLiteral,
- timeZoneId = timeZoneId), 3.94959677)
+ timeZoneId = timeZoneId), 3.94866424)
checkEvaluation(
MonthsBetween(
Literal(new Timestamp(sdf.parse("1997-02-28 10:30:00").getTime)),
Literal(new Timestamp(sdf.parse("1996-10-30 00:00:00").getTime)),
Literal.FalseLiteral,
- timeZoneId = timeZoneId), 3.9495967741935485)
+ timeZoneId = timeZoneId), 3.9486642436189654)

Seq(Literal.FalseLiteral, Literal.TrueLiteral). foreach { roundOff =>
checkEvaluation(
@@ -385,8 +385,8 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers {
test("monthsBetween") {
val date1 = date(1997, 2, 28, 10, 30, 0)
var date2 = date(1996, 10, 30)
- assert(monthsBetween(date1, date2, true, TimeZoneUTC) === 3.94959677)
- assert(monthsBetween(date1, date2, false, TimeZoneUTC) === 3.9495967741935485)
+ assert(monthsBetween(date1, date2, true, TimeZoneUTC) === 3.94866424)
+ assert(monthsBetween(date1, date2, false, TimeZoneUTC) === 3.9486642436189654)
Seq(true, false).foreach { roundOff =>
date2 = date(2000, 2, 28)
assert(monthsBetween(date1, date2, roundOff, TimeZoneUTC) === -36)
@@ -399,8 +399,8 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers {
val date3 = date(2000, 2, 28, 16, tz = TimeZonePST)
val date4 = date(1997, 2, 28, 16, tz = TimeZonePST)
assert(monthsBetween(date3, date4, true, TimeZonePST) === 36.0)
- assert(monthsBetween(date3, date4, true, TimeZoneGMT) === 35.90322581)
- assert(monthsBetween(date3, date4, false, TimeZoneGMT) === 35.903225806451616)
+ assert(monthsBetween(date3, date4, true, TimeZoneGMT) === 35.91993675)
+ assert(monthsBetween(date3, date4, false, TimeZoneGMT) === 35.919936754348136)
}

test("from UTC timestamp") {
@@ -18,9 +18,9 @@
package org.apache.spark.sql.execution.streaming

import java.sql.Date
- import java.util.concurrent.TimeUnit

import org.apache.spark.sql.catalyst.plans.logical.{EventTimeTimeout, ProcessingTimeTimeout}
+ import org.apache.spark.sql.catalyst.util.DateTimeUtils.MILLIS_PER_MONTH
import org.apache.spark.sql.execution.streaming.GroupStateImpl._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import org.apache.spark.unsafe.types.CalendarInterval
@@ -164,8 +164,7 @@ private[sql] class GroupStateImpl[S] private(
throw new IllegalArgumentException(s"Provided duration ($duration) is not positive")
}

- val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- cal.milliseconds + cal.months * millisPerMonth
+ cal.milliseconds + cal.months * MILLIS_PER_MONTH
Member

It's interesting since this change doesn't affect our tests.

}

private def checkTimeoutTimestampAllowed(): Unit = {
@@ -341,13 +341,13 @@ class DateFunctionsSuite extends QueryTest with SharedSparkSession {
val s2 = "2015-10-01 00:00:00"
val df = Seq((t1, d1, s1), (t2, d2, s2)).toDF("t", "d", "s")
checkAnswer(df.select(months_between(col("t"), col("d"))), Seq(Row(-10.0), Row(7.0)))
- checkAnswer(df.selectExpr("months_between(t, s)"), Seq(Row(0.5), Row(-0.5)))
- checkAnswer(df.selectExpr("months_between(t, s, true)"), Seq(Row(0.5), Row(-0.5)))
+ checkAnswer(df.selectExpr("months_between(t, s)"), Seq(Row(0.5092507), Row(-0.4907493)))
+ checkAnswer(df.selectExpr("months_between(t, s, true)"), Seq(Row(0.5092507), Row(-0.4907493)))
Seq(true, false).foreach { roundOff =>
checkAnswer(df.select(months_between(col("t"), col("d"), roundOff)),
Seq(Row(-10.0), Row(7.0)))
checkAnswer(df.withColumn("r", lit(false)).selectExpr("months_between(t, s, r)"),
- Seq(Row(0.5), Row(-0.5)))
+ Seq(Row(0.5092507032998624), Row(-0.49074929670013756)))
}
}
