Skip to content

Optimize date parsing for format yyyy-MM-dd#15548

Merged
findepi merged 2 commits intotrinodb:masterfrom
ubyyj:optimize_standard_date_parsing
Jan 17, 2023
Merged

Optimize date parsing for format yyyy-MM-dd#15548
findepi merged 2 commits intotrinodb:masterfrom
ubyyj:optimize_standard_date_parsing

Conversation

@ubyyj
Copy link
Contributor

@ubyyj ubyyj commented Dec 29, 2022

Description

Optimize the performance of date parsing if the date is in standard format (yyyy-MM-dd).
The following functions are impacted:

date(str)
cast(str as date)
date_parse(str, '%Y-%m-%D')

After:
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 762.769 ± 35.198 ops/ms

Before:
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 48.887 ± 3.222 ops/ms

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`15229`)

@xumingming
Copy link
Contributor

xumingming commented Dec 29, 2022

Can you share performance results?

@ubyyj
Copy link
Contributor Author

ubyyj commented Dec 30, 2022

Can you have performance results?

I benchmarked the performance on my local laptop, before the optimization, the QPS is about 3M, after optimization, it's about 45M, so basically it's 15 times faster.

@findepi findepi requested review from lukasz-stec, pettyjamesm and skrzypo987 and removed request for findepi January 2, 2023 08:05
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use OptionalInt as a return type, it's more explicit and avoids using magic Integer.MIN_VALUE as a wrong format marker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in cb70c37e453480e7b5891e117f63796e1ac450fb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What t stands for? Maybe rename it to digit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed in cb70c37e453480e7b5891e117f63796e1ac450fb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much benefit does the bit shifting brings over year = year * 10 + t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turned out there is no observable perf difference, so changed to year = year * 10 + t for readability in cb70c37e453480e7b5891e117f63796e1ac450fb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import java.time.LocalDate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in cb70c37e453480e7b5891e117f63796e1ac450fb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should tryParseDateYyyyMmDd support a negative year? e.g. -1970-01-01
also, add more test cases: e.g. min, max date that can be parsed, year zero, 29 feb leap year etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryParseDateYyyyMmDd will skip format like -1970-01-01, and will fall back to the original implementation.
Anyway, I will add more test cases as you suggested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more test cases added in cb70c37e453480e7b5891e117f63796e1ac450fb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not: int year = Integer.parseInt(value, 0, 4, 10) ? Same when parsing month and day, I suspect the Integer.parseInt(CharSequence, beginIndex, endIndex, radix) overload will perform similarly or better to hand-rolling here but I could be wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer.parseInt is about 30% slower, probably because there are more checks in that function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised we can do Integer.parseInt faster than Java. is their implementation just suboptimal?
in any case, it's feels better to encapsulate this in a utility method and use that for year, month and day. The code would be much more readable.

@ubyyj
Copy link
Contributor Author

ubyyj commented Jan 8, 2023

@pettyjamesm @lukasz-stec would you please take a look again? All comments addressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using java.util.OptionalInt as a return type? it would make it easier for jit to optimize code (avoid boxed type allocation etc.).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion. Updated to OptionalInt in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't reuse this local variable. You can instead have something like

int monthHigh = value.charAt(5) - '0';
...
int monthLow = value.charAt(6) - '0';
...
int dayHigh = value.charAt(8) - '0';
...
int dayLow = value.charAt(9) - '0';

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an invalid case for every if statement in the tryParseDateYyyyMmDd e.g. 197A-01-01, 197#-01-01 so that all code paths are covered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a test case added for each code path in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we throw for this case and not for others above like 'Dec 24 2022'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2022-02-29 has a valid format, but the value does not exist. This kind of error is hard to detect in our code, so delegate to LocalDate.of().

for case like Dec 24 2022, its format is wrong, and we can easily detect it in our code, we return OptionalInt.empty(), and delegate the behavior to the original implementation, to be backward compatible as much as possible..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a case where

        if (monthLow < 0 || monthLow > 9) {
            return OptionalInt.empty();
        }

or

    if (value.charAt(4) != '-') {
            return OptionalInt.empty();
        }

would be parsed correctly by DATE_FORMATTER.parseMillis?
Looks to me as if there is not. If that's the case, why not throw DateTimeException from tryParseDateYyyyMmDd if it's gonna fail anyway?

Copy link
Contributor Author

@ubyyj ubyyj Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DATE_FORMATTER.parseMillis support format other than yyyy-MM-DD, e.g. 99-12-31 and 1970-2-01.
Lets take 1970-2-01 for example, the monthLow would be the '-', and if (monthLow < 0 || monthLow > 9) returns true, but we can't say it's a invalid format, and throw here.
So we just delegate to DATE_FORMATTER.parseMillis for format other than yyyy-MM-DD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do check the length at the beginning but I guess with a possible negative year this could still be confused e.g. -1111-1-01.
Given #10677 we actually don't need to support negative years but I think it's better to separate those changes (perf improvement and backward incompatible change).

Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm % benchmark and small comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do check the length at the beginning but I guess with a possible negative year this could still be confused e.g. -1111-1-01.
Given #10677 we actually don't need to support negative years but I think it's better to separate those changes (perf improvement and backward incompatible change).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a jmh benchmark for this similar to other benchmarks here e.g. io.trino.operator.scalar.BenchmarkArrayEqualOperator and post results before/after in the commit message and in the pr description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this variable inside the loop now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 72e06703cbffd04afc38670c2f73c497db9c748b

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a compile-time constant may not be a realistic benchmark (jit can do some tricks with constants).
I would suggest generating some number of dates to parse in the @setup and iterate over them if this method.

Another thing is that the result of the parseDate is ignored. This can make the compiler drop the whole computation (dead code elimination). To fix this you can use org.openjdk.jmh.infra.Blackhole to "consume" output of every parseDate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in f19d4a7cab505376b7162119a89dc8a4169b2815

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the warmup time enough to give time to jit to compile the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extended both warmup time and measure time, results are quite stable. Changed in f19d4a7cab505376b7162119a89dc8a4169b2815

Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing to do is to clean up commits. I would put the commit that adds benchmark first and squash the rest in the second commit

@ubyyj
Copy link
Contributor Author

ubyyj commented Jan 12, 2023

hi, got some issue to reorder the commits. I touched the tryParseDateYyyyMmDd() function in the commit that adds benchmark (in 72e06703cbffd04afc38670c2f73c497db9c748b, move curDigit into the loop). If I put the commit that adds benchmark first, them there is conflicts, as that time there is no tryParseDateYyyyMmDd() yet.

So how about let's not swap the order, leave add benchmark commit the last one?

@skrzypo987 skrzypo987 removed their request for review January 12, 2023 07:14
@ubyyj ubyyj force-pushed the optimize_standard_date_parsing branch from f19d4a7 to 28935a9 Compare January 12, 2023 07:49
@ubyyj
Copy link
Contributor Author

ubyyj commented Jan 12, 2023

I just squashed the 6 commits into 2:

  1. logic code change;
  2. add benchmark

@lukasz-stec
Copy link
Member

lukasz-stec commented Jan 12, 2023

I just squashed the 6 commits into 2:

  1. logic code change;
  2. add benchmark

Small comment.
Could you move the "[Add benchmark for DateTimeUtils.parseDate]" commit so that it is first? This way one can checkout this commit and run the benchmark "before" easily.
I would also move the before/after benchmark results to the other commit message as this is the commit that introduces the "after"

Benchmark                      Mode  Cnt   Score   Error   Units
BenchmarkParseDate.parseDate  thrpt   10  48.887 ± 3.222  ops/ms
@ubyyj ubyyj force-pushed the optimize_standard_date_parsing branch from 28935a9 to 5295033 Compare January 12, 2023 10:10
@ubyyj
Copy link
Contributor Author

ubyyj commented Jan 12, 2023

ah I got your point why the order is important. Done

Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
@findepi @sopel39 this is ready for maintainer review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be correct to call it iso8861DateFormat?

Copy link
Contributor Author

@ubyyj ubyyj Jan 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove, reuse io.trino.spi.type.Timestamps#MILLISECONDS_PER_DAY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseIfIso8861DateFormat?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised we can do Integer.parseInt faster than Java. is their implementation just suboptimal?
in any case, it's feels better to encapsulate this in a utility method and use that for year, month and day. The code would be much more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you also validate that it is 4 digits, then hyphen, then two digits, then hyphen, then two digits, the following code will need to further checks, and be much simpler

Copy link
Contributor Author

@ubyyj ubyyj Jan 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking the format at the beginning was our initial implementation, but it turned out to be 25% slower, because that way we need to call String.charAt() more. We need to call it during checking and parsing.
Though we can store the chars in an array when do format checking, but that also make code a little longer, and more GC pressure, since it needs to allocate an array each time.

below is the detailed benchmark result:
// mixing format check and parsing (current implementation)
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 742.748 ± 40.368 ops/ms

// check format at the begining
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 561.297 ± 29.083 ops/ms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be private

if you want to test it directly, make it package-private and annotate with @VisibleForTesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static OptionalInt parseIntSimple(String str, int start, int len) {
private static OptionalInt parseIntSimple(String str, int start, int len)
{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"len number of chars to parse, and must >= 1" so throwing IAE would be appropriate

Copy link
Contributor Author

@ubyyj ubyyj Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep this check, otherwise, for len==0, it will return OptionalInt.of(0) without throwing any exception, while OptionalInt.empty() is expected

Comment on lines 131 to 135
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can initialize result to 0 and then you can start for loop with i=0, and have 3 fewer lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spell out: currentDigit (or just digit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow https://github.com/airlift/codestyle to configure your IDE so that the code is well formatted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* parse positive integer with radix 10, with no null or length check.
* Parse positive integer with radix 10, with no null or length check.
*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably this would be a bit more readable

if (value.length() != 10 || value.charAt(4) != '-' || value.charAt(7) != '-') {
    return OptionalInt.empty();
}

OptionalInt year = parseIntSimple(value, 0, 4);
if (year.isEmpty()) {
    return OptionalInt.empty();
}

OptionalInt month = parseIntSimple(value, 5, 2);
if (month.isEmpty()) {
    return OptionalInt.empty();
}

OptionalInt day = parseIntSimple(value, 8, 2);
if (day.isEmpty()) {
    return OptionalInt.empty();
}

LocalDate date = LocalDate.of(year.getAsInt(), month.getAsInt(), day.getAsInt());
return OptionalInt.of(toIntExact(date.toEpochDay()));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • use assertThat(exp).isEmpty()
    • no need for re-assigned every time variable, each test is own independent line (or lines), exception message would include value contents if assertion fails
  • you can import parseIfIso8861DateFormat statically
assertThat(parseIfIso8861DateFormat("Dec 24 2022")).isEmpty();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 9db1745425af8978c7dd0cfc4f3793cb610e1a33

Optimized the date parsing if it is in the format yyyy-MM-dd.

Below are the benchmark result before and after the change:

After:
Benchmark                      Mode  Cnt    Score    Error   Units
BenchmarkParseDate.parseDate  thrpt   10  762.769 ± 35.198  ops/ms

Before:
Benchmark                      Mode  Cnt   Score   Error   Units
BenchmarkParseDate.parseDate  thrpt   10  48.887 ± 3.222  ops/ms
@findepi findepi force-pushed the optimize_standard_date_parsing branch from 9db1745 to d3cf9c6 Compare January 17, 2023 10:08
@findepi findepi merged commit 0873b06 into trinodb:master Jan 17, 2023
@findepi findepi mentioned this pull request Jan 17, 2023
@github-actions github-actions bot added this to the 406 milestone Jan 17, 2023
private static final int MILLISECONDS_IN_HOUR = 60 * MILLISECONDS_IN_MINUTE;
private static final int MILLISECONDS_IN_DAY = 24 * MILLISECONDS_IN_HOUR;
private static final int PIVOT_YEAR = 2020; // yy = 70 will correspond to 1970 but 69 to 2069
private static final Slice ISO_8861_DATE_FORMAT = Slices.utf8Slice("%Y-%m-%d");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean ISO 8601 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ups.. thanks!

#15756

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants