Optimize date parsing for format yyyy-MM-dd#15548
Conversation
|
Can you share performance results? |
I benchmarked the performance on my local laptop, before the optimization, the QPS is about 3M, after optimization, it's about 45M, so basically it's 15 times faster. |
There was a problem hiding this comment.
Use OptionalInt as a return type, it's more explicit and avoids using magic Integer.MIN_VALUE as a wrong format marker.
There was a problem hiding this comment.
fixed in cb70c37e453480e7b5891e117f63796e1ac450fb
There was a problem hiding this comment.
What t stands for? Maybe rename it to digit?
There was a problem hiding this comment.
thanks, fixed in cb70c37e453480e7b5891e117f63796e1ac450fb
There was a problem hiding this comment.
how much benefit does the bit shifting brings over year = year * 10 + t?
There was a problem hiding this comment.
it turned out there is no observable perf difference, so changed to year = year * 10 + t for readability in cb70c37e453480e7b5891e117f63796e1ac450fb
There was a problem hiding this comment.
fixed in cb70c37e453480e7b5891e117f63796e1ac450fb
There was a problem hiding this comment.
should tryParseDateYyyyMmDd support a negative year? e.g. -1970-01-01
also, add more test cases: e.g. min, max date that can be parsed, year zero, 29 feb leap year etc
There was a problem hiding this comment.
tryParseDateYyyyMmDd will skip format like -1970-01-01, and will fall back to the original implementation.
Anyway, I will add more test cases as you suggested
There was a problem hiding this comment.
more test cases added in cb70c37e453480e7b5891e117f63796e1ac450fb
There was a problem hiding this comment.
Why not: int year = Integer.parseInt(value, 0, 4, 10) ? Same when parsing month and day, I suspect the Integer.parseInt(CharSequence, beginIndex, endIndex, radix) overload will perform similarly or better to hand-rolling here but I could be wrong.
There was a problem hiding this comment.
Integer.parseInt is about 30% slower, probably because there are more checks in that function.
There was a problem hiding this comment.
I am surprised we can do Integer.parseInt faster than Java. is their implementation just suboptimal?
in any case, it's feels better to encapsulate this in a utility method and use that for year, month and day. The code would be much more readable.
|
@pettyjamesm @lukasz-stec would you please take a look again? All comments addressed. |
There was a problem hiding this comment.
How about using java.util.OptionalInt as a return type? it would make it easier for jit to optimize code (avoid boxed type allocation etc.).
There was a problem hiding this comment.
thanks for the suggestion. Updated to OptionalInt in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9
There was a problem hiding this comment.
I wouldn't reuse this local variable. You can instead have something like
int monthHigh = value.charAt(5) - '0';
...
int monthLow = value.charAt(6) - '0';
...
int dayHigh = value.charAt(8) - '0';
...
int dayLow = value.charAt(9) - '0';
There was a problem hiding this comment.
updated in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9
There was a problem hiding this comment.
Add an invalid case for every if statement in the tryParseDateYyyyMmDd e.g. 197A-01-01, 197#-01-01 so that all code paths are covered.
There was a problem hiding this comment.
a test case added for each code path in 921b9c23e9894fdc57a581a29cb5d6aee89a99f9
There was a problem hiding this comment.
why do we throw for this case and not for others above like 'Dec 24 2022'?
There was a problem hiding this comment.
2022-02-29 has a valid format, but the value does not exist. This kind of error is hard to detect in our code, so delegate to LocalDate.of().
for case like Dec 24 2022, its format is wrong, and we can easily detect it in our code, we return OptionalInt.empty(), and delegate the behavior to the original implementation, to be backward compatible as much as possible..
There was a problem hiding this comment.
is there a case where
if (monthLow < 0 || monthLow > 9) {
return OptionalInt.empty();
}
or
if (value.charAt(4) != '-') {
return OptionalInt.empty();
}
would be parsed correctly by DATE_FORMATTER.parseMillis?
Looks to me as if there is not. If that's the case, why not throw DateTimeException from tryParseDateYyyyMmDd if it's gonna fail anyway?
There was a problem hiding this comment.
DATE_FORMATTER.parseMillis support format other than yyyy-MM-DD, e.g. 99-12-31 and 1970-2-01.
Lets take 1970-2-01 for example, the monthLow would be the '-', and if (monthLow < 0 || monthLow > 9) returns true, but we can't say it's a invalid format, and throw here.
So we just delegate to DATE_FORMATTER.parseMillis for format other than yyyy-MM-DD
There was a problem hiding this comment.
You do check the length at the beginning but I guess with a possible negative year this could still be confused e.g. -1111-1-01.
Given #10677 we actually don't need to support negative years but I think it's better to separate those changes (perf improvement and backward incompatible change).
lukasz-stec
left a comment
There was a problem hiding this comment.
generally lgtm % benchmark and small comments
There was a problem hiding this comment.
You do check the length at the beginning but I guess with a possible negative year this could still be confused e.g. -1111-1-01.
Given #10677 we actually don't need to support negative years but I think it's better to separate those changes (perf improvement and backward incompatible change).
There was a problem hiding this comment.
Could you add a jmh benchmark for this similar to other benchmarks here e.g. io.trino.operator.scalar.BenchmarkArrayEqualOperator and post results before/after in the commit message and in the pr description?
There was a problem hiding this comment.
you can move this variable inside the loop now
There was a problem hiding this comment.
done in 72e06703cbffd04afc38670c2f73c497db9c748b
There was a problem hiding this comment.
using a compile-time constant may not be a realistic benchmark (jit can do some tricks with constants).
I would suggest generating some number of dates to parse in the @setup and iterate over them if this method.
Another thing is that the result of the parseDate is ignored. This can make the compiler drop the whole computation (dead code elimination). To fix this you can use org.openjdk.jmh.infra.Blackhole to "consume" output of every parseDate
There was a problem hiding this comment.
done in f19d4a7cab505376b7162119a89dc8a4169b2815
There was a problem hiding this comment.
Is the warmup time enough to give time to jit to compile the code?
There was a problem hiding this comment.
extended both warmup time and measure time, results are quite stable. Changed in f19d4a7cab505376b7162119a89dc8a4169b2815
lukasz-stec
left a comment
There was a problem hiding this comment.
One last thing to do is to clean up commits. I would put the commit that adds benchmark first and squash the rest in the second commit
|
hi, got some issue to reorder the commits. I touched the tryParseDateYyyyMmDd() function in the commit that adds benchmark (in 72e06703cbffd04afc38670c2f73c497db9c748b, move curDigit into the loop). If I put the commit that adds benchmark first, them there is conflicts, as that time there is no tryParseDateYyyyMmDd() yet. So how about let's not swap the order, leave add benchmark commit the last one? |
f19d4a7 to
28935a9
Compare
|
I just squashed the 6 commits into 2:
|
Small comment. |
Benchmark Mode Cnt Score Error Units BenchmarkParseDate.parseDate thrpt 10 48.887 ± 3.222 ops/ms
28935a9 to
5295033
Compare
|
ah I got your point why the order is important. Done |
There was a problem hiding this comment.
would it be correct to call it iso8861DateFormat?
There was a problem hiding this comment.
updated in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0
There was a problem hiding this comment.
remove, reuse io.trino.spi.type.Timestamps#MILLISECONDS_PER_DAY
There was a problem hiding this comment.
done in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0
There was a problem hiding this comment.
renamed in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0
There was a problem hiding this comment.
I am surprised we can do Integer.parseInt faster than Java. is their implementation just suboptimal?
in any case, it's feels better to encapsulate this in a utility method and use that for year, month and day. The code would be much more readable.
There was a problem hiding this comment.
If you also validate that it is 4 digits, then hyphen, then two digits, then hyphen, then two digits, the following code will need to further checks, and be much simpler
There was a problem hiding this comment.
checking the format at the beginning was our initial implementation, but it turned out to be 25% slower, because that way we need to call String.charAt() more. We need to call it during checking and parsing.
Though we can store the chars in an array when do format checking, but that also make code a little longer, and more GC pressure, since it needs to allocate an array each time.
below is the detailed benchmark result:
// mixing format check and parsing (current implementation)
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 742.748 ± 40.368 ops/ms
// check format at the begining
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 561.297 ± 29.083 ops/ms
There was a problem hiding this comment.
This method should be private
if you want to test it directly, make it package-private and annotate with @VisibleForTesting
There was a problem hiding this comment.
done in d6f05bd8ec6e41ee32e70f8d6cfb128e447d74b0
There was a problem hiding this comment.
| private static OptionalInt parseIntSimple(String str, int start, int len) { | |
| private static OptionalInt parseIntSimple(String str, int start, int len) | |
| { |
There was a problem hiding this comment.
fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
There was a problem hiding this comment.
"len number of chars to parse, and must >= 1" so throwing IAE would be appropriate
There was a problem hiding this comment.
let's keep this check, otherwise, for len==0, it will return OptionalInt.of(0) without throwing any exception, while OptionalInt.empty() is expected
There was a problem hiding this comment.
You can initialize result to 0 and then you can start for loop with i=0, and have 3 fewer lines
There was a problem hiding this comment.
done in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
There was a problem hiding this comment.
spell out: currentDigit (or just digit)
There was a problem hiding this comment.
fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
There was a problem hiding this comment.
Please follow https://github.com/airlift/codestyle to configure your IDE so that the code is well formatted.
There was a problem hiding this comment.
| * parse positive integer with radix 10, with no null or length check. | |
| * Parse positive integer with radix 10, with no null or length check. | |
| * |
There was a problem hiding this comment.
fixed in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
There was a problem hiding this comment.
Arguably this would be a bit more readable
if (value.length() != 10 || value.charAt(4) != '-' || value.charAt(7) != '-') {
return OptionalInt.empty();
}
OptionalInt year = parseIntSimple(value, 0, 4);
if (year.isEmpty()) {
return OptionalInt.empty();
}
OptionalInt month = parseIntSimple(value, 5, 2);
if (month.isEmpty()) {
return OptionalInt.empty();
}
OptionalInt day = parseIntSimple(value, 8, 2);
if (day.isEmpty()) {
return OptionalInt.empty();
}
LocalDate date = LocalDate.of(year.getAsInt(), month.getAsInt(), day.getAsInt());
return OptionalInt.of(toIntExact(date.toEpochDay()));- dashes are checked upfront, so they don't distract flow later
// year,// monthcomments removed; variables names are meanigfulldrenamed todate. https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#avoid-abbreviations
There was a problem hiding this comment.
applied in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
There was a problem hiding this comment.
- use
assertThat(exp).isEmpty()- no need for re-assigned every time variable, each test is own independent line (or lines), exception message would include value contents if assertion fails
- you can import parseIfIso8861DateFormat statically
assertThat(parseIfIso8861DateFormat("Dec 24 2022")).isEmpty();
There was a problem hiding this comment.
done in 9db1745425af8978c7dd0cfc4f3793cb610e1a33
Optimized the date parsing if it is in the format yyyy-MM-dd. Below are the benchmark result before and after the change: After: Benchmark Mode Cnt Score Error Units BenchmarkParseDate.parseDate thrpt 10 762.769 ± 35.198 ops/ms Before: Benchmark Mode Cnt Score Error Units BenchmarkParseDate.parseDate thrpt 10 48.887 ± 3.222 ops/ms
9db1745 to
d3cf9c6
Compare
| private static final int MILLISECONDS_IN_HOUR = 60 * MILLISECONDS_IN_MINUTE; | ||
| private static final int MILLISECONDS_IN_DAY = 24 * MILLISECONDS_IN_HOUR; | ||
| private static final int PIVOT_YEAR = 2020; // yy = 70 will correspond to 1970 but 69 to 2069 | ||
| private static final Slice ISO_8861_DATE_FORMAT = Slices.utf8Slice("%Y-%m-%d"); |
Description
Optimize the performance of date parsing if the date is in standard format (yyyy-MM-dd).
The following functions are impacted:
After:
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 762.769 ± 35.198 ops/ms
Before:
Benchmark Mode Cnt Score Error Units
BenchmarkParseDate.parseDate thrpt 10 48.887 ± 3.222 ops/ms
Additional context and related issues
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: