-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comprehensive validation π, 30+ fixes integrated/added π¨π, optimized performance π #159
base: master
Are you sure you want to change the base?
Conversation
* Don't just assume we were given one of the valid formats. * Also consolidate the parsing states that occur after timePeriod. * Add subtests to make it easier to see what fails. * Additional tests for 4-char timezone names. * Fix araddon#117 * Fix araddon#150 * Fix araddon#157 * Fix araddon#145 * Fix araddon#108 * Fix araddon#137 * Fix araddon#130 * Fix araddon#123 * Fix araddon#109 * Fix araddon#98 * Addresses bug in araddon#100 (comment) Adds test cases to verify the following are already fixed: * araddon#94
Incorporates PR araddon#133 from https://github.com/mehanizm to fix araddon#129 Adds test cases to verify the following are already fixed: * araddon#105
Uses a memory pool for parser struct and format []byte Uses a new go 1.20 feature to avoid allocations for []byte to string conversions in allowable cases. go 1.20 also fixes a go bug for parsing fractional sec after a comma, so we can eliminate a workaround. The remaining allocations are mostly unavoidable (e.g., time.Parse constructing a FixedZone location or part to strings.ToLower). Results show an 89% reduction in allocated bytes for the big benchmark cases, and for some formats an allocation can be avoided entirely. There is also a resulting 26% speedup in ns/op. Details: BEFORE: cpu: 12th Gen Intel(R) Core(TM) i7-1255U BenchmarkShotgunParse-12 19448 B/op 474 allocs/op BenchmarkParseAny-12 4736 B/op 42 allocs/op BenchmarkBigShotgunParse-12 1075049 B/op 24106 allocs/op BenchmarkBigParseAny-12 241422 B/op 2916 allocs/op BenchmarkBigParseIn-12 244195 B/op 2984 allocs/op BenchmarkBigParseRetryAmbiguous-12 260751 B/op 3715 allocs/op BenchmarkShotgunParseErrors-12 67080 B/op 1679 allocs/op BenchmarkParseAnyErrors-12 15903 B/op 200 allocs/op AFTER: BenchmarkShotgunParse-12 19448 B/op 474 allocs/op BenchmarkParseAny-12 48 B/op 2 allocs/op BenchmarkBigShotgunParse-12 1075049 B/op 24106 allocs/op BenchmarkBigParseAny-12 25394 B/op 824 allocs/op BenchmarkBigParseIn-12 28165 B/op 892 allocs/op BenchmarkBigParseRetryAmbiguous-12 37880 B/op 1502 allocs/op BenchmarkShotgunParseErrors-12 67080 B/op 1679 allocs/op BenchmarkParseAnyErrors-12 3851 B/op 117 allocs/op
Previously, for ambiguous date strings, it was always calling parse twice even when the first parse would have been successful. Refactor so that parsing isn't re-attempted unless the first parse fails ambiguously. Benchmark results show that with RetryAmbiguousDateWithSwap(true), it's now about 6.5% faster (ns/op) and reduces allocated bytes by 3.4%.
Optimize the common and special case where mm and dd are the same length, just swap in place. Avoids having to reparse the entire string. For this case, it's about 30% faster and reduces allocations by about 15%. This format is especially common, hence the reason to optimize for this case. Also fix the case for ambiguous date/time in the mm:dd:yyyy format.
Audit every stateDate so every unexpected alternative will fail. In the process, fixed some newly found bugs: * Extend format yyyy-mon-dd to allow times to follow it. Also allow full month name. * Allow full day name before month (e.g., Monday January 4th, 2017) Relevant confirmatory test cases were added.
New option SimpleErrorMessages that avoids allocation in the error path. It's off by default to preserve backwards compatibility. Added benchmark BenchmarkBigParseAnyErrors that takes the big set of test cases, and injects errors to make them fail at pseudo-random places. This optimization speeds up the error path runtime by 4x and reduces error path allocation bytes by 13x!
Reduces CPU usage on large benchmarks by ~2%-3% and prepares for future with international month names in future.
Audited all test cases to make sure an example was listed for all known formats.
Options were not being properly passed to recursive parseTime call.
Great work @klondikedragon |
this is great work @klondikedragon! Now, this repo hasn't seen much movement in years, do you think we should start using a fork? should we use yours? |
I would vote that we use his, we should see if it qualifies for https://github.com/avelino/awesome-go |
This is implemented now using the "skip" parser field, indicating to skip the first N characters. This also avoids a recursive parse in one more case (more efficient). This simplifies the state machine a little bit, while the rest of the code needs to properly account for the value of the skip field. Also allow whitespace prefix without penalty. Modify the test suite to psuedo-randomly add a weekday prefix to the formats that allow it (all except the purely numeric ones).
In some further testing, I found that weekday prefixes only worked for some date formats, but not for others. So that is fixed now. As a side effect (benefit?), leading whitespace is now allowed/ignored. Let's see if @araddon has feedback and/or is interested in merging this PR (it's a pretty big change and changes the philosophy a bit to have validation, and also makes the code a little more complex in favor of performance). The changes are large enough now it could break backwards compatibility, so in the very least it should deserve a new major version IMO. Although I don't want to fork something lightly, since we haven't heard any feedback from @araddon for a few years, it could definitely make sense to go ahead. If there is no comment after the holidays, I think it would make sense to go ahead and fork. This package is a key part of freeform date parsing in the "automatic structured field extraction" logic being built for IT Lightning (this new cloud-based log management platform I'm building). Given that's the case, the IT Lightning org would be willing to maintain the forked repo and work with community issues/contributions, since we're motivated to have best-in-class date recognition & parsing in the log ingestion pipeline. The license would remain the same of course. The community contributions would help us improve our date parsing, we'd be motivated to put energy into it to keep our date parsing bug-free and comprehensive, and community use of the package might help us get a little exposure to devs/SREs who might become interested in our log management solution. So it should benefit everyone. All feedback is welcome. What do ya'll think of this proposed plan? |
* Merge duplicate states (fixes lots of edge cases) * Support for +00:00 is consistent with +0000 now * Support (timezone description) after any offset/name * Update tests to cover positive/negative cases * Update example with new supported formats
Fully support the format where a TZ name is in parentheses after the time (and possibly after an offset). This fixes the broken case where a 4 character TZ name was in parentheses after a time.
great work @klondikedragon . How can i start using this? |
I'll go ahead and fork this package. I'm renaming the main branch as part of that. |
The fork is complete and published as The fork is available using @elliot40404 @arran4 @jmdacruz -- see what you think and how this updated package works! If this looks good and after incorporating feedback, I think I'll publish a v1.0.0 at some point soon. I'm also curious to get feedback on my log management project too, check out the site/discord if you're interested. Thanks! |
This package is amazing and hugely popular, and has been the best package for automatic date parsing in go for years! β
Thanks @araddon for crafting this package with love over the years!!
I've been using this while developing a new cloud-based log aggregation/search/visualization product, and I've found that there are three major opportunities for improvement for my particular use case:
This PR addresses all 3 opportunities:
SimpleErrorMessages
option was added (off by default) that greatly speeds up the case where a string does not match a known format -- with the option on, this case is now 4x faster and produces almost no allocations.In the process of going through the state machine comprehensively for validation, redundant code/states were merged, and support was added for certain edge cases (for example, some date formats did not support being followed by times).
The example and README.md were updated to incorporate all of the newly supported formats and edge cases. More details on how to properly interpret returned location information with respect to abbreviated timezones was added.
BREAKING -- the package now requires go >= 1.20 to support memory optimizations converting from
[]byte
tostring
in key places.A huge thanks to all who posted issues and contributed PRs -- while the PRs were unable to be merged directly because the validation changes were so major, the ideas of all these contributions and the associated test cases were incorporated. Here's credit for all of the issues fixes and contributions in this PR as well as a summary of additional fixes added:
2015-06-10 00:00:00 GMT+02:00
Β #78 (support for GMT+offset, support additional whitespace in certain places, and other changes)"1.jpg"
incorrectly parsed as valid - fixed by comprehensive validationThu Jan 28 2021 15:28:21 GMT+0000
formatyyyy mon dd
format - adapt Add support for dates of the form "yyyy mmm dd" where mmm is alphaΒ #142 by @dferstayyyyymmddhhmmss.SSS
format - adapt support combined datetime format with subseconds (yyyyMMddhhmmss.SSS)Β #144 by @dferstay2017-04-03 22:32:14.322 CET
)Sun, 07 Jun 2020 00:00:00 +0100
format1 April 2022 23:59
format (time after certain date formats)mm.dd.yyyy (time)
- adapted fix problem with date format 02.01.2006Β #133 by @mehanizm (also fixes Better support for dot-format (e.g. 13.1.2009) with ambiguousMMDDΒ #91 and Request: Support for DD/MM/YYYYΒ #28)PMDT
andAMT
time zones and validate that AM/PM indicators only appear at most oncedd[th,nd,st,rd] Month yyyy
format - adapt New date format 1st November 2020Β #128 by @krhubert(time) UTC[+-]NNNN
mm/dd/yyyy, hh:mm:ss
format - adapt added comma format like04/2/2014, 03:00:37
Β #156 by @BrianLeishmanyyyy.mm.dd (time)
format - adapt Bug parsing 2014.02.13 00:00:00 ?Β #134 by @jmdacruz, and add cases expected to fail toTestParseErrors
unit testThu Apr 7 15:13:13 2005 -0700
) - adapt commit 99d9682 from Add support for git date formatΒ #92 by @jiangxin (merge timeWsYearOffset case and add validation)dd-mon-yyyy::hh:mm:ss
) - adapt rabbitmq log datetime supportΒ #122 by @bizy01mon/dd/yyyy
format, e.g.,Oct/31/1970
dd-month-year
formatyyyy-mon-dd
to allow times to follow it. Also allow full month name instead of just abbreviated.mm:dd:yyyy
formatMonday January 4th, 2017
)mm.dd.yyyy (time)
formatmm/dd
formats that start with a weekdayAlso adds tests to verify that the following stay fixed:
190910 11:51:49