Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comprehensive validation πŸ”Ž, 30+ fixes integrated/added πŸ”¨πŸ›, optimized performance πŸš€ #159

Open
wants to merge 62 commits into
base: master
Choose a base branch
from

Conversation

klondikedragon
Copy link

This package is amazing and hugely popular, and has been the best package for automatic date parsing in go for years! ⭐

Thanks @araddon for crafting this package with love over the years!!

I've been using this while developing a new cloud-based log aggregation/search/visualization product, and I've found that there are three major opportunities for improvement for my particular use case:

  • The package does not strictly validate its input, leading to many false positives. This is OK if you know for sure the input matches one of the known formats, but cannot be trusted if the input could be anything and you only want a returned date/time if it definitely matches a known format.
  • While still being far more efficient than the "shotgun" parsing approach, the package currently allocates a relatively large amount of memory (several times the average input size), which can add up when parsing megabytes of date strings per second in a high-throughput microservice. It can also be relatively slow when parsing a string that doesn't match a known format and can allocate even more memory in this case, due to custom error messages that include contextual details.
  • There are a lot of unmerged one-off contributions that haven't been merged and need to be made coherent with each other.

This PR addresses all 3 opportunities:

  • Validation: it now comprehensively validates the input in the following ways: at each point in the state machine, it makes sure there are cases for all possibilities, and any invalid possibility will fail; additionally, it makes sure that the entire format string was specifically set (excluding any trailing punctuation, which can be safely ignored). False positives should be extremely rare now (hard to prove they don't exist).
  • Memory Efficiency: bytes allocated have been reduced by 90%, and parsing some formats are zero allocation. The SimpleErrorMessages option was added (off by default) that greatly speeds up the case where a string does not match a known format -- with the option on, this case is now 4x faster and produces almost no allocations.
  • Merge & Integrate Community Fixes: Fixes for nearly all of the pending issues from the community (including open pull requests) have been incorporated and adapted.

In the process of going through the state machine comprehensively for validation, redundant code/states were merged, and support was added for certain edge cases (for example, some date formats did not support being followed by times).

The example and README.md were updated to incorporate all of the newly supported formats and edge cases. More details on how to properly interpret returned location information with respect to abbreviated timezones was added.

BREAKING -- the package now requires go >= 1.20 to support memory optimizations converting from []byte to string in key places.

A huge thanks to all who posted issues and contributed PRs -- while the PRs were unable to be merged directly because the validation changes were so major, the ideas of all these contributions and the associated test cases were incorporated. Here's credit for all of the issues fixes and contributions in this PR as well as a summary of additional fixes added:

Also adds tests to verify that the following stay fixed:

arran4 and others added 30 commits February 15, 2023 15:40
* Don't just assume we were given one of the valid formats.
* Also consolidate the parsing states that occur after timePeriod.
* Add subtests to make it easier to see what fails.
* Additional tests for 4-char timezone names.
* Fix araddon#117
* Fix araddon#150
* Fix araddon#157
* Fix araddon#145
* Fix araddon#108
* Fix araddon#137
* Fix araddon#130
* Fix araddon#123
* Fix araddon#109
* Fix araddon#98
* Addresses bug in araddon#100 (comment)

Adds test cases to verify the following are already fixed:
* araddon#94
Incorporates PR araddon#133 from https://github.com/mehanizm to fix araddon#129

Adds test cases to verify the following are already fixed:
* araddon#105
Uses a memory pool for parser struct and format []byte

Uses a new go 1.20 feature to avoid allocations for []byte to string conversions in allowable cases.

go 1.20 also fixes a go bug for parsing fractional sec after a comma, so we can eliminate a workaround.

The remaining allocations are mostly unavoidable (e.g., time.Parse constructing a FixedZone location or part to strings.ToLower).

Results show an 89% reduction in allocated bytes for the big benchmark cases, and for some formats an allocation can be avoided entirely.

There is also a resulting 26% speedup in ns/op.

Details:

BEFORE:

cpu: 12th Gen Intel(R) Core(TM) i7-1255U
BenchmarkShotgunParse-12               19448 B/op        474 allocs/op
BenchmarkParseAny-12                    4736 B/op         42 allocs/op
BenchmarkBigShotgunParse-12          1075049 B/op      24106 allocs/op
BenchmarkBigParseAny-12               241422 B/op       2916 allocs/op
BenchmarkBigParseIn-12                244195 B/op       2984 allocs/op
BenchmarkBigParseRetryAmbiguous-12    260751 B/op       3715 allocs/op
BenchmarkShotgunParseErrors-12         67080 B/op       1679 allocs/op
BenchmarkParseAnyErrors-12             15903 B/op        200 allocs/op

AFTER:

BenchmarkShotgunParse-12               19448 B/op        474 allocs/op
BenchmarkParseAny-12                      48 B/op          2 allocs/op
BenchmarkBigShotgunParse-12          1075049 B/op      24106 allocs/op
BenchmarkBigParseAny-12                25394 B/op        824 allocs/op
BenchmarkBigParseIn-12                 28165 B/op        892 allocs/op
BenchmarkBigParseRetryAmbiguous-12     37880 B/op       1502 allocs/op
BenchmarkShotgunParseErrors-12         67080 B/op       1679 allocs/op
BenchmarkParseAnyErrors-12              3851 B/op        117 allocs/op
Previously, for ambiguous date strings, it was always calling parse twice even when the first parse would have been successful.

Refactor so that parsing isn't re-attempted unless the first parse fails ambiguously.

Benchmark results show that with RetryAmbiguousDateWithSwap(true), it's now about 6.5% faster (ns/op) and reduces allocated bytes by 3.4%.
Optimize the common and special case where mm and dd are the same length, just swap in place. Avoids having to reparse the entire string.

For this case, it's about 30% faster and reduces allocations by about 15%.

This format is especially common, hence the reason to optimize for this case.

Also fix the case for ambiguous date/time in the mm:dd:yyyy format.
Audit every stateDate so every unexpected alternative will fail.

In the process, fixed some newly found bugs:
* Extend format yyyy-mon-dd to allow times to follow it. Also allow full month name.
* Allow full day name before month (e.g., Monday January 4th, 2017)

Relevant confirmatory test cases were added.
New option SimpleErrorMessages that avoids allocation in the error path. It's off by default to preserve backwards compatibility.

Added benchmark BenchmarkBigParseAnyErrors that takes the big set of test cases, and injects errors to make them fail at pseudo-random places.

This optimization speeds up the error path runtime by 4x and reduces error path allocation bytes by 13x!
Reduces CPU usage on large benchmarks by ~2%-3% and prepares for future with international month names in future.
Audited all test cases to make sure an example was listed for all known formats.
Options were not being properly passed to recursive parseTime call.
@arran4
Copy link

arran4 commented Dec 21, 2023

Great work @klondikedragon

@jmdacruz
Copy link

this is great work @klondikedragon! Now, this repo hasn't seen much movement in years, do you think we should start using a fork? should we use yours?

@arran4
Copy link

arran4 commented Dec 24, 2023

I would vote that we use his, we should see if it qualifies for https://github.com/avelino/awesome-go

This is implemented now using the "skip" parser field, indicating
to skip the first N characters. This also avoids a recursive parse
in one more case (more efficient). This simplifies the state machine
a little bit, while the rest of the code needs to properly account
for the value of the skip field.

Also allow whitespace prefix without penalty.

Modify the test suite to psuedo-randomly add a weekday prefix
to the formats that allow it (all except the purely numeric ones).
@klondikedragon
Copy link
Author

In some further testing, I found that weekday prefixes only worked for some date formats, but not for others. So that is fixed now. As a side effect (benefit?), leading whitespace is now allowed/ignored.

Let's see if @araddon has feedback and/or is interested in merging this PR (it's a pretty big change and changes the philosophy a bit to have validation, and also makes the code a little more complex in favor of performance). The changes are large enough now it could break backwards compatibility, so in the very least it should deserve a new major version IMO.

Although I don't want to fork something lightly, since we haven't heard any feedback from @araddon for a few years, it could definitely make sense to go ahead. If there is no comment after the holidays, I think it would make sense to go ahead and fork. This package is a key part of freeform date parsing in the "automatic structured field extraction" logic being built for IT Lightning (this new cloud-based log management platform I'm building). Given that's the case, the IT Lightning org would be willing to maintain the forked repo and work with community issues/contributions, since we're motivated to have best-in-class date recognition & parsing in the log ingestion pipeline. The license would remain the same of course.

The community contributions would help us improve our date parsing, we'd be motivated to put energy into it to keep our date parsing bug-free and comprehensive, and community use of the package might help us get a little exposure to devs/SREs who might become interested in our log management solution. So it should benefit everyone.

All feedback is welcome. What do ya'll think of this proposed plan?

* Merge duplicate states (fixes lots of edge cases)
* Support for +00:00 is consistent with +0000 now
* Support (timezone description) after any offset/name
* Update tests to cover positive/negative cases
* Update example with new supported formats
Fully support the format where a TZ name is in parentheses after the
time (and possibly after an offset). This fixes the broken case where a
4 character TZ name was in parentheses after a time.
@elliot40404
Copy link

great work @klondikedragon . How can i start using this?

@klondikedragon klondikedragon deleted the branch araddon:master January 9, 2024 01:59
@klondikedragon klondikedragon deleted the master branch January 9, 2024 01:59
@klondikedragon
Copy link
Author

I'll go ahead and fork this package. I'm renaming the main branch as part of that.

@klondikedragon
Copy link
Author

The fork is complete and published as v0.1.0 -- again, a huge thanks to @araddon for authoring and maintaining this package for so many years!

The fork is available using go get github.com/itlightning/dateparse -- issues and PRs are welcome.

@elliot40404 @arran4 @jmdacruz -- see what you think and how this updated package works! If this looks good and after incorporating feedback, I think I'll publish a v1.0.0 at some point soon. I'm also curious to get feedback on my log management project too, check out the site/discord if you're interested. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment