Do not reallocate whole input string when matching next token #119
Conversation
The PR is finished, and the performance improvement (of the "long concat" tests with the new 20k elements limit) is the following:

before:
after:

So the tokenization process is now about 10x faster, and with increased query length the speedup is even bigger. This PR is also faster on the

before:
after:
commit removed
Are there ways you could split this into several smaller PRs?
Yes - done now. Here we focus on "no reallocation", i.e. using
Is 3c2a6db about avoiding allocations? It seems to allocate more variables.
Such short allocations are fast and more readable. However, I tested it - the offset variant is slightly faster.
Since it is not related (and even the opposite of what this PR is about - reducing allocations), let's move the 2 commits into a separate PR, please.
Like really, do I have to use a separate PR for using a separate variable, when this PR is refactoring this topic and the performance was tested?
Do I have to deal with PRs with 7 commits? 7, really? What the hell? |
This PR had better be strictly about removing allocations, not adding some.
`^` matches the string start; `\G` is similar, but it matches at the start position given by the 5th `preg_match` argument, `$offset`.
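The same anchored-at-offset matching exists outside PCRE. As a minimal sketch of the technique (in Python rather than the project's PHP, and with a hypothetical token pattern), `re.Pattern.match(string, pos)` anchors the match at `pos` the way `\G` plus `$offset` does, so the lexer can walk the input without ever copying the remaining string:

```python
import re

# Hypothetical token pattern for illustration; the real lexer's
# patterns live in the project itself.
token = re.compile(r"[A-Za-z_]+|\d+|\s+")

sql = "SELECT 1"
offset = 0
tokens = []
while offset < len(sql):
    # match() anchors at `offset`, like PCRE's \G combined with
    # preg_match's $offset argument -- no substring of `sql` is copied.
    m = token.match(sql, offset)
    if m is None:
        break
    tokens.append(m.group(0))
    offset = m.end()

print(tokens)  # ['SELECT', ' ', '1']
```

Note that a plain `^` anchor would not work here: it only matches the true start of the string, not the current scan position.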
done

Thanks @mvorisek !
Purely a performance refactoring.
The measured performance improvement is about a 90% runtime reduction for the included test. With larger test/SQL inputs, the speedup is even more dramatic, as the original complexity was O(N^2).
Originally, for each token match:
- the whole input string was reallocated in:
- that big temporary string was uppercased in:
- the regex patterns were constructed freshly in: (extracted into another PR)

All such operations should be avoided for linear scalability, and this PR addresses that.
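To illustrate why the reallocation matters, here is a minimal Python sketch (not the project's PHP code; the token pattern is hypothetical) contrasting the original slice-per-token approach with the offset-based one this PR introduces:

```python
import re

# Hypothetical token pattern, for illustration only: a run of
# non-whitespace or a run of whitespace.
TOKEN = re.compile(r"\S+|\s+")

def tokenize_slicing(s):
    # Original style: the remaining input is re-copied after every
    # token, so total work is O(N^2) in the input length.
    tokens = []
    while s:
        m = TOKEN.match(s)
        tokens.append(m.group(0))
        s = s[m.end():]  # reallocates the whole remaining string
    return tokens

def tokenize_offset(s):
    # Fixed style: advance an integer offset instead; the input
    # string is never copied, giving O(N) total work.
    tokens = []
    offset = 0
    while offset < len(s):
        m = TOKEN.match(s, offset)
        tokens.append(m.group(0))
        offset = m.end()
    return tokens

query = "SELECT concat('a', 'b')"
assert tokenize_slicing(query) == tokenize_offset(query)
```

Both functions produce identical token streams; only the second scales linearly, which matches the roughly 10x speedup reported above on the long-concat test.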