Do not reallocate whole input string when matching next token #119
Conversation
The PR is finished, and the performance improvement (of the "long concat" tests with the new 20k elements limit) is the following:

before:
after:

So the tokenization process is now about 10x faster, and with increased query length the speedup is even bigger. This PR is also faster on the

before:
after:
commit removed
Are there ways you could split this into several smaller PRs?
Yes - done now. Here we focus on "no reallocation", i.e. using
Is 3c2a6db about avoiding allocations? It seems to allocate more variables.
Such short allocations are fast and more readable. However, I tested it - the offset variant is slightly faster.
Since it is not related (and even the opposite of what this PR is about - reducing allocations), let's move the 2 commits into a separate PR, please.
Like really, do I have to use a separate PR for using a separate variable, when this PR is refactoring this topic and the performance was tested?
Do I have to deal with PRs with 7 commits? 7, really? What the hell? |
This PR had better be strictly about removing allocations, not adding some.
`^` matches the string start; `\G` is similar, but it matches at the start position given by the 5th `preg_match` argument, `$offset`.
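The same anchored-at-offset matching exists outside PCRE. As a minimal sketch of the technique (in Python rather than the project's PHP, and with a hypothetical token pattern), `re.Pattern.match(string, pos)` anchors the match at `pos` the way `\G` plus `$offset` does, so the lexer can walk the input without ever copying the remaining string:

```python
import re

# Hypothetical token pattern for illustration; the real lexer's
# patterns live in the project itself.
token = re.compile(r"[A-Za-z_]+|\d+|\s+")

sql = "SELECT 1"
offset = 0
tokens = []
while offset < len(sql):
    # match() anchors at `offset`, like PCRE's \G combined with
    # preg_match's $offset argument -- no substring of `sql` is copied.
    m = token.match(sql, offset)
    if m is None:
        break
    tokens.append(m.group(0))
    offset = m.end()

print(tokens)  # ['SELECT', ' ', '1']
```

Note that a plain `^` anchor would not work here: it only matches the true start of the string, not the current scan position.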
done

Thanks @mvorisek !
Purely a performance refactoring.
The measured performance improvement is about a 90% runtime reduction for the included test. With larger test/SQL inputs, the speedup is even more dramatic, as the original complexity was O(N^2).
Originally, for each token match:
- the whole input string was reallocated in:
- that big temporary string was uppercased in:
- the regex patterns were constructed freshly in: (extracted into another PR)

All such operations should be avoided for linear scalability, and this PR addresses that.
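To illustrate why the reallocation matters, here is a minimal Python sketch (not the project's PHP code; the token pattern is hypothetical) contrasting the original slice-per-token approach with the offset-based one this PR introduces:

```python
import re

# Hypothetical token pattern, for illustration only: a run of
# non-whitespace or a run of whitespace.
TOKEN = re.compile(r"\S+|\s+")

def tokenize_slicing(s):
    # Original style: the remaining input is re-copied after every
    # token, so total work is O(N^2) in the input length.
    tokens = []
    while s:
        m = TOKEN.match(s)
        tokens.append(m.group(0))
        s = s[m.end():]  # reallocates the whole remaining string
    return tokens

def tokenize_offset(s):
    # Fixed style: advance an integer offset instead; the input
    # string is never copied, giving O(N) total work.
    tokens = []
    offset = 0
    while offset < len(s):
        m = TOKEN.match(s, offset)
        tokens.append(m.group(0))
        offset = m.end()
    return tokens

query = "SELECT concat('a', 'b')"
assert tokenize_slicing(query) == tokenize_offset(query)
```

Both functions produce identical token streams; only the second scales linearly, which matches the roughly 10x speedup reported above on the long-concat test.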