
[Perf] Linux/x64: 2 Regressions in Regex on 7/11/2024 6:16:28 PM #104975

Status: Closed · Fixed by #105668
Opened by performanceautofiler bot on Jul 16, 2024 · 14 comments

Labels: arch-x64, area-System.Text.RegularExpressions, os-linux (Linux OS, any supported distro), runtime-coreclr (specific to the CoreCLR runtime)
Milestone: 9.0.0

@performanceautofiler (bot)
Run Information

| Name | Value |
| ---- | ----- |
| Architecture | x64 |
| OS | ubuntu 22.04 |
| Queue | ViperUbuntu |
| Baseline | 8ba8249272917366e2382bb4c67a2347d19d2fb6 |
| Compare | cae3aec80d32316026d85228e12a94dd4887d57f |
| Configs | CompilationMode:tiered, RunKind:micro |

Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig

| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector |
| --------- | -------- | ---- | --------- | ------------ | ------------- |
| Count(Pattern: ".{2,4}(Tom\|Sawyer\|Huckleberry\|Finn)", Options: NonBacktracking) | 45.62 ms | 48.73 ms | 1.07 | 0.03 | False |
| Count(Pattern: ".{0,2}(Tom\|Sawyer\|Huckleberry\|Finn)", Options: NonBacktracking) | 44.85 ms | 48.43 ms | 1.08 | 0.03 | False |

Test Report

Repro

General docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

```bash
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig*'
```
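For reference, a minimal standalone sketch of what these benchmarks measure, outside the BenchmarkDotNet harness (the input file name is a placeholder; the real benchmarks feed the Leipzig corpus, and RegexOptions.NonBacktracking selects the engine under discussion):

```csharp
// Minimal sketch: time Regex.Count with the NonBacktracking engine over a large text.
// "leipzig.txt" is a hypothetical local copy of the corpus used by the benchmark.
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

string input = File.ReadAllText("leipzig.txt");
var regex = new Regex(".{2,4}(Tom|Sawyer|Huckleberry|Finn)", RegexOptions.NonBacktracking);

var sw = Stopwatch.StartNew();
int count = regex.Count(input); // the operation measured by the regressed benchmarks
sw.Stop();

Console.WriteLine($"{count} matches in {sw.ElapsedMilliseconds} ms");
```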

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig.Count(Pattern: ".{2,4}(Tom|Sawyer|Huckleberry|Finn)", Options: NonBacktracking)

ETL Files

Histogram

JIT Disasms

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig.Count(Pattern: ".{0,2}(Tom|Sawyer|Huckleberry|Finn)", Options: NonBacktracking)

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@performanceautofiler performanceautofiler bot added arch-x64 os-linux Linux OS (any supported distro) runtime-coreclr specific to the CoreCLR runtime untriaged New issue has not been triaged by the area owner labels Jul 16, 2024
@DrewScoggins DrewScoggins transferred this issue from dotnet/perf-autofiling-issues Jul 16, 2024
@DrewScoggins DrewScoggins changed the title [Perf] Linux/x64: 2 Regressions on 7/11/2024 6:16:28 PM [Perf] Linux/x64: 2 Regressions in Regex on 7/11/2024 6:16:28 PM Jul 16, 2024

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@DrewScoggins (Member)

Seems related to this PR: #102655.

@veanes (Contributor) commented Jul 16, 2024

My guess is that the 7–8% perf regression on this (Twain) benchmark is related to a tradeoff in the search heuristics that, on the other hand, boosted other benchmarks, but I've asked @ieviev to take a closer look into this too.

@ericstj ericstj removed the untriaged New issue has not been triaged by the area owner label Jul 17, 2024
@ericstj ericstj added this to the 9.0.0 milestone Jul 17, 2024
@ericstj (Member) commented Jul 17, 2024

cc @stephentoub @veanes @ieviev

@ieviev (Contributor) commented Jul 17, 2024

Some trade-offs were definitely expected, I will have a closer look at the benchmarks soon.

There's one substantial regression in dotnet/perf-autofiling-issues#38275, going from 55.81 μs to 76.42 μs with (?m)^Sherlock Holmes|Sherlock Holmes$.

I remember the patterns .{0,2}(Tom|Sawyer|Huckleberry|Finn) and .{2,4}(Tom|Sawyer|Huckleberry|Finn) being faster than before; perhaps that changed somewhere along the way.

@ieviev (Contributor) commented Jul 17, 2024

This reproduces on my machine as well:

The problem here is that we're doing more initial checks per match to speed up the inner loop, which causes the ~5% slowdown, but for these patterns we never really use the inner loop to its full extent either. These two benchmarks could be ~25% faster with findOptimizations disabled.


The change would be to replace the `_findOpts` assignment with:

```csharp
if ((findOptimizations.FindMode is FindNextStartingPositionMode.LeadingSet_LeftToRight or FindNextStartingPositionMode.FixedDistanceSets_LeftToRight) &&
    findOptimizations.FixedDistanceSets![0].Negated)
{
    // A negated leading set (e.g. [^\n]) matches almost every character,
    // so the vectorized prefix search rarely skips ahead; turn it off.
    _findOpts = null;
}
else
{
    _findOpts = findOptimizations;
}
```

This would disable the vectorized (AVX) search for negated sets like [^\n]. At the end of the PR I decided to avoid special-casing, since it's complex and hard to know when it's slower or faster, but perhaps turning it off for large negated sets is reliable enough and speeds things up in most cases. Generally I think the speedup in other cases is big enough to justify some losses of around 5% here and there, but there will be input-dependent tradeoffs either way.
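As a standalone illustration of why a negated set makes a poor prefilter (a sketch, not runtime code): nearly every character of typical text is in [^\n], so a vectorized scan for the set's next occurrence usually stops at the very first position, paying its call overhead on every match attempt without skipping any input:

```csharp
using System;

string haystack = "the quick brown fox jumps over the lazy dog\n";

// Finding the next character in [^\n] is the same as skipping newlines;
// on typical text this returns 0, i.e. the scan advances nowhere.
int next = haystack.AsSpan().IndexOfAnyExcept('\n');
Console.WriteLine(next); // 0
```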

@stephentoub thoughts on special casing this?

@ieviev (Contributor) commented Jul 17, 2024

> There's one substantial regression in dotnet/perf-autofiling-issues#38275, going from 55.81 μs to 76.42 μs with (?m)^Sherlock Holmes|Sherlock Holmes$.

This does not reproduce on my machine at all; can someone else confirm it?


@stephentoub (Member) commented Jul 22, 2024

> @stephentoub thoughts on special casing this?

I'd be OK with special-casing negated sets for non-backtracking if we measure on our known set of benchmarks and demonstrate that it does more good than harm. I'd prefer, however, that we do it as part of actually picking the optimizations, e.g. at:

```csharp
else
{
    // Limit how many sets we use to avoid doing lots of unnecessary work.  The list was already
    // sorted from best to worst, so just keep the first ones up to our limit.
    const int MaxSetsToUse = 3; // arbitrary tuned limit
    if (fixedDistanceSets.Count > MaxSetsToUse)
    {
        fixedDistanceSets.RemoveRange(MaxSetsToUse, fixedDistanceSets.Count - MaxSetsToUse);
    }

    // Store the sets, and compute which mode to use.
    FixedDistanceSets = fixedDistanceSets;
    FindMode = (fixedDistanceSets.Count == 1 && fixedDistanceSets[0].Distance == 0) ?
        FindNextStartingPositionMode.LeadingSet_LeftToRight :
        FindNextStartingPositionMode.FixedDistanceSets_LeftToRight;
    _asciiLookups = new uint[fixedDistanceSets.Count][];
}
```

rather than subsequently determining in the non-backtracking engine whether to store the ones already picked.
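A hypothetical, simplified sketch of that idea (the type and member names are assumptions mirroring the quoted snippet, not the actual runtime API): fold the negated-set check into set selection itself, so a negated best set never becomes a prefilter in the first place:

```csharp
using System.Collections.Generic;

internal sealed record FixedDistanceSet(string Chars, bool Negated, int Distance);

internal static class PrefilterSelection
{
    public static List<FixedDistanceSet>? PickSets(List<FixedDistanceSet> candidates)
    {
        const int MaxSetsToUse = 3; // same arbitrary tuned limit as the quoted code

        if (candidates.Count == 0 || candidates[0].Negated)
        {
            // The best available set is negated (e.g. [^\n]) and matches nearly
            // every character, so a vectorized scan for it rarely advances the
            // search; return no prefilter and let the engine run directly.
            return null;
        }

        // Keep only the best few sets, as in the quoted selection logic.
        if (candidates.Count > MaxSetsToUse)
        {
            candidates.RemoveRange(MaxSetsToUse, candidates.Count - MaxSetsToUse);
        }

        return candidates;
    }
}
```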

@stephentoub (Member)

That said, I'm still nervous about the possible ramifications of turning off the find optimizations. It's quite possible it's the right choice, but doing it in response to this feels incongruent. We were using these find optimizations with these patterns both before and after the recent round of non-backtracking optimizations, right? So the problem is actually what's cited as "we're doing more initial checks per match to speed up the inner loop, which causes the ~5% slowdown"... is there no way to re-reduce that overhead? Which initial checks are you referring to?

@ieviev
Copy link
Contributor

ieviev commented Jul 22, 2024

@stephentoub

> That said, I'm still nervous about the possible ramifications of turning off the find optimizations. It's quite possible it's the right choice, but doing it in response to this feels incongruent. We were using these find optimizations with these patterns both before and after the recent round of non-backtracking optimizations, right? So the problem is actually what's cited as "we're doing more initial checks per match to speed up the inner loop, which causes the ~5% slowdown"... is there no way to re-reduce that overhead? Which initial checks are you referring to?

Yes, this is the same as before. My initial thought was that the overhead could be from the extra work done entering/leaving this function per match, but I'm uncertain about that.

```csharp
private bool FindEndPositionDeltasDFAOptimized<TInitialStateHandler, TOptimizedNullabilityHandler>(
    ReadOnlySpan<char> input, int lengthMinus1, RegexRunnerMode mode,
    long timeoutOccursAt, ref int posRef, ref int currentStateIdRef, ref int endPosRef)
    where TInitialStateHandler : struct, IInitialStateHandler
    where TOptimizedNullabilityHandler : struct, IDfaNoZAnchorOptimizedNullabilityHandler
{
    // Initial check for input end lifted out of the subsequent hot-path loop.
    if (posRef == input.Length)
    {
        if (_stateArray[currentStateIdRef]!.IsNullableFor(_positionKinds[0]))
        {
            // the end position kind was nullable
            endPosRef = posRef;
        }

        return true;
    }

    // To avoid frequent reads/writes to ref and out values, make and operate on local copies,
    // which we then copy back once before returning.
    int pos = posRef;
    int currStateId = currentStateIdRef;
    int endPos = endPosRef;
    byte[] mintermsLookup = _mintermClassifier.ByteLookup!;
    int deadStateId = _deadStateId;
    int initialStateId = _initialStateId;

    try
    {
        // The goal is to make this loop as fast as it can possibly be;
        // every single piece of overhead should be removed here.
        while (true)
        {
            if (currStateId == deadStateId)
            {
                return true;
            }

            if (TInitialStateHandler.IsOptimized && currStateId == initialStateId)
            {
                TInitialStateHandler.TryFindNextStartingPosition(this, input, ref currStateId, ref pos, mintermsLookup);
                if (pos == input.Length)
                {
                    // patterns such as ^$ can be nullable right away
                    if (_stateArray[currStateId]!.IsNullableFor(_positionKinds[0]))
                    {
                        // the end position kind was nullable
                        endPos = pos;
                    }

                    currStateId = deadStateId;
                    return true;
                }
            }

            // Get the next character.
            char c = input[pos];

            // If the state is nullable for the next character, we found a potential end state.
            if (TOptimizedNullabilityHandler.IsNullable(this, _nullabilityArray[currStateId], c, mintermsLookup))
            {
                endPos = pos;

                // A match is known to exist.  If that's all we need to know, we're done.
                if (mode == RegexRunnerMode.ExistenceRequired)
                {
                    return true;
                }
            }

            // If there is more input available, try to transition with the next character.
            // Note: the order here is important so the transition itself gets taken.
            if (!DfaStateHandler.TryTakeDFATransition(this, ref currStateId, GetMintermId(mintermsLookup, c), timeoutOccursAt) ||
                pos >= lengthMinus1)
            {
                if (pos + 1 < input.Length)
                {
                    return false;
                }

                pos++;

                // One-off check for the final position,
                // just to move it out of the hot loop.
                if (!(_stateFlagsArray[currStateId].IsNullable() ||
                      _stateArray[currStateId]!.IsNullableFor(GetPositionKind(-1))))
                {
                    return true;
                }

                // the end position (-1) was nullable
                endPos = pos;
                return true;
            }

            // We successfully transitioned, so update our current input index to match.
            pos++;
        }
    }
    finally
    {
        // Write back the local copies of the ref values.
        posRef = pos;
        endPosRef = endPos;
        currentStateIdRef = currStateId;
    }
}
```

What I wanted to point out is that if the performance of this individual benchmark is important, then a lot more could be gained elsewhere. I understand it's better for maintainability overall not to make the engine too different from the rest, so it's definitely OK to leave it on as well.

Profiling shows that 99% of the time is spent in FindEndPositionDeltasDFAOptimized, so the only source of extra loop overhead compared to the pre-PR state I can think of is this if case after the prefix search:

```csharp
TInitialStateHandler.TryFindNextStartingPosition(this, input, ref currStateId, ref pos, mintermsLookup);
if (pos == input.Length)
{
    // patterns such as ^$ can be nullable right away
    if (_stateArray[currStateId]!.IsNullableFor(_positionKinds[0]))
    {
        // the end position kind was nullable
        endPos = pos;
    }

    currStateId = deadStateId;
    return true;
}
```

Another source of overhead could be memory: the minterm lookup array used to be strictly 128 elements for ASCII, and the benchmark input contains only ASCII. I think the `.` in these patterns forces a larger table here, which perhaps lands in a different level of cache.
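To make the cache point concrete, a toy sketch (the class shape and fallback behavior are assumptions for illustration, not the actual MintermClassifier): a 128-entry byte table for ASCII fits in two 64-byte cache lines, while a pattern containing `.` needs classifications for a much larger character range, so even pure-ASCII input may walk a bigger, less cache-friendly table:

```csharp
internal sealed class MintermLookupSketch
{
    private readonly byte[] _lookup;

    // tableSize: 128 for an ASCII-only pattern; far larger once `.` (any char
    // except \n) forces classifications across the full character range.
    public MintermLookupSketch(int tableSize) => _lookup = new byte[tableSize];

    public int GetMintermId(char c) =>
        // Characters beyond the table share one fallback id (an assumption in
        // this sketch; the real classifier handles the remainder differently).
        c < (uint)_lookup.Length ? _lookup[c] : 0;
}
```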

With that said, the performance loss here looks marginal to me. If there are no large performance losses, and in exchange we gained performance somewhere else, then it's fine to leave things as is. If ASCII performance is vital, making the engine work on bytes directly would yield far more. If Regex.Count performance is vital, it could use a separate loop that doesn't do extra per-match work. If consistency is vital, findOptimizations could always be off, so it behaves like RE2. It's not possible to win in all cases without making sacrifices somewhere.

@veanes (Contributor) commented Jul 26, 2024

Let me also add that I agree with Ian here: I don't see special-casing this as worthwhile and think we can leave it as is. I'm going to meet with Ian in a couple of weeks in Estonia, where one of the things we'll discuss is further steps for other optimizations.

@stephentoub (Member)

The regressions were indeed fixed (more than fixed) by #105668.

@veanes (Contributor) commented Aug 5, 2024

Awesome that you figured out the fix for these regressions, Stephen. Thank you!
