Skip to content

Commit

Permalink
Streamline Regex path to matching, and improve Replace/Split (dotnet#…
Browse files Browse the repository at this point in the history
…1950)

* Add ThrowHelper, and clean up some style

Trying to streamline main path to the engine, ensuring helpers can be inlined, reducing boilerplate, etc.  And as long as ThrowHelper was being used in some places, used it in others where it didn't require adding additional methods.

Also cleaned style along the way.

* Streamline the Scan loop

The main costs remaining are the virtual calls to FindFirstChar/Go.

* Enumerate matches with a reusable match object in Regex.Replace/Split

The cost of re-entering the scan implementation and creating a new Match object for each NextMatch is measurable, but in these cases, we can use an iterator to quickly get back to where we were and reuse the match object.  It adds a couple interface calls per iteration, as well as an allocation for the enumerator, but it saves more than that in the common case.

* Use SegmentStringBuilder instead of ValueStringBuilder in Replace

A previous .NET Core release saw StringBuilder in Regex.Replace replaced by ValueStringBuilder.  This was done to avoid the allocations from the StringBuilder.  However, in some ways, for large input strings, it made things worse.  StringBuilder is implemented as a linked list of builders, whereas ValueStringBuilder is contiguous memory taken from the ArrayPool.  For large input strings, we start requesting buffers too large for the ArrayPool, and thus when we grow we generate large array allocations that become garbage.

We're better off using a simple arraypool-backed struct to store the segments that need to be concatenated, and then just creating the final string from those segments.  For the common case where there are many fewer replacements than the length of the string, this saves a lot of memory as well as avoiding a layer of copying.

* Replace previously added enumerator with a callback mechanism

The delegate invocation per match is faster than the two interface calls per match, plus we can avoid allocating the enumerator and just pass the relevant state through by ref.

* Remove unnecessary fields from MatchCollection

* More exception streamlining and style cleanup

* Address PR feedback, and fix merge with latest changes
  • Loading branch information
stephentoub committed Jan 21, 2020
1 parent e5444a1 commit c469f2d
Show file tree
Hide file tree
Showing 25 changed files with 1,100 additions and 956 deletions.
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<Project Sdk="Microsoft.NET.Sdk">
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<AssemblyName>System.Text.RegularExpressions</AssemblyName>
<DefineConstants>$(DefineConstants);FEATURE_COMPILED</DefineConstants>
Expand All @@ -7,17 +7,18 @@
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<Compile Include="System\Collections\Generic\ValueListBuilder.Pop.cs" />
<Compile Include="System\Collections\HashtableExtensions.cs" />
<Compile Include="System\Collections\Generic\ValueListBuilder.Pop.cs" />
<Compile Include="System\Text\SegmentStringBuilder.cs" />
<Compile Include="System\Text\RegularExpressions\Capture.cs" />
<Compile Include="System\Text\RegularExpressions\CaptureCollection.cs" />
<Compile Include="System\Text\RegularExpressions\CollectionDebuggerProxy.cs" />
<Compile Include="System\Text\RegularExpressions\Group.cs" />
<Compile Include="System\Text\RegularExpressions\GroupCollection.cs" />
<Compile Include="System\Text\RegularExpressions\Match.cs" />
<Compile Include="System\Text\RegularExpressions\MatchCollection.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Cache.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Cache.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Match.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Replace.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Split.cs" />
Expand All @@ -41,6 +42,7 @@
<Compile Include="System\Text\RegularExpressions\RegexRunnerFactory.cs" />
<Compile Include="System\Text\RegularExpressions\RegexTree.cs" />
<Compile Include="System\Text\RegularExpressions\RegexWriter.cs" />
<Compile Include="System\Text\RegularExpressions\ThrowHelper.cs" />
<!-- Files that enable compiled feature -->
<Compile Include="System\Text\RegularExpressions\CompiledRegexRunnerFactory.cs" />
<Compile Include="System\Text\RegularExpressions\CompiledRegexRunner.cs" />
Expand All @@ -56,7 +58,6 @@
<Compile Include="$(CommonPath)System\Text\ValueStringBuilder.cs">
<Link>Common\System\Text\ValueStringBuilder.cs</Link>
</Compile>
<Compile Include="System\Text\ValueStringBuilder.Reverse.cs" />
</ItemGroup>
<ItemGroup>
<Reference Include="System.Buffers" />
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,6 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// Capture is just a location/length pair that indicates the
// location of a regular expression match. A single regexp
// search may return multiple Capture within each capturing
// RegexGroup.

namespace System.Text.RegularExpressions
{
/// <summary>
Expand All @@ -22,40 +17,27 @@ internal Capture(string text, int index, int length)
Length = length;
}

/// <summary>
/// Returns the position in the original string where the first character of
/// captured substring was found.
/// </summary>
/// <summary>Returns the position in the original string where the first character of captured substring was found.</summary>
public int Index { get; private protected set; }

/// <summary>
/// Returns the length of the captured substring.
/// </summary>
/// <summary>Returns the length of the captured substring.</summary>
public int Length { get; private protected set; }

/// <summary>
/// The original string
/// </summary>
/// <summary>The original string</summary>
internal string Text { get; private protected set; }

/// <summary>
/// Returns the value of this Regex Capture.
/// </summary>
public string Value => Text.Substring(Index, Length);

/// <summary>
/// Returns the substring that was matched.
/// </summary>
/// <summary>Returns the substring that was matched.</summary>
public override string ToString() => Value;

/// <summary>
/// The substring to the left of the capture
/// </summary>
internal ReadOnlySpan<char> GetLeftSubstring() => Text.AsSpan(0, Index);
/// <summary>The substring to the left of the capture</summary>
internal ReadOnlyMemory<char> GetLeftSubstring() => Text.AsMemory(0, Index);

/// <summary>
/// The substring to the right of the capture
/// </summary>
internal ReadOnlySpan<char> GetRightSubstring() => Text.AsSpan(Index + Length, Text.Length - Index - Length);
/// <summary>The substring to the right of the capture</summary>
internal ReadOnlyMemory<char> GetRightSubstring() => Text.AsMemory(Index + Length, Text.Length - Index - Length);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -2,19 +2,12 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// The CaptureCollection lists the captured Capture numbers
// contained in a compiled Regex.

using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

namespace System.Text.RegularExpressions
{
// This collection returns the Captures for a group
// in the order in which they were matched (left to right
// or right to left). It is created by Group.Captures.

/// <summary>
/// Represents a sequence of capture substrings. The object is used
/// to return the set of captures done by a single capturing group.
Expand All @@ -35,36 +28,32 @@ internal CaptureCollection(Group group)

public bool IsReadOnly => true;

/// <summary>
/// Returns the number of captures.
/// </summary>
/// <summary>Returns the number of captures.</summary>
public int Count => _capcount;

/// <summary>
/// Returns a specific capture, by index, in this collection.
/// </summary>
/// <summary>Returns a specific capture, by index, in this collection.</summary>
public Capture this[int i] => GetCapture(i);

/// <summary>
/// Provides an enumerator in the same order as Item[].
/// </summary>
/// <summary>Provides an enumerator in the same order as Item[].</summary>
public IEnumerator GetEnumerator() => new Enumerator(this);

IEnumerator<Capture> IEnumerable<Capture>.GetEnumerator() => new Enumerator(this);

/// <summary>
/// Returns the set of captures for the group
/// </summary>
/// <summary>Returns the set of captures for the group</summary>
private Capture GetCapture(int i)
{
if (i == _capcount - 1 && i >= 0)
if ((uint)i == _capcount - 1)
{
return _group;
}

if (i >= _capcount || i < 0)
throw new ArgumentOutOfRangeException(nameof(i));
{
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.i);
}

// first time a capture is accessed, compute them all
if (_captures == null)
if (_captures is null)
{
ForceInitialized();
Debug.Assert(_captures != null);
Expand All @@ -91,8 +80,10 @@ internal void ForceInitialized()

public void CopyTo(Array array, int arrayIndex)
{
if (array == null)
throw new ArgumentNullException(nameof(array));
if (array is null)
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.array);
}

for (int i = arrayIndex, j = 0; j < Count; i++, j++)
{
Expand All @@ -102,12 +93,18 @@ public void CopyTo(Array array, int arrayIndex)

public void CopyTo(Capture[] array, int arrayIndex)
{
if (array == null)
throw new ArgumentNullException(nameof(array));
if (arrayIndex < 0 || arrayIndex > array.Length)
throw new ArgumentOutOfRangeException(nameof(arrayIndex));
if (array is null)
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.array);
}
if ((uint)arrayIndex > (uint)array.Length)
{
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.arrayIndex);
}
if (array.Length - arrayIndex < Count)
{
throw new ArgumentException(SR.Arg_ArrayPlusOffTooSmall);
}

for (int i = arrayIndex, j = 0; j < Count; i++, j++)
{
Expand All @@ -128,77 +125,57 @@ int IList<Capture>.IndexOf(Capture item)
return -1;
}

void IList<Capture>.Insert(int index, Capture item)
{
void IList<Capture>.Insert(int index, Capture item) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

void IList<Capture>.RemoveAt(int index)
{
void IList<Capture>.RemoveAt(int index) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

Capture IList<Capture>.this[int index]
{
get { return this[index]; }
set { throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection); }
get => this[index];
set => throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

void ICollection<Capture>.Add(Capture item)
{
void ICollection<Capture>.Add(Capture item) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

void ICollection<Capture>.Clear()
{
void ICollection<Capture>.Clear() =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

bool ICollection<Capture>.Contains(Capture item) =>
((IList<Capture>)this).IndexOf(item) >= 0;

bool ICollection<Capture>.Remove(Capture item)
{
bool ICollection<Capture>.Remove(Capture item) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

int IList.Add(object? value)
{
int IList.Add(object? value) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

void IList.Clear()
{
void IList.Clear() =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

bool IList.Contains(object? value) =>
value is Capture && ((ICollection<Capture>)this).Contains((Capture)value);
value is Capture other && ((ICollection<Capture>)this).Contains(other);

int IList.IndexOf(object? value) =>
value is Capture ? ((IList<Capture>)this).IndexOf((Capture)value) : -1;
value is Capture other ? ((IList<Capture>)this).IndexOf(other) : -1;

void IList.Insert(int index, object? value)
{
void IList.Insert(int index, object? value) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

bool IList.IsFixedSize => true;

void IList.Remove(object? value)
{
void IList.Remove(object? value) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

void IList.RemoveAt(int index)
{
void IList.RemoveAt(int index) =>
throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

object? IList.this[int index]
{
get { return this[index]; }
set { throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection); }
get => this[index];
set => throw new NotSupportedException(SR.NotSupported_ReadOnlyCollection);
}

private sealed class Enumerator : IEnumerator<Capture>
Expand All @@ -219,7 +196,9 @@ public bool MoveNext()
int size = _collection.Count;

if (_index >= size)
{
return false;
}

_index++;

Expand All @@ -231,18 +210,17 @@ public Capture Current
get
{
if (_index < 0 || _index >= _collection.Count)
{
throw new InvalidOperationException(SR.EnumNotStarted);
}

return _collection[_index];
}
}

object IEnumerator.Current => Current;

void IEnumerator.Reset()
{
_index = -1;
}
void IEnumerator.Reset() => _index = -1;

void IDisposable.Dispose() { }
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,8 @@ internal sealed class CollectionDebuggerProxy<T>
{
private readonly ICollection<T> _collection;

public CollectionDebuggerProxy(ICollection<T> collection)
{
public CollectionDebuggerProxy(ICollection<T> collection) =>
_collection = collection ?? throw new ArgumentNullException(nameof(collection));
}

[DebuggerBrowsable(DebuggerBrowsableState.RootHidden)]
public T[] Items
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,6 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// Group represents the substring or substrings that
// are captured by a single capturing group after one
// regular expression match.

namespace System.Text.RegularExpressions
{
/// <summary>
Expand All @@ -22,17 +18,14 @@ public class Group : Capture
internal CaptureCollection? _capcoll;

internal Group(string text, int[] caps, int capcount, string name)
: base(text, capcount == 0 ? 0 : caps[(capcount - 1) * 2],
capcount == 0 ? 0 : caps[(capcount * 2) - 1])
: base(text, capcount == 0 ? 0 : caps[(capcount - 1) * 2], capcount == 0 ? 0 : caps[(capcount * 2) - 1])
{
_caps = caps;
_capcount = capcount;
Name = name;
}

/// <summary>
/// Indicates whether the match is successful.
/// </summary>
/// <summary>Indicates whether the match is successful.</summary>
public bool Success => _capcount != 0;

public string Name { get; }
Expand All @@ -45,13 +38,14 @@ internal Group(string text, int[] caps, int capcount, string name)
public CaptureCollection Captures => _capcoll ??= new CaptureCollection(this);

/// <summary>
/// Returns a Group object equivalent to the one supplied that is safe to share between
/// multiple threads.
/// Returns a Group object equivalent to the one supplied that is safe to share between multiple threads.
/// </summary>
public static Group Synchronized(Group inner)
{
if (inner == null)
throw new ArgumentNullException(nameof(inner));
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.inner);
}

// force Captures to be computed.
CaptureCollection capcoll = inner.Captures;
Expand Down
Loading

0 comments on commit c469f2d

Please sign in to comment.