Count substring occurrences in a string #73441

LeaFrock · 2022-08-05T08:12:32Z

As the title says, I ran into an interesting question while comparing ways to count substring occurrences.

using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

    [MemoryDiagnoser]
    public class SubStringCountTest
    {
        private readonly string Source;
        private readonly string Value;

        public SubStringCountTest()
        {
            Source = "eteeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeteeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeete";
            Value = "ete";
        }

        [Benchmark(Baseline = true)]
        public int Replace_Count()
        {
            string strReplaced = Source.Replace(Value, "");
            return (Source.Length - strReplaced.Length) / Value.Length;
        }

        [Benchmark]
        public int IndexOf_Count()
        {
            int count = 0;
            int index = Source.IndexOf(Value);
            while (index >= 0)
            {
                count++;
                index = Source.IndexOf(Value, index + Value.Length);
            }
            return count;
        }

        [Benchmark]
        public int Regex_Count()
        {
            return Regex.Matches(Source, Value).Count;
        }
    }

The benchmark results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1826 (21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

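|        Method |       Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Allocated |
|-------------- |-----------:|---------:|---------:|------:|--------:|-------:|----------:|
| Replace_Count |   576.4 ns |  2.74 ns |  2.29 ns |  1.00 |    0.00 | 0.0305 |     192 B |
| IndexOf_Count | 8,495.4 ns | 18.06 ns | 16.01 ns | 14.74 |    0.07 |      - |         - |
|   Regex_Count |   585.6 ns |  2.31 ns |  2.05 ns |  1.02 |    0.01 | 0.1221 |     768 B |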

I'm surprised that String.IndexOf is the slowest by far. Any comments?

EgorBo (Collaborator) · 2022-08-05T09:32:58Z

Culture-aware IndexOf is slow by definition because it uses ICU under the hood. If ordinal comparison works for you, consider passing StringComparison.Ordinal, e.g.:

public int IndexOf_Count()
{
    int count = 0;
    int index = Source.IndexOf(Value, StringComparison.Ordinal);
    while (index >= 0)
    {
        count++;
        index = Source.IndexOf(Value, index + Value.Length, StringComparison.Ordinal);
    }
    return count;
}

However, for the best performance you need .NET 7.0, where we introduced a new algorithm for substring IndexOf (#63285).

.NET 6.0 CurrentCulture (ICU)

|        Method |     Mean |
|-------------- |---------:|
| IndexOf_Count | 358.8 ns |

.NET 6.0 Ordinal

|        Method |     Mean |
|-------------- |---------:|
| IndexOf_Count | 485.5 ns |

.NET 7.0 Ordinal

|        Method |     Mean |
|-------------- |---------:|
| IndexOf_Count | 38.80 ns |
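
For reuse outside the benchmark, the ordinal loop above could be wrapped in a small helper. This is only a sketch: the class name SubstringCounter and the method name CountOrdinal are made up for illustration, and returning 0 for an empty value is an assumption about the desired behavior. The guard itself matters, because IndexOf with an empty value returns the start index and the loop would never advance. Like the loop above, it counts non-overlapping occurrences.

```csharp
using System;

public static class SubstringCounter
{
    // Counts non-overlapping occurrences of value in source using ordinal
    // (culture-insensitive) comparison.
    // Hypothetical helper, not part of the original benchmark.
    public static int CountOrdinal(string source, string value)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (value is null) throw new ArgumentNullException(nameof(value));

        // Assumption: an empty value counts as zero occurrences. Without this
        // guard the loop below would never terminate, because IndexOf("")
        // returns the start index and index + 0 does not advance.
        if (value.Length == 0) return 0;

        int count = 0;
        int index = source.IndexOf(value, StringComparison.Ordinal);
        while (index >= 0)
        {
            count++;
            // Continue after the end of the current match, so matches
            // never overlap.
            index = source.IndexOf(value, index + value.Length, StringComparison.Ordinal);
        }
        return count;
    }
}
```

For example, SubstringCounter.CountOrdinal("aaaa", "aa") yields 2 rather than 3, since each search resumes after the previous match.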

1 reply

LeaFrock Aug 5, 2022
Author

That's awesome!

LeaFrock (Author) · 2022-08-05T10:02:24Z

This time, with StringComparison.Ordinal, I get a more reasonable result.

using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

    [MemoryDiagnoser]
    public class SubStringCountTest
    {
        private readonly string Source;
        private readonly string Value;

        public SubStringCountTest()
        {
            Source = "eteeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeteeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeete";
            Value = "ete";
        }

        [Benchmark(Baseline = true)]
        public int Replace_Count()
        {
            string strReplaced = Source.Replace(Value, "", StringComparison.Ordinal);
            return (Source.Length - strReplaced.Length) / Value.Length;
        }

        [Benchmark]
        public int IndexOf_Count()
        {
            int count = 0;
            int index = Source.IndexOf(Value, StringComparison.Ordinal);
            while (index >= 0)
            {
                count++;
                index = Source.IndexOf(Value, index + Value.Length, StringComparison.Ordinal);
            }
            return count;
        }

        [Benchmark]
        public int Regex_Count()
        {
            return Regex.Matches(Source, Value).Count;
        }
    }

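The results:

|        Method |     Mean |   Error |  StdDev | Ratio |  Gen 0 | Allocated |
|-------------- |---------:|--------:|--------:|------:|-------:|----------:|
| Replace_Count | 588.9 ns | 2.12 ns | 1.99 ns |  1.00 | 0.0305 |     192 B |
| IndexOf_Count | 563.3 ns | 0.82 ns | 0.73 ns |  0.96 |      - |         - |
|   Regex_Count | 586.7 ns | 1.47 ns | 1.15 ns |  1.00 | 0.1221 |     768 B |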

But if I change Regex_Count to use Regex.Matches(Source, Value, RegexOptions.Compiled).Count, it becomes the fastest (about 350 ns).

That's really interesting and I can't wait to see the result on .NET 7.
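
One way to use the RegexOptions.Compiled variant outside a benchmark is to build the Regex once and reuse it. The sketch below is illustrative only: the LiteralCounter class is hypothetical, and it makes no claim about reproducing the timing above. It also passes the search text through Regex.Escape, on the assumption that the value should be matched literally. The benchmark passes Value straight in as a pattern, which is fine for "ete" but would change meaning for text containing regex metacharacters such as '.'.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper: builds a compiled regex for a literal once and reuses it.
public sealed class LiteralCounter
{
    private readonly Regex _regex;

    public LiteralCounter(string literal)
    {
        if (string.IsNullOrEmpty(literal))
            throw new ArgumentException("Literal must be non-empty.", nameof(literal));

        // Regex.Escape makes metacharacters match themselves, so "a.b"
        // matches only the text "a.b" and not "axb".
        _regex = new Regex(Regex.Escape(literal), RegexOptions.Compiled);
    }

    // Regex matches do not overlap, so this counts non-overlapping occurrences.
    public int Count(string source) => _regex.Matches(source).Count;
}
```

For example, new LiteralCounter("a.b").Count("a.b axb") counts one occurrence, whereas the unescaped pattern "a.b" would match both.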
