Skip to content

Conversation

@MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Dec 19, 2024

Fixes #110783

Adds 3 new FrozenSet imlpementations:

  • ByteFrozenSet - any set of bytes when using the default comparer
  • Latin1CharFrozenSet - set of chars <= 255 when using the default comparer
  • PerfectHashCharFrozenSet - any set of chars when using the default comparer

I extracted the lookup logic from ProbabilisticMapState in SearchValues to a searate helper so that there's practically no code duplication between the two.


Benchmark
public class Bench
{
    private static string Chars() => string.Concat(Enumerable.Range(0, 65536).Select(i => (char)i).Where(c => char.IsAscii(c) && char.IsSymbol(c)));
    private static readonly SearchValues<char> s_sv = SearchValues.Create(Chars());
    private static readonly FrozenSet<char> s_fs = FrozenSet.Create<char>(Chars());
    private static readonly string s_text = File.ReadAllText(@"C:\Users\mihaz\Downloads\gutenberg.org_files_1661_1661-0.txt");

    [Benchmark]
    public int CountSearchValues()
    {
        int count = 0;
        foreach (char c in s_text)
        {
            if (s_sv.Contains(c)) count++;
        }
        return count;
    }

    [Benchmark]
    public int CountFrozenSet()
    {
        int count = 0;
        foreach (char c in s_text)
        {
            if (s_fs.Contains(c)) count++;
        }
        return count;
    }
}

char.IsAscii && char.IsSymbol (Latin1CharFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 229.9 us 3.19 us 114 B
CountFrozenSet main 585.6 us 5.93 us 1.00 210 B
CountFrozenSet pr 228.7 us 2.73 us 0.39 108 B

char.IsSymbol (PerfectHashCharFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 265.4 us 3.34 us 102 B
CountFrozenSet main 1,508.3 us 20.19 us 1.00 141 B
CountFrozenSet pr 263.6 us 3.53 us 0.17 102 B

char.IsSymbol for bytes (ByteFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 193.5 us 2.60 us 74 B
CountFrozenSet main 1,529.5 us 22.44 us 1.00 131 B
CountFrozenSet pr 193.0 us 2.33 us 0.13 74 B

@MihaZupan MihaZupan added this to the 10.0.0 milestone Dec 19, 2024
@MihaZupan MihaZupan self-assigned this Dec 19, 2024
@MihaZupan
Copy link
Member Author

cc: @dotnet/area-system-collections (the bot was lazy)

@stephentoub
Copy link
Member

Can you point to some places where FrozenSet<byte> and FrozenSet<char> will be used?

@MihaZupan
Copy link
Member Author

Should be similar to SearchValues<byte>/SearchValues<char>, where you'd previously be using e.g. HashSet / string.Contains / bool[128].
Many uses would benefit from using SearchValues's ContainsAny though as they're just doing scalar loops around Contains themselves.

As a random example, here's one in Uri, though we could of course use SearchValues since we know the implementation:

internal static bool IsNotSafeForUnescape(char ch)

My main motivation was making the answer to "what should I use for Contains(T) checks" simply FrozenSet<T>, given that the interesting part of the implementation can be rather easily shared with SearchValues in this case.

@MihaZupan MihaZupan marked this pull request as draft January 28, 2025 14:01
@MihaZupan MihaZupan closed this Feb 8, 2025
@stephentoub
Copy link
Member

You decided it wasn't worthwhile?

Another option, if you think there is value in it, would be to only have the optimization on core and literally just wrap SearchValues.

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 8, 2025

I was intrigued by #111886 and wanted to play around with how that would look like for bitmap based FrozenSet, and a perfect hash based FrozenSet/FrozenDictionary.
Something like
main...MihaZupan:frozenset-bytechar3

That would merge ByteFrozenSet and Latin1CharFrozenSet from this PR into a single impl, even if it'd be a few cycles slower.

Not sure it's worth having the specialization for all the types though, even if generics makes it relatively "cheap" code-wise.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider improving FrozenSet for byte/char

4 participants