Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Alternative span/memory/string splitting API #76186

Closed
Tracked by #79053
stephentoub opened this issue Sep 26, 2022 · 27 comments · Fixed by #79048
Closed
Tracked by #79053

[API Proposal]: Alternative span/memory/string splitting API #76186

stephentoub opened this issue Sep 26, 2022 · 27 comments · Fixed by #79048
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Sep 26, 2022

Background and motivation

#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> in a way and stores the resulting ranges into a provided location, that then also works with Span<T>, {ReadOnly}Memory<T>, and String, that doesn't allocate, that automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}
  • Naming: we could just call these Spit{Any}, and have the "ranges" aspect of it be implicit in taking a Span<Range> parameter.
  • Argument ordering: destination, separator vs separator, destination? I went with destination, separator so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.
  • The methods all return the number of System.Range values written into destination. Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instance. Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
  • System.Range is unmanaged and can be stackalloc'd.
  • The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.

API Usage

Examples...

  1. https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(dependentParts, ',') != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);
  1. https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.Split(strLatLong, ',') != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0]], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1]], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }
  1. while ((zoneTabFileLine = sr.ReadLine()) != null)
    {
    if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
    {
    // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
    int firstTabIndex = zoneTabFileLine.IndexOf('\t');
    if (firstTabIndex >= 0)
    {
    int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
    if (secondTabIndex >= 0)
    {
    string timeZoneId;
    int startIndex = secondTabIndex + 1;
    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
    if (thirdTabIndex >= 0)
    {
    int length = thirdTabIndex - startIndex;
    timeZoneId = zoneTabFileLine.Substring(startIndex, length);
    }
    else
    {
    timeZoneId = zoneTabFileLine.Substring(startIndex);
    }
    if (!string.IsNullOrEmpty(timeZoneId))
    {
    timeZoneIds.Add(timeZoneId);
    }
    }
    }
    }
    }

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.SplitAsRanges(ranges, '\t');
                            if (found >= 3)
                            {
                                timeZoneId = zoneTabFileLine[ranges[3]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

#934 (comment)

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime labels Sep 26, 2022
@stephentoub stephentoub added this to the 8.0.0 milestone Sep 26, 2022
@ghost
Copy link

ghost commented Sep 26, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> in a way and stores the resulting ranges into a provided location, that then also works with Span<T>, {ReadOnly}Memory<T>, and String, that doesn't allocate, that automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, char separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separators, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;

+    public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAnyAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;

    // I'm not sure it's worth having the following (just to enable extension syntax on `Span<T>`), but we could for consistency.
    // My expection is the vast majority use will be with `ReadOnlySpan<char>` due to `string.AsSpan()` producing that.
+    public static int SplitAsRanges<T>(this Span<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAnyAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;
}
  • The methods all return the number of System.Range values written into destination.
  • System.Range is unmanaged and can be stackalloc'd.
  • Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instance.
  • Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
  • The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.

API Usage

Examples...

  1. https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(',', dependentParts) != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);
  1. https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.Split(',', strLatLong) != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }
  1. while ((zoneTabFileLine = sr.ReadLine()) != null)
    {
    if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
    {
    // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
    int firstTabIndex = zoneTabFileLine.IndexOf('\t');
    if (firstTabIndex >= 0)
    {
    int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
    if (secondTabIndex >= 0)
    {
    string timeZoneId;
    int startIndex = secondTabIndex + 1;
    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
    if (thirdTabIndex >= 0)
    {
    int length = thirdTabIndex - startIndex;
    timeZoneId = zoneTabFileLine.Substring(startIndex, length);
    }
    else
    {
    timeZoneId = zoneTabFileLine.Substring(startIndex);
    }
    if (!string.IsNullOrEmpty(timeZoneId))
    {
    timeZoneIds.Add(timeZoneId);
    }
    }
    }
    }
    }

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.SplitAsRanges('\t', ranges);
                            if (found >= 3)
                            {
                                timeZoneId = zoneTabFileLine[ranges[3]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

#934 (comment)

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Runtime

Milestone: 8.0.0

@iSazonov
Copy link
Contributor

iSazonov commented Sep 27, 2022

Looks very convenient! Thanks!

Please clarify, if we stackalloc N+1 will last element contains full tail?


Based on "the consumer knows how many split values are expected, wants to extract the Nth value, etc." and examples above maybe return an enum for more readability? E.g. return "Success", "Overflow", ... (and result count in out parameter if needed)?

@geeknoid
Copy link
Member

geeknoid commented Nov 3, 2022

@stephentoub The API shown above is close to what we have developed internally. But it's missing a variant which has proven popular, which is a callback approach. The callback eliminates the need to allocate any span and lets the code process things on-the-fly in a very efficient way.

Here's the full API surface we implemented:

public static partial class StringSplitExtensions
{
    // core overloads, matching in general combination of features what String.Split offers
    public static bool TrySplit(this ReadOnlySpan<char> input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // simple wrappers for the above intended to help IntelliSense discovery
    public static bool TrySplit(this string input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // a callback invoked as ranges are discovered in the input
    public delegate void SplitVisitor<T>(ReadOnlySpan<char> split, int rangeNum, T context);

    // similar set of overloads as above, but this time with callback semantics
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);

    // and once again for strings, to help IntelliSense discovery
    public static void VisitSplits<TContext>(this string input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
}

// makes working with the above cleaner for app code
public static class RangeExtensions
{
    public static ReadOnlySpan<char> AsSpan(this string value, Range range) => value.AsSpan(range.Index, range.Count);
    public static string Substring(this string value, Range range) => value.AsSpan(range).ToString();
    public static ReadOnlySpan<char> Slice(this ReadOnlySpan<char> value, Range range) => value.Slice(range.Index, range.Count);
}

@stephentoub
Copy link
Member Author

But it's missing a variant which has proven popular, which is a callback approach.

Why is a callback approach better than an enumeration approach ala #934 (comment)? With enumeration you're not constantly invoking delegates, you don't have to worry about complicated mechanisms to thread state into the delegate (or incur allocation from state captured implicitly), you have normal language semantics available around breaking out of the loop early if desired, etc.

@geeknoid
Copy link
Member

geeknoid commented Nov 4, 2022

True, I initially didn't consider the enumeration pattern. That does look better.

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 5, 2022
@stephentoub
Copy link
Member Author

stephentoub commented Nov 5, 2022

Please clarify, if we stackalloc N+1 will last element contains full tail?

Let's say you have the input "a,b,c" and a separator ,. My intent with the API is you'd get the following...

Calling this with dest.Length == 1, you'd get:
dest[0] == 0..5
returning 1.

Calling this with dest.Length == 2, you'd get:
dest[0] == 0..1
dest[1] == 2..5
returning 2.

Calling this with dest.Length == 3, you'd get:
dest[0] == 0..1
dest[1] == 2..3
dest[2] == 4..5
returning 3.

Calling this with dest.Length == 4, you'd get:
dest[0] == 0..1
dest[1] == 2..3
dest[2] == 4..5
dest[3] == unwritten
returning 3.

Basically exactly the same semantics as string.Split where count == dest.Length. This should help make it easier for code to switch from string.Split.

@stephentoub stephentoub self-assigned this Nov 6, 2022
@iSazonov
Copy link
Contributor

iSazonov commented Nov 6, 2022

Thanks for clarify! Look very good.
This immediately triggers the need for #934 in cases where we do not know how many split values are expected.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Nov 8, 2022

I have concerns about the examples called out in the original message. In both example (1) and example (2), I find the "after" code to be less readable than the "before" code.

(And in fact the "after" code for example (1) could stack overflow the process unless the stackalloc is moved to the top of the method, which means the declaration is significantly separated from its usage.)

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

Do we need an analyzer or something else here to help callers avoid bugs?

@stephentoub
Copy link
Member Author

stephentoub commented Nov 8, 2022

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

This ends up mapping exactly to the semantics of the existing string.Split when you provide a count. In fact, in my local branch for this, I just augmented all of the existing string.Split tests to also test the new method. So anywhere you provide a count to string.Split today, if you pass in new Range[count], that array will be filled in with ranges for exactly the same subsets as are returned in the new string[] array with the method today. If you pass in N+1 to string.Split today, you'll get back a string[] with Length == N, and with the new method, count == N.

I find the "after" code to be less readable than the "before" code.

It may be, but it also allocates much less. What would you recommend instead? This is as close as I could come to what we have today with string.Split while still being allocation-free and achieving all of the same scenarios where the expected number of parts is known in advance (for cases where there's no upper bound known on the number of parts, we can use an enumeration model as outlined in the other issue, but that's very difficult to work with for the common scenarios where you do know the number of parts in advance, want to validate the number of parts parsed, etc.)

@GrabYourPitchforks
Copy link
Member

What would you recommend instead?

I don't have a general-purpose recommendation appropriate for BCL, unfortunately. Like many others, I wrote my own helpers for use within my own projects, but they're opinionated. For example, here's a sketch of a helper I have which allows writing (var a, var b, ...) = theSpan.Split(','); and matching exactly the number of elements you request.

static SplitHelper MemoryExtensions.Split(this ROS<char> text, char separator);
ref struct SplitHelper
{
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c, out ROS<char> d);
    // ...
}

Since you're trying to support considerably more scenarios (variable number of matches, various failure conditions, etc.) these opinionated helpers won't really help.

But back to the issue at hand, we do have several APIs which take a destination buffer as an argument and populate that buffer to the best of its ability, returning an error if the total number of elements would overflow the buffer. Many of the APIs on MemoryExtensions do just that. My concern is that people might say to themselves "oh, this takes a destination buffer as a parameter, and I'm familiar with that existing pattern!" - expecting the API to fail on overrun, since that's the pattern most closely matched by the proposed API surface.

But the proposed API actually matches the behavior of string.Split, where tail elements are collapsed into the final returned element. This behavior makes sense for people accustomed to string.Split, but not for people accustomed to all the rest of the MemoryExtensions surface area.

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps. But unfortunately I don't have any good suggestions for you. Maybe the best answer really is just to ship it and ensure the docs are air-tight.

@stephentoub
Copy link
Member Author

Thanks.

Maybe the best answer really is just to ship it and ensure the docs are air-tight.

👍

@iSazonov
Copy link
Contributor

iSazonov commented Nov 8, 2022

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps.

My first thoughts were about Try...() patterns too but description of the proposal explicitly explains that it is overhead.

To reduce the chance of misunderstanding we could (1) make this completely consistent with #934, (2) by finding more informative names for methods and parameters.

For example, using "Range" in "SplitAsRange" looks superfluous since this information is already in the result type. Perhaps there is a way to briefly say that we are getting exactly the split we are asking for? Maybe SplitExactly?

@terrajobst
Copy link
Member

terrajobst commented Nov 29, 2022

Video

  • Let's drop the AsRanges() suffix
namespace System;

public static class MemoryExtensions
{
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 29, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 30, 2022
@davidfowl
Copy link
Member

Is there any reason we didn't consider byte overloads? Our utf8 plans have shifted from a separate type to just embracing byte in more APIs, feels like we should also add ReadOnlySpan<byte> overloads as well?

@stephentoub
Copy link
Member Author

We discussed it. General conclusion was we'd be fine adding such overloads subsequently if ASP.NET or others have demonstrated places where it'd be used and we'd include something in the signature (method name, parameter names, Encoding parameter, whatever) that indicated we're dealing with UTF8 text (as opposed to, for example, just having a generic overload that happened to work with bytes).

@iSazonov
Copy link
Contributor

iSazonov commented Dec 2, 2022

Make sense to consider using IndexOfAnyValues in the APIs?

@stephentoub
Copy link
Member Author

Make sense to consider using IndexOfAnyValues in the APIs?

I don't see how, but there's already a PR up; feel free to comment on if you think there's a missed opportunity.

@iSazonov
Copy link
Contributor

iSazonov commented Dec 2, 2022

Hmm, if classic Split now uses IndexOfAnyValues to cache probabilistic map why the API cannot?

@stephentoub
Copy link
Member Author

if classic Split now uses IndexOfAnyValues to cache probabilistic map

It doesn't. And in the PR you'll see the APIs are calling into the exact same functions that string.Split does to find the separators.

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Dec 6, 2022

It would be nice if the API made it easy to tell if there is a possibility of more ranges or the API walked to the end of the source. When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not. i.e. should the API return OperationStatus?

@stephentoub
Copy link
Member Author

When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not.

This is why for that use case the intent is to pass in a size one larger than the number of ranges you care about. That way, if it returns the desired number of ranges, you know there wasn't anything extraneous. You can see several examples of this in the PR providing the implementation here:
https://github.com/dotnet/runtime/pull/79048/files

@KrzysztofCwalina
Copy link
Member

I am talking about a situation when I don't know how many ranges I care about. I want to process all segments and I don't know how many segments there are. With the current APIs, I have to write something like the following

while(true) { 
    var count = source.Split(destination, ',');
    for (int i=0; i < count; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (count != size) {
        break;
    }
    source = source.Slice(destination[size - 1].End.Value + 1);
}

with OperationStatus-returning APIs:

while(true) { 
    var status = source.Split(destination, ',', out int written, out int read);
    for (int i = 0; i < written; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (status == OperationStatus.Done) break;
    source = source.Slice(read);
}

Having said that, I think the main help I had in mind is in the read out parameter. The source.Slice(destination[size - 1].End.Value + 1) slice statement took me some time to figure out.

@stephentoub
Copy link
Member Author

stephentoub commented Dec 6, 2022

I am talking about a situation when I don't know how many ranges I care about.

This API isn't for that. #934 is for that. Your example with the proposed API there then would be more like:

foreach (Range r in source.Split(','))
{
    Console.WriteLine(Encoding.UTF8.GetString(source[r]));
}

@terrajobst
Copy link
Member

Do we have an API to fix @KrzysztofCwalina bracing style? 😏

@KrzysztofCwalina
Copy link
Member

@terrajobst, I am just following the guidelines in FDG ed 2 page 364 section A.1.1. Should I not follow these? I not, then how do I [cherry] pick which guidelines to follow? :-)

@terrajobst
Copy link
Member

Do those guidelines use different bracing styles for for-loops and while-loops? 😱😳

@KrzysztofCwalina
Copy link
Member

Quoting the last guideline form page 364: "Do place the closing brace on its own line unless followed by an else, else if, or while statement." ... so the answer is yes.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants