[API Proposal]: Alternative span/memory/string splitting API #76186

stephentoub · 2022-09-26T14:16:22Z

Background and motivation

#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> in a way and stores the resulting ranges into a provided location, that then also works with Span<T>, {ReadOnly}Memory<T>, and String, that doesn't allocate, that automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}

Naming: we could just call these Spit{Any}, and have the "ranges" aspect of it be implicit in taking a Span<Range> parameter.
Argument ordering: destination, separator vs separator, destination? I went with destination, separator so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.
The methods all return the number of System.Range values written into destination. Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instance. Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
System.Range is unmanaged and can be stackalloc'd.
The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.

API Usage

Examples...

https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(dependentParts, ',') != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);

https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.Split(strLatLong, ',') != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0]], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1]], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

runtime/src/libraries/System.Private.CoreLib/src/System/TimeZoneInfo.Unix.NonAndroid.cs

Lines 86 to 118 in b092091

    
           while ((zoneTabFileLine = sr.ReadLine()) != null) 
        
           { 
        
               if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#') 
        
               { 
        
                   // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments" 
        
                   int firstTabIndex = zoneTabFileLine.IndexOf('\t'); 
        
                   if (firstTabIndex >= 0) 
        
                   { 
        
                       int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1); 
        
                       if (secondTabIndex >= 0) 
        
                       { 
        
                           string timeZoneId; 
        
                           int startIndex = secondTabIndex + 1; 
        
                           int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex); 
        
                           if (thirdTabIndex >= 0) 
        
                           { 
        
                               int length = thirdTabIndex - startIndex; 
        
                               timeZoneId = zoneTabFileLine.Substring(startIndex, length); 
        
                           } 
        
                           else 
        
                           { 
        
                               timeZoneId = zoneTabFileLine.Substring(startIndex); 
        
                           } 
        
                           if (!string.IsNullOrEmpty(timeZoneId)) 
        
                           { 
        
                               timeZoneIds.Add(timeZoneId); 
        
                           } 
        
                       } 
        
                   } 
        
               } 
        
           }

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.SplitAsRanges(ranges, '\t');
                            if (found >= 3)
                            {
                                timeZoneId = zoneTabFileLine[ranges[3]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

#934 (comment)

Risks

No response

The text was updated successfully, but these errors were encountered:

ghost · 2022-09-26T14:16:38Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> in a way and stores the resulting ranges into a provided location, that then also works with Span<T>, {ReadOnly}Memory<T>, and String, that doesn't allocate, that automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, char separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separators, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;

+    public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAnyAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;

    // I'm not sure it's worth having the following (just to enable extension syntax on `Span<T>`), but we could for consistency.
    // My expection is the vast majority use will be with `ReadOnlySpan<char>` due to `string.AsSpan()` producing that.
+    public static int SplitAsRanges<T>(this Span<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+    public static int SplitAnyAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;
}

The methods all return the number of System.Range values written into destination.
System.Range is unmanaged and can be stackalloc'd.
Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instance.
Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.

API Usage

Examples...

https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(',', dependentParts) != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);

https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.Split(',', strLatLong) != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

runtime/src/libraries/System.Private.CoreLib/src/System/TimeZoneInfo.Unix.NonAndroid.cs

Lines 86 to 118 in b092091

    
           while ((zoneTabFileLine = sr.ReadLine()) != null) 
        
           { 
        
               if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#') 
        
               { 
        
                   // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments" 
        
                   int firstTabIndex = zoneTabFileLine.IndexOf('\t'); 
        
                   if (firstTabIndex >= 0) 
        
                   { 
        
                       int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1); 
        
                       if (secondTabIndex >= 0) 
        
                       { 
        
                           string timeZoneId; 
        
                           int startIndex = secondTabIndex + 1; 
        
                           int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex); 
        
                           if (thirdTabIndex >= 0) 
        
                           { 
        
                               int length = thirdTabIndex - startIndex; 
        
                               timeZoneId = zoneTabFileLine.Substring(startIndex, length); 
        
                           } 
        
                           else 
        
                           { 
        
                               timeZoneId = zoneTabFileLine.Substring(startIndex); 
        
                           } 
        
                           if (!string.IsNullOrEmpty(timeZoneId)) 
        
                           { 
        
                               timeZoneIds.Add(timeZoneId); 
        
                           } 
        
                       } 
        
                   } 
        
               } 
        
           }

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.SplitAsRanges('\t', ranges);
                            if (found >= 3)
                            {
                                timeZoneId = zoneTabFileLine[ranges[3]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

#934 (comment)

Risks

No response

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime`
Milestone:	8.0.0

iSazonov · 2022-09-27T06:58:31Z

Looks very convenient! Thanks!

Please clarify, if we stackalloc N+1 will last element contains full tail?

Based on "the consumer knows how many split values are expected, wants to extract the Nth value, etc." and examples above maybe return an enum for more readability? E.g. return "Success", "Overflow", ... (and result count in out parameter if needed)?

geeknoid · 2022-11-03T23:06:54Z

@stephentoub The API shown above is close to what we have developed internally. But it's missing a variant which has proven popular, which is a callback approach. The callback eliminates the need to allocate any span and lets the code process things on-the-fly in a very efficient way.

Here's the full API surface we implemented:

public static partial class StringSplitExtensions
{
    // core overloads, matching in general combination of features what String.Split offers
    public static bool TrySplit(this ReadOnlySpan<char> input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // simple wrappers for the above intended to help IntelliSense discovery
    public static bool TrySplit(this string input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // a callback invoked as ranges are discovered in the input
    public delegate void SplitVisitor<T>(ReadOnlySpan<char> split, int rangeNum, T context);

    // similar set of overloads as above, but this time with callback semantics
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);

    // and once again for strings, to help IntelliSense discovery
    public static void VisitSplits<TContext>(this string input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
}

// makes working with the above cleaner for app code
public static class RangeExtensions
{
    public static ReadOnlySpan<char> AsSpan(this string value, Range range) => value.AsSpan(range.Index, range.Count);
    public static string Substring(this string value, Range range) => value.AsSpan(range).ToString();
    public static ReadOnlySpan<char> Slice(this ReadOnlySpan<char> value, Range range) => value.Slice(range.Index, range.Count);
}

stephentoub · 2022-11-04T00:36:45Z

But it's missing a variant which has proven popular, which is a callback approach.

Why is a callback approach better than an enumeration approach ala #934 (comment)? With enumeration you're not constantly invoking delegates, you don't have to worry about complicated mechanisms to thread state into the delegate (or incur allocation from state captured implicitly), you have normal language semantics available around breaking out of the loop early if desired, etc.

geeknoid · 2022-11-04T17:30:53Z

True, I initially didn't consider the enumeration pattern. That does look better.

stephentoub · 2022-11-05T20:57:10Z

Please clarify, if we stackalloc N+1 will last element contains full tail?

Let's say you have the input "a,b,c" and a separator ,. My intent with the API is you'd get the following...

Calling this with dest.Length == 1, you'd get:
dest[0] == 0..5
returning 1.

Calling this with dest.Length == 2, you'd get:
dest[0] == 0..1
dest[1] == 2..5
returning 2.

Calling this with dest.Length == 3, you'd get:
dest[0] == 0..1
dest[1] == 2..3
dest[2] == 4..5
returning 3.

Calling this with dest.Length == 4, you'd get:
dest[0] == 0..1
dest[1] == 2..3
dest[2] == 4..5
dest[3] == unwritten
returning 3.

Basically exactly the same semantics as string.Split where count == dest.Length. This should help make it easier for code to switch from string.Split.

iSazonov · 2022-11-06T10:12:30Z

Thanks for clarify! Look very good.
This immediately triggers the need for #934 in cases where we do not know how many split values are expected.

GrabYourPitchforks · 2022-11-08T01:50:20Z

I have concerns about the examples called out in the original message. In both example (1) and example (2), I find the "after" code to be less readable than the "before" code.

(And in fact the "after" code for example (1) could stack overflow the process unless the stackalloc is moved to the top of the method, which means the declaration is significantly separated from its usage.)

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

Do we need an analyzer or something else here to help callers avoid bugs?

stephentoub · 2022-11-08T01:58:57Z

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

This ends up mapping exactly to the semantics of the existing string.Split when you provide a count. In fact, in my local branch for this, I just augmented all of the existing string.Split tests to also test the new method. So anywhere you provide a count to string.Split today, if you pass in new Range[count], that array will be filled in with ranges for exactly the same subsets as are returned in the new string[] array with the method today. If you pass in N+1 to string.Split today, you'll get back a string[] with Length == N, and with the new method, count == N.

I find the "after" code to be less readable than the "before" code.

It may be, but it also allocates much less. What would you recommend instead? This is as close as I could come to what we have today with string.Split while still being allocation-free and achieving all of the same scenarios where the expected number of parts is known in advance (for cases where there's no upper bound known on the number of parts, we can use an enumeration model as outlined in the other issue, but that's very difficult to work with for the common scenarios where you do know the number of parts in advance, want to validate the number of parts parsed, etc.)

GrabYourPitchforks · 2022-11-08T02:20:13Z

What would you recommend instead?

I don't have a general-purpose recommendation appropriate for BCL, unfortunately. Like many others, I wrote my own helpers for use within my own projects, but they're opinionated. For example, here's a sketch of a helper I have which allows writing (var a, var b, ...) = theSpan.Split(','); and matching exactly the number of elements you request.

static SplitHelper MemoryExtensions.Split(this ROS<char> text, char separator);
ref struct SplitHelper
{
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c, out ROS<char> d);
    // ...
}

Since you're trying to support considerably more scenarios (variable number of matches, various failure conditions, etc.) these opinionated helpers won't really help.

But back to the issue at hand, we do have several APIs which take a destination buffer as an argument and populate that buffer to the best of its ability, returning an error if the total number of elements would overflow the buffer. Many of the APIs on MemoryExtensions do just that. My concern is that people might say to themselves "oh, this takes a destination buffer as a parameter, and I'm familiar with that existing pattern!" - expecting the API to fail on overrun, since that's the pattern most closely matched by the proposed API surface.

But the proposed API actually matches the behavior of string.Split, where tail elements are collapsed into the final returned element. This behavior makes sense for people accustomed to string.Split, but not for people accustomed to all the rest of the MemoryExtensions surface area.

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps. But unfortunately I don't have any good suggestions for you. Maybe the best answer really is just to ship it and ensure the docs are air-tight.

stephentoub · 2022-11-08T02:36:28Z

Thanks.

Maybe the best answer really is just to ship it and ensure the docs are air-tight.

👍

iSazonov · 2022-11-08T04:12:19Z

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps.

My first thoughts were about Try...() patterns too but description of the proposal explicitly explains that it is overhead.

To reduce the chance of misunderstanding we could (1) make this completely consistent with #934, (2) by finding more informative names for methods and parameters.

For example, using "Range" in "SplitAsRange" looks superfluous since this information is already in the result type. Perhaps there is a way to briefly say that we are getting exactly the split we are asking for? Maybe SplitExactly?

terrajobst · 2022-11-29T19:55:24Z

Video

Let's drop the AsRanges() suffix

namespace System;

public static class MemoryExtensions
{
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}

davidfowl · 2022-11-30T17:38:34Z

Is there any reason we didn't consider byte overloads? Our utf8 plans have shifted from a separate type to just embracing byte in more APIs, feels like we should also add ReadOnlySpan<byte> overloads as well?

stephentoub · 2022-11-30T17:56:56Z

We discussed it. General conclusion was we'd be fine adding such overloads subsequently if ASP.NET or others have demonstrated places where it'd be used and we'd include something in the signature (method name, parameter names, Encoding parameter, whatever) that indicated we're dealing with UTF8 text (as opposed to, for example, just having a generic overload that happened to work with bytes).

iSazonov · 2022-12-02T03:47:53Z

Make sense to consider using IndexOfAnyValues in the APIs?

stephentoub · 2022-12-02T03:49:34Z

Make sense to consider using IndexOfAnyValues in the APIs?

I don't see how, but there's already a PR up; feel free to comment on if you think there's a missed opportunity.

iSazonov · 2022-12-02T03:57:52Z

~~Hmm, if classic Split now uses IndexOfAnyValues to cache probabilistic map why the API cannot?~~

stephentoub · 2022-12-02T03:59:27Z

if classic Split now uses IndexOfAnyValues to cache probabilistic map

It doesn't. And in the PR you'll see the APIs are calling into the exact same functions that string.Split does to find the separators.

KrzysztofCwalina · 2022-12-06T17:02:16Z

It would be nice if the API made it easy to tell if there is a possibility of more ranges or the API walked to the end of the source. When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not. i.e. should the API return OperationStatus?

stephentoub · 2022-12-06T17:08:22Z

When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not.

This is why for that use case the intent is to pass in a size one larger than the number of ranges you care about. That way, if it returns the desired number of ranges, you know there wasn't anything extraneous. You can see several examples of this in the PR providing the implementation here:
https://github.com/dotnet/runtime/pull/79048/files

KrzysztofCwalina · 2022-12-06T18:12:10Z

I am talking about a situation when I don't know how many ranges I care about. I want to process all segments and I don't know how many segments there are. With the current APIs, I have to write something like the following

while(true) { 
    var count = source.Split(destination, ',');
    for (int i=0; i < count; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (count != size) {
        break;
    }
    source = source.Slice(destination[size - 1].End.Value + 1);
}

with OperationStatus-returning APIs:

while(true) { 
    var status = source.Split(destination, ',', out int written, out int read);
    for (int i = 0; i < written; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (status == OperationStatus.Done) break;
    source = source.Slice(read);
}

Having said that, I think the main help I had in mind is in the read out parameter. The source.Slice(destination[size - 1].End.Value + 1) slice statement took me some time to figure out.

stephentoub · 2022-12-06T18:13:09Z

I am talking about a situation when I don't know how many ranges I care about.

This API isn't for that. #934 is for that. Your example with the proposed API there then would be more like:

foreach (Range r in source.Split(','))
{
    Console.WriteLine(Encoding.UTF8.GetString(source[r]));
}

terrajobst · 2022-12-06T23:01:25Z

Do we have an API to fix @KrzysztofCwalina bracing style? 😏

KrzysztofCwalina · 2022-12-07T16:48:01Z

@terrajobst, I am just following the guidelines in FDG ed 2 page 364 section A.1.1. Should I not follow these? I not, then how do I [cherry] pick which guidelines to follow? :-)

terrajobst · 2022-12-07T16:49:49Z

Do those guidelines use different bracing styles for for-loops and while-loops? 😱😳

KrzysztofCwalina · 2022-12-07T17:14:08Z

Quoting the last guideline form page 364: "Do place the closing brace on its own line unless followed by an else, else if, or while statement." ... so the answer is yes.

stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime labels Sep 26, 2022

stephentoub added this to the 8.0.0 milestone Sep 26, 2022

neon-sunset mentioned this issue Sep 28, 2022

[API Proposal]: SplitFirst/SplitLast methods for strings and (ReadOnly)Span/Memory<T> slices #75317

Open

stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 5, 2022

stephentoub self-assigned this Nov 6, 2022

tarekgh mentioned this issue Nov 22, 2022

Libraries work to support Cloud Native effort for .NET 8 #78518

Closed

61 tasks

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 29, 2022

bbartels mentioned this issue Nov 30, 2022

Add 'split' support for ReadOnlySpan<char> similar to string #934

Closed

stephentoub mentioned this issue Nov 30, 2022

Add MemoryExtensions.Split #79048

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 30, 2022

dakersnar mentioned this issue Nov 30, 2022

System.Runtime work planned for .NET 8 #79053

Closed

31 tasks

stephentoub closed this as completed in #79048 Dec 8, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 8, 2022

ghost locked as resolved and limited conversation to collaborators Jan 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: Alternative span/memory/string splitting API #76186

[API Proposal]: Alternative span/memory/string splitting API #76186

stephentoub commented Sep 26, 2022 •

edited

Loading

ghost commented Sep 26, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

iSazonov commented Sep 27, 2022 •

edited

Loading

geeknoid commented Nov 3, 2022

stephentoub commented Nov 4, 2022

geeknoid commented Nov 4, 2022

stephentoub commented Nov 5, 2022 •

edited

Loading

iSazonov commented Nov 6, 2022

GrabYourPitchforks commented Nov 8, 2022 •

edited

Loading

stephentoub commented Nov 8, 2022 •

edited

Loading

GrabYourPitchforks commented Nov 8, 2022

stephentoub commented Nov 8, 2022

iSazonov commented Nov 8, 2022

terrajobst commented Nov 29, 2022 •

edited by dotnet-api-review bot

Loading

davidfowl commented Nov 30, 2022

stephentoub commented Nov 30, 2022

iSazonov commented Dec 2, 2022

stephentoub commented Dec 2, 2022

iSazonov commented Dec 2, 2022 •

edited

Loading

stephentoub commented Dec 2, 2022

KrzysztofCwalina commented Dec 6, 2022 •

edited

Loading

stephentoub commented Dec 6, 2022

KrzysztofCwalina commented Dec 6, 2022

stephentoub commented Dec 6, 2022 •

edited

Loading

terrajobst commented Dec 6, 2022

KrzysztofCwalina commented Dec 7, 2022

terrajobst commented Dec 7, 2022

KrzysztofCwalina commented Dec 7, 2022

[API Proposal]: Alternative span/memory/string splitting API #76186

[API Proposal]: Alternative span/memory/string splitting API #76186

Comments

stephentoub commented Sep 26, 2022 • edited Loading

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

ghost commented Sep 26, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

iSazonov commented Sep 27, 2022 • edited Loading

geeknoid commented Nov 3, 2022

stephentoub commented Nov 4, 2022

geeknoid commented Nov 4, 2022

stephentoub commented Nov 5, 2022 • edited Loading

iSazonov commented Nov 6, 2022

GrabYourPitchforks commented Nov 8, 2022 • edited Loading

stephentoub commented Nov 8, 2022 • edited Loading

GrabYourPitchforks commented Nov 8, 2022

stephentoub commented Nov 8, 2022

iSazonov commented Nov 8, 2022

terrajobst commented Nov 29, 2022 • edited by dotnet-api-review bot Loading

davidfowl commented Nov 30, 2022

stephentoub commented Nov 30, 2022

iSazonov commented Dec 2, 2022

stephentoub commented Dec 2, 2022

iSazonov commented Dec 2, 2022 • edited Loading

stephentoub commented Dec 2, 2022

KrzysztofCwalina commented Dec 6, 2022 • edited Loading

stephentoub commented Dec 6, 2022

KrzysztofCwalina commented Dec 6, 2022

stephentoub commented Dec 6, 2022 • edited Loading

terrajobst commented Dec 6, 2022

KrzysztofCwalina commented Dec 7, 2022

terrajobst commented Dec 7, 2022

KrzysztofCwalina commented Dec 7, 2022

stephentoub commented Sep 26, 2022 •

edited

Loading

iSazonov commented Sep 27, 2022 •

edited

Loading

stephentoub commented Nov 5, 2022 •

edited

Loading

GrabYourPitchforks commented Nov 8, 2022 •

edited

Loading

stephentoub commented Nov 8, 2022 •

edited

Loading

terrajobst commented Nov 29, 2022 •

edited by dotnet-api-review bot

Loading

iSazonov commented Dec 2, 2022 •

edited

Loading

KrzysztofCwalina commented Dec 6, 2022 •

edited

Loading

stephentoub commented Dec 6, 2022 •

edited

Loading