-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API Proposal]: Alternative span/memory/string splitting API #76186
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsBackground and motivation#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc. Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over API Proposalnamespace System;
public static class MemoryExtensions
{
+ public static int SplitAsRanges(this ReadOnlySpan<char> source, char separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separator, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, ReadOnlySpan<char> separators, StringSplitOptions options, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAnyAsRanges<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;
// I'm not sure it's worth having the following (just to enable extension syntax on `Span<T>`), but we could for consistency.
// My expection is the vast majority use will be with `ReadOnlySpan<char>` due to `string.AsSpan()` producing that.
+ public static int SplitAsRanges<T>(this Span<T> source, T separator, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separator, Span<Range> destination) where T : IEquatable?;
+ public static int SplitAnyAsRanges<T>(this Span<T> source, ReadOnlySpan<T> separators, Span<Range> destination) where T : IEquatable?;
}
API UsageExamples... Instead of: string[] dependentParts = dependent.Split(',');
if (dependentParts.Length != 3)
{
Log?.LogMessage($"Skipping dependent: {dependent}");
continue;
}
try
{
SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]); this code could be: Span<Range> dependentParts = stackalloc Range[4];
...
if (dependent.AsSpan().SplitAsRanges(',', dependentParts) != 3)
{
Log?.LogMessage($"Skipping dependent: {dependent}");
continue;
}
try
{
SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]); Instead of: var strLatLong = Uri.Substring(4).Split(',');
if (strLatLong.Length != 2)
{
throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
}
try
{
_latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
_longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
}
catch (Exception ex) when (ex is FormatException || ex is OverflowException)
{
throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
} this could be: Span<Range> strLatLong = stackalloc Range[3];
ReadOnlySpan<char> span = Uri.AsSpan(4);
if (span.Split(',', strLatLong) != 2)
{
throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
}
try
{
_latitude = double.Parse(span[strLatLong[0], provider: CultureInfo.InvariantCulture);
_longitude = double.Parse(span[strLatLong[1], provider: CultureInfo.InvariantCulture);
}
catch (Exception ex) when (ex is FormatException || ex is OverflowException)
{
throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
}
Instead of: while ((zoneTabFileLine = sr.ReadLine()) != null)
{
if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
{
// the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
int firstTabIndex = zoneTabFileLine.IndexOf('\t');
if (firstTabIndex >= 0)
{
int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
if (secondTabIndex >= 0)
{
string timeZoneId;
int startIndex = secondTabIndex + 1;
int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
if (thirdTabIndex >= 0)
{
int length = thirdTabIndex - startIndex;
timeZoneId = zoneTabFileLine.Substring(startIndex, length);
}
else
{
timeZoneId = zoneTabFileLine.Substring(startIndex);
}
if (!string.IsNullOrEmpty(timeZoneId))
{
timeZoneIds.Add(timeZoneId);
}
}
}
}
} this could be: Span<Range> ranges = stackalloc Range[4];
while ((zoneTabFileLine = sr.ReadLine()) != null)
{
if (zoneTabFileLine.StartsWith('#'))
{
// the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
int found = zoneTabFileLine.SplitAsRanges('\t', ranges);
if (found >= 3)
{
timeZoneId = zoneTabFileLine[ranges[3]];
if (timeZoneId.Length != 0)
{
timeZoneIds.Add(timeZoneId);
}
}
}
} Alternative DesignsRisksNo response
|
Looks very convenient! Thanks! Please clarify, if we stackalloc N+1 will last element contains full tail? Based on "the consumer knows how many split values are expected, wants to extract the Nth value, etc." and examples above maybe return an enum for more readability? E.g. return "Success", "Overflow", ... (and result count in out parameter if needed)? |
@stephentoub The API shown above is close to what we have developed internally. But it's missing a variant which has proven popular, which is a callback approach. The callback eliminates the need to allocate any span and lets the code process things on-the-fly in a very efficient way. Here's the full API surface we implemented: public static partial class StringSplitExtensions
{
// core overloads, matching in general combination of features what String.Split offers
public static bool TrySplit(this ReadOnlySpan<char> input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
// simple wrappers for the above intended to help IntelliSense discovery
public static bool TrySplit(this string input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
// a callback invoked as ranges are discovered in the input
public delegate void SplitVisitor<T>(ReadOnlySpan<char> split, int rangeNum, T context);
// similar set of overloads as above, but this time with callback semantics
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
// and once again for strings, to help IntelliSense discovery
public static void VisitSplits<TContext>(this string input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this string input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this string input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>( this string input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>( this string input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
}
// makes working with the above cleaner for app code
public static class RangeExtensions
{
public static ReadOnlySpan<char> AsSpan(this string value, Range range) => value.AsSpan(range.Index, range.Count);
public static string Substring(this string value, Range range) => value.AsSpan(range).ToString();
public static ReadOnlySpan<char> Slice(this ReadOnlySpan<char> value, Range range) => value.Slice(range.Index, range.Count);
} |
Why is a callback approach better than an enumeration approach ala #934 (comment)? With enumeration you're not constantly invoking delegates, you don't have to worry about complicated mechanisms to thread state into the delegate (or incur allocation from state captured implicitly), you have normal language semantics available around breaking out of the loop early if desired, etc. |
True, I initially didn't consider the enumeration pattern. That does look better. |
Let's say you have the input Calling this with dest.Length == 1, you'd get: Calling this with dest.Length == 2, you'd get: Calling this with dest.Length == 3, you'd get: Calling this with dest.Length == 4, you'd get: Basically exactly the same semantics as string.Split where count == dest.Length. This should help make it easier for code to switch from string.Split. |
Thanks for clarify! Look very good. |
I have concerns about the examples called out in the original message. In both example (1) and example (2), I find the "after" code to be less readable than the "before" code. (And in fact the "after" code for example (1) could stack overflow the process unless the stackalloc is moved to the top of the method, which means the declaration is significantly separated from its usage.) This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code. Do we need an analyzer or something else here to help callers avoid bugs? |
This ends up mapping exactly to the semantics of the existing string.Split when you provide a count. In fact, in my local branch for this, I just augmented all of the existing string.Split tests to also test the new method. So anywhere you provide a count to string.Split today, if you pass in
It may be, but it also allocates much less. What would you recommend instead? This is as close as I could come to what we have today with string.Split while still being allocation-free and achieving all of the same scenarios where the expected number of parts is known in advance (for cases where there's no upper bound known on the number of parts, we can use an enumeration model as outlined in the other issue, but that's very difficult to work with for the common scenarios where you do know the number of parts in advance, want to validate the number of parts parsed, etc.) |
I don't have a general-purpose recommendation appropriate for BCL, unfortunately. Like many others, I wrote my own helpers for use within my own projects, but they're opinionated. For example, here's a sketch of a helper I have which allows writing static SplitHelper MemoryExtensions.Split(this ROS<char> text, char separator);
ref struct SplitHelper
{
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b);
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c);
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c, out ROS<char> d);
// ...
} Since you're trying to support considerably more scenarios (variable number of matches, various failure conditions, etc.) these opinionated helpers won't really help. But back to the issue at hand, we do have several APIs which take a destination buffer as an argument and populate that buffer to the best of its ability, returning an error if the total number of elements would overflow the buffer. Many of the APIs on But the proposed API actually matches the behavior of This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps. But unfortunately I don't have any good suggestions for you. Maybe the best answer really is just to ship it and ensure the docs are air-tight. |
Thanks.
👍 |
My first thoughts were about To reduce the chance of misunderstanding we could (1) make this completely consistent with #934, (2) by finding more informative names for methods and parameters. For example, using "Range" in "SplitAsRange" looks superfluous since this information is already in the result type. Perhaps there is a way to briefly say that we are getting exactly the split we are asking for? Maybe |
namespace System;
public static class MemoryExtensions
{
public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
} |
Is there any reason we didn't consider byte overloads? Our utf8 plans have shifted from a separate type to just embracing byte in more APIs, feels like we should also add |
We discussed it. General conclusion was we'd be fine adding such overloads subsequently if ASP.NET or others have demonstrated places where it'd be used and we'd include something in the signature (method name, parameter names, Encoding parameter, whatever) that indicated we're dealing with UTF8 text (as opposed to, for example, just having a generic overload that happened to work with bytes). |
Make sense to consider using IndexOfAnyValues in the APIs? |
I don't see how, but there's already a PR up; feel free to comment on if you think there's a missed opportunity. |
|
It doesn't. And in the PR you'll see the APIs are calling into the exact same functions that string.Split does to find the separators. |
It would be nice if the API made it easy to tell if there is a possibility of more ranges or the API walked to the end of the source. When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not. i.e. should the API return OperationStatus? |
This is why for that use case the intent is to pass in a size one larger than the number of ranges you care about. That way, if it returns the desired number of ranges, you know there wasn't anything extraneous. You can see several examples of this in the PR providing the implementation here: |
I am talking about a situation when I don't know how many ranges I care about. I want to process all segments and I don't know how many segments there are. With the current APIs, I have to write something like the following while(true) {
var count = source.Split(destination, ',');
for (int i=0; i < count; i++)
{
Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
}
if (count != size) {
break;
}
source = source.Slice(destination[size - 1].End.Value + 1);
} with OperationStatus-returning APIs: while(true) {
var status = source.Split(destination, ',', out int written, out int read);
for (int i = 0; i < written; i++)
{
Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
}
if (status == OperationStatus.Done) break;
source = source.Slice(read);
} Having said that, I think the main help I had in mind is in the |
This API isn't for that. #934 is for that. Your example with the proposed API there then would be more like: foreach (Range r in source.Split(','))
{
Console.WriteLine(Encoding.UTF8.GetString(source[r]));
} |
Do we have an API to fix @KrzysztofCwalina bracing style? 😏 |
@terrajobst, I am just following the guidelines in FDG ed 2 page 364 section A.1.1. Should I not follow these? I not, then how do I [cherry] pick which guidelines to follow? :-) |
Do those guidelines use different bracing styles for for-loops and while-loops? 😱😳 |
Quoting the last guideline form page 364: "Do place the closing brace on its own line unless followed by an else, else if, or while statement." ... so the answer is yes. |
Background and motivation
#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.
Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over
ReadOnlySpan<T>
in a way and stores the resulting ranges into a provided location, that then also works withSpan<T>
,{ReadOnly}Memory<T>
, andString
, that doesn't allocate, that automates the retrieval of N values, etc.API Proposal
Spit{Any}
, and have the "ranges" aspect of it be implicit in taking aSpan<Range>
parameter.destination, separator
vsseparator, destination
? I went withdestination, separator
so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.System.Range
values written into destination. Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of NRange
instance. Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1Range
instances, and validate that the returned count was N.System.Range
isunmanaged
and can be stackalloc'd.Range
instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.API Usage
Examples...
Instead of:
this code could be:
Instead of:
this could be:
runtime/src/libraries/System.Private.CoreLib/src/System/TimeZoneInfo.Unix.NonAndroid.cs
Lines 86 to 118 in b092091
Instead of:
this could be:
Alternative Designs
#934 (comment)
Risks
No response
The text was updated successfully, but these errors were encountered: