String comparer for sorting numeric strings logically #13979
Comments
That's a good suggestion. Would you be willing to make an API proposal? |
I think conceptually this is a good idea, but there are a lot of details that need to be worked out. It seems like such a comparer might want to take options which control how it behaves. For example, are commas in numbers ok? Spaces? Decimal points? How are different cultures handled? |
While it's easy to define the behavior on Windows systems as following
I believe you could define a reasonable API proposal in terms of a lexicographical ordering of substrings divided at numeric boundaries. Sequences of non-digits would be compared using |
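The idea of ordering substrings split at numeric boundaries can be sketched as below. This is only an illustration under stated assumptions: it treats only ASCII digits 0-9 as numeric, compares non-digit runs ordinally, and ignores leading zeros. The helper names `NaturalCompare`, `ChunkEnd`, `CompareDigitRuns`, and `IsAsciiDigit` are hypothetical and not part of any BCL API.

```csharp
using System;
using System.Collections.Generic;

var names = new List<string> { "img12.png", "img10.png", "img2.png", "img1.png" };
names.Sort(NaturalCompare);
Console.WriteLine(string.Join(", ", names)); // img1.png, img2.png, img10.png, img12.png

// Hypothetical helper: walks both strings chunk by chunk, where a chunk is a
// maximal run of ASCII digits or a maximal run of non-digits.
static int NaturalCompare(string x, string y)
{
    int i = 0, j = 0;
    while (i < x.Length && j < y.Length)
    {
        int iEnd = ChunkEnd(x, i), jEnd = ChunkEnd(y, j);
        ReadOnlySpan<char> a = x.AsSpan(i, iEnd - i);
        ReadOnlySpan<char> b = y.AsSpan(j, jEnd - j);

        // Two digit runs are compared by value; anything else is compared ordinally.
        int result = IsAsciiDigit(a[0]) && IsAsciiDigit(b[0])
            ? CompareDigitRuns(a, b)
            : a.SequenceCompareTo(b);
        if (result != 0) return result;

        i = iEnd;
        j = jEnd;
    }
    // One string is exhausted: the one with characters left sorts last.
    return (x.Length - i).CompareTo(y.Length - j);
}

// Returns the index just past the run (digits or non-digits) starting at 'start'.
static int ChunkEnd(string s, int start)
{
    bool digit = IsAsciiDigit(s[start]);
    int end = start + 1;
    while (end < s.Length && IsAsciiDigit(s[end]) == digit) end++;
    return end;
}

// Compares digit runs without parsing, so arbitrarily long runs cannot overflow:
// after dropping leading zeros, the longer run is the larger number.
static int CompareDigitRuns(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
{
    a = a.TrimStart('0');
    b = b.TrimStart('0');
    if (a.Length != b.Length) return a.Length.CompareTo(b.Length);
    return a.SequenceCompareTo(b);
}

static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';
```

Comparing digit runs by length rather than parsing them avoids the overflow question entirely, at the cost of treating "007" and "7" as equal.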
The comparer should require an IFormatProvider or NumberFormatInfo; which can be passed in the constructor or it should use the current one from the owning thread. @terrajobst This is a problem I am facing in almost every business application: proper ordering even if the strings contain numbers. I don't have time to make a PR or document it well, but I would be very happy to see it in the new releases of the core framework. I think the use case is clear enough. |
This is a need that I have often faced. If @Peter-Juhasz does not have the time to create the API proposal or implement it, I would be interested in taking it on. |
@Peter-Juhasz There is probably no need to consider the
If On a side note, even if used there is no need to explicitly provide the |
It looks like on Windows, CompareStringEx can be used with the flag SORT_DIGITSASNUMBERS. I didn't check whether ICU supports such functionality too. |
I believe there are a few things which need to be worked on here:
I think this would be a nice addition but it needs some work. |
I don't think we need to do that. It would just be an option (or flag) passed with the CompareOptions, letting the underlying OS handle it.
I agree with that |
I'd see value in both an OS-dependent sort for cases where you want to match the OS (file listing, for example), and an OS-independent .NET sort that natively understands both numbers and dates. A real life example is needing month names to be correctly sorted, even when that's the only date-related part of the string. The OS-dependent sort should be super simple to implement. After that's done I think it would still be worth putting design time in towards an independent managed sort as well, to figure out if and how people's needs differ. |
Rationale
For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10". Using the Ordinal StringComparer to sort the list one would get
but the desired ascending logical sort would be
Proposed API
namespace System {
    public class StringComparer {
+       public static StringComparer Create(CultureInfo culture, CompareOptions options);
+       public static StringComparer Logical { get; }
+       public static StringComparer LogicalIgnoreCase { get; }
    }
}
namespace System.Globalization {
    public enum CompareOptions {
+       Logical = 0x00000020
    }
}
Usage
var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"
This would also be good for sorting strings containing IP addresses.
Details
Open Questions
Updates
|
We are limited to the functionality the underlying OS provides. We need to investigate whether both Windows and Linux can provide this before we proceed here. |
I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET. |
This is not easy to do, especially since we don't carry the collation tables needed for it. If it is just ASCII characters, that is possible. For more complicated scenarios it will be challenging unless you parse the strings and split them, which will not perform well. In short, linguistic comparisons will be challenging. |
You're probably right about the performance. I'm currently using an implementation that splits the string which works for me but I understand it may not be quick enough for the framework due to the string allocations. |
It is not about the string allocations; we can pin the strings and avoid the allocations. It is about parsing the strings, comparing each part, and then handling the digit comparisons. Note that such functionality has to support digits for all languages too, not just 0~9. |
Ugh, supporting digits for all languages would be a pain as you'd have to go by the After looking at my implementation I think |
For some scenarios I'd prefer an OS-dependent logical sort, for others an invariant logical sort.
var comparer = StringComparer.TryGetOSLogical()
    ?? StringComparer.CreateLogical(StringComparer.CurrentCulture);
I want to match the feel of the native OS if possible. |
We don't have to use NumberFormatInfo.NativeDigits, because that would make it work only with a specific culture, while digits in all languages should just work regardless of the chosen culture, because digits don't have linguistic context (in most cases). char.IsDigit is not helpful either, because you need to know the value of the digit, not just check whether it is a digit, to support the needed functionality. I am not trying to push back here; I am just explaining the complexity we'll have if we need to do it without OS help.
I hope StringComparer.TryGetOSLogical() would exist and work all the time, but the fallback logic will create other issues, because the behavior will differ depending on whether the OS supports it. This may be OK, but we have to understand that. |
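To make the difference between "is a digit" and "has this digit value" concrete, here is a small sketch using two existing BCL methods, char.IsDigit and char.GetNumericValue. The three sample characters are chosen here purely for illustration; this is not a statement about how the runtime would implement the feature.

```csharp
using System;

// '3' (ASCII), U+0663 (Arabic-Indic digit three), U+0969 (Devanagari digit three).
char[] samples = { '3', '\u0663', '\u0969' };

foreach (char c in samples)
{
    // IsDigit only answers yes/no; GetNumericValue returns the numeric value as a double.
    Console.WriteLine($"U+{(int)c:X4} IsDigit={char.IsDigit(c)} Value={char.GetNumericValue(c)}");
}
```

All three characters are decimal digits with the value 3, so a comparer that only recognizes 0~9 would treat the last two as ordinary text.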
I understand and thank you for going through the complexities with me. In my implementation I |
|
If |
Because the functionality we are talking about concerns collation, which supports all native digits. There are also requests to support native digits in the formatting and parsing APIs, but we haven't gotten to that yet. In short, if we support only the digits 0~9, sooner or later we'll get requests to support native digits, so it would be good to have a solid plan for that support before we start. |
|
@rixtech, some |
Has the interaction with various Also, are leading zeros supposed to receive any special treatment? For example, does |
@madelson when we implement this feature (with ICU), we'll use UCOL_NUMERIC_COLLATION. This enables numeric sorting of digits only (no separators, symbols, or negative signs are supported).
Yes, these 2 strings will be equal. In short, we are not going to perform this numeric sorting manually, instead we'll use the underlying libraries (ICU or NLS) to do the job. |
voting for that |
This issue has been open since 2015 so assuming it won't get done? I'm using NaturalStringExtensions for now which seems to work well. |
Wow, it has been 8 years since this issue was first opened. What's the status on this? So the current status is that we have these APIs approved:
namespace System {
    public class StringComparer {
+       public static StringComparer Create(CultureInfo culture, CompareOptions options);
    }
}
namespace System.Globalization {
    public enum CompareOptions {
+       NumericOrdering = 0x00000020
    }
}
and from what I see we decided that we'll just use the default sorting available in the globalisation library (either ICU or NLS) without implementing it ourselves, at least for now.... Are there any other things we might be missing that prevent this from going forward? |
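For reference, a minimal usage sketch of the approved shape quoted above. It assumes a runtime in which `CompareOptions.NumericOrdering` is actually available; on older runtimes it will not compile. The choice of `CultureInfo.InvariantCulture` is an assumption made only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Assumption: CompareOptions.NumericOrdering exists in the target runtime.
StringComparer numeric = StringComparer.Create(
    CultureInfo.InvariantCulture, CompareOptions.NumericOrdering);

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(numeric);

// The proposal's intended result is "Windows 7" before "Windows 10".
Console.WriteLine(string.Join(", ", list));
```

Because the comparison is delegated to the underlying globalization library, details such as the treatment of leading zeros follow that library rather than anything in this snippet.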
Just needs someone with time to implement it. There's been tons of other work and this hasn't bubbled up such that I've been able to do it myself yet. |
I haven't touched the dotnet/runtime repo for quite a while but I'm willing to give it a go if that's the only thing blocking it from happening. I might even have something I was working on a while ago (although it was years ago and I imagine it'll need to be torn down to fit what we have now...). |
I hope this will be implemented one day! This is a critical lacking feature. |
|
Thanks, but is this usable in PowerShell, for example, or only in C#? |
powershell cmdlet
# TODO: optimize without copying content
# TODO: support for direct calls: Sort-Naturally "img12.png", "img10.png", "img2.png", "img1.png"
function Sort-Naturally
{
    param
    (
        [Parameter(Mandatory, ValueFromPipeline)]
        [string[]]
        $input
    )
    begin
    {
        $version = "4.3.0"
        $temp = $([System.IO.Path]::GetTempPath())
        $dllPath = Join-Path "$temp" "nsext$version/lib/net6.0/NaturalSort.Extension.dll"
        if (! (Test-Path "$dllPath"))
        {
            $dest = "$temp/nsext$version"
            Invoke-WebRequest https://www.nuget.org/api/v2/package/NaturalSort.Extension/$version -Out "$dest.zip" *> $null
            New-Item -ItemType Directory "$dest" *> $null
            Expand-Archive "$dest.zip" -DestinationPath "$dest" *> $null
        }
        [System.Reflection.Assembly]::LoadFrom("$dllPath") *> $null
        $natural=[NaturalSort.Extension.NaturalSortExtension]::WithNaturalSort([System.StringComparison]::OrdinalIgnoreCase)
        $list = [System.Collections.Generic.List[string]]@()
    }
    process
    {
        foreach ($value in $input)
        {
            $list.Add($value) *> $null
        }
    }
    end
    {
        [System.Linq.Enumerable]::OrderBy($list, [Func[string,string]]{ param($x) $x }, $natural)
    }
}
test
> "img12.png", "img10.png", "img2.png", "img1.png" | Sort-Naturally
img1.png
img2.png
img10.png
img12.png |
@kasperk81 Thank you very much for this!!! |
@kasperk81 Unfortunately, the implementation has a Unicode bug, when compared to I'd argue that it would make a lot of sense to have this added to the BCL to prevent others falling into the same Unicode traps over and over again. Also, googling for this yields a lot of different approaches that all have similar bugs or are just badly implemented. If there were a comparer provided by the BCL it could be highly optimized and it would quickly become the first search result and people would just use that. I agree with @sharwell that using |
"os sort" and "natural sort" are not the same exact concepts. python https://github.com/SethMMorton/natsort library gives the same output as @tompazourek's dotnet library. i don't know of a sorting implementation in any language that behaves like |
It's a bug when certain characters that are seen as "natural" numbers in some cultures and even recognized by the Unicode Standard are not treated as such, imho. Similarly, as a native speaker of German, I would be annoyed if Ä was sorted after Z instead of after A, just because the guy implementing the "natural" sort didn't see it as natural or didn't know about it. |
"how does it work in other languages on various platforms?" has higher precedence for dotnet than win32 behavior |
Yes, as I already pointed out
I am solely interested in getting proper Unicode support and the Win API does a better job at that than the alternatives I found. At this point for me it's only about Unicode and nothing else. |
Could you submit that support to that open source library you reported the bug to? |
I think it's my library and this issue: tompazourek/NaturalSort.Extension#74 I just added the Unicode support. |
@tompazourek thanks very much! :) |
@tannergooding @tarekgh ICU is now well integrated into Runtime and probably most of the work has already been done. Is there a chance to include this in .Net 9? |
@tannergooding @tarekgh I would like to take getting this over the finish line... are the API reviews current and approved still? |
It would be risky to do this now. Let's plan to do it early next cycle. |
Assigning to @PranavSenthilnathan to try to get this in for .NET 10 Preview 1. |
Rationale
For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10". Using the Ordinal StringComparer to sort the list one would get
but the desired ascending logical sort would be
Proposed API
Usage
This would also be good for sorting strings containing IP addresses.
Details
Logical is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
LogicalIgnoreCase is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
Char.IsDigit.
Char.GetNumericValue.
ulongs. Logic for overflows will have to be considered.
Windows 8.1 would be considered 4 sequences. The Windows would be a string sequence, the 8 would be a numeric sequence, the . would be another string sequence, and the 1 would be another numeric sequence.
NumberStyles parameter.
"a", "7" the number 7 will be sorted before the letter a.
CompareOptions parameter as input will need to be updated to support the new Logical member.
Open Questions
CompareOptions.Logical be implemented as the flag option SORT_DIGITSASNUMBERS to the dwCmpFlags parameter of CompareStringEx? Using its implementation should be more efficient but later expanding support for NumberStyles will require a re-implementation with matching behavior.
Updates
Logical and LogicalIgnoreCase properties.
CreateLogical overloads to match the Create method.
NumberFormatInfo from the StringComparer parameter when not explicitly provided and is a CultureAwareComparer.
CreateLogical overloads that matched the Create method.
CompareOptions.Logical and changed CreateLogical to be just an overload of Create.