Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String comparer for sorting numeric strings logically #13979

Open
Peter-Juhasz opened this issue Jan 11, 2015 · 152 comments
Open

String comparer for sorting numeric strings logically #13979

Peter-Juhasz opened this issue Jan 11, 2015 · 152 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@Peter-Juhasz
Copy link

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        NumericOrdering = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

  • Logical is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
  • LogicalIgnoreCase is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
  • Non-numeric sequences will be evaluated with the culture provided.
  • Numeric sequences will be determined by the result of Char.IsDigit.
  • All UTF-16 digits will be supported and are manually parsed using Char.GetNumericValue.
  • Only positive integral values without digit separators will be supported directly.
  • Numbers will be treated as ulongs. Logic for overflows will have to be considered.
  • The string Windows 8.1 would be considered 4 sequences. The Windows would be a string sequence, the 8 would be a numeric sequence, the . would be another string sequence, and the 1 would be another numeric sequence.
  • This API could later be expanded to include support for allowing signs, decimals, and digit separators through the use of overloads accepting a NumberStyles parameter.
  • When a numeric and string sequence are considered at the same time the numeric sequence always comes before the string sequence so when sorting the following list, "a", "7" the number 7 will be sorted before the letter a.
  • Existing methods that take a CompareOptions parameter as input will need to be updated to support the new Logical member.

Open Questions

  • Should CompareOptions.Logical be implemented as the flag option SORT_DIGITSASNUMBERS to the dwCmpFlags parameter of CompareStringEx? Using it's implementation should be more efficient but later expanding support for NumberStyles will require a re-implementation with matching behavior.

Updates

  • Added Logical and LogicalIgnoreCase properties.
  • Added support for all UTF-16 digits.
  • Added more CreateLogical overloads to match the Create method.
  • Added retrieval of the NumberFormatInfo from the StringComparer parameter when not explicitly provided and is a CultureAwareComparer.
  • Removed CreateLogical overloads that matched the Create method.
  • Switched to only supporting positive integral values without digit separators.
  • Added consideration of comparing a numeric sequence with a string sequence.
  • Added the flag member CompareOptions.Logical and changed CreateLogical to be just an overload of Create.
@terrajobst
Copy link
Member

That's a good suggestion. Would you be willing to make an API proposal?

@MgSam
Copy link

MgSam commented Jan 22, 2015

I think conceptually this is a good idea, but there are a lot of details that need to be worked out. It seems like such a comparer might want to take options which control how it behaves. For example, are commas in numbers ok? Spaces? Decimal points? How are different cultures handled?

@sharwell
Copy link
Member

While it's easy to define the behavior on Windows systems as following StrCmpLogicalW, this could be problematic in the long run.

  1. The precise behavior of StrCmpLogicalW is not only undocumented, but it explicitly states that it can change in future releases of Windows.
  2. The behavior of this method is not case-sensitive, and there is no case-sensitive method which is otherwise equivalent. It would be strange, for example, to provide StringComparer.LogicalIgnoreCase without providing StringComparer.Logical.

I believe you could define a reasonable API proposal in terms of a lexicographical ordering of substrings divided at numeric boundaries. Sequences of non-digits would be compared using StringComparer.CurrentCulture or StringComparer.CurrentCultureIgnoreCase, and sequences of digits would be compared according to their numeric values.

@Peter-Juhasz
Copy link
Author

The comparer should require an IFormatProvider or NumberFormatInfo; which can be passed in the constructor or it should use the current one from the owning thread.

@terrajobst This is a problem I am facing in almost every business application: proper ordering even if the strings contain numbers. I don't have time to make a PR or document it well, but I would be very happy to see it in the new releases of the core framework. I think the use case is clear enough.

@eatdrinksleepcode
Copy link

This is a need that I have often faced. If @Peter-Juhasz does not have the time to create the API proposal or implement it, I would be interested in taking it on.

@sharwell
Copy link
Member

@Peter-Juhasz There is probably no need to consider the NumberFormatInfo. As stated in the documentation for NumberFormatInfo.NativeDigits:

The character set that is specified by the NativeDigits property has no effect on parsing or formatting operations. Only the Basic Latin digits 0 (U+0030) through 9 (U+0039) are used when formatting or parsing numeric values or date and time values.

If StringComparer.Logical and StringComparer.LogicalIgnoreCase were defined to use StringComparer.CurrentCulture and StringComparer.CurrentCultureIgnoreCase for comparing non-digits, the culture used for the comparison is clear.

On a side note, even if used there is no need to explicitly provide the NumberFormatInfo to the comparer since the current culture information already provides it.

@ellismg ellismg assigned tarekgh and unassigned ellismg Aug 3, 2016
@tarekgh
Copy link
Member

tarekgh commented Aug 16, 2016

Looks in Windows, CompareStringEx can be used with the flag SORT_DIGITSASNUMBERS. I didn't look at ICU if it support such functionality too.

@karelz karelz assigned krwq and unassigned tarekgh Nov 8, 2016
@krwq
Copy link
Member

krwq commented Nov 28, 2016

I believe there is few things which need to be worked on here:

  • check if there is linux implementation existing and if it behaves similarly to CompareStringEx with SORT_DIGITSASNUMBERS
  • should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)

I think this would be a nice addition but it needs some work.

@krwq krwq removed their assignment Nov 28, 2016
@tarekgh
Copy link
Member

tarekgh commented Nov 28, 2016

should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)

I don't think we need to do that. it would be just option (or flag) passed with the CompareOptions and let the underlying OS handle it.

I think this would be a nice addition but it needs some work.

I agree with that

@jnm2
Copy link
Contributor

jnm2 commented Nov 30, 2016

I'd see value in both an OS-dependent sort for cases where you want to match the OS (file listing, for example), and an OS-independent .NET sort that natively understands both numbers and dates. A real life example is needing month names to be correctly sorted, even when that's the only date-related part of the string.

The OS-dependent sort should be super simple to implement. After that's done I think it would still be worth putting design time in towards an independent managed sort as well, to figure out if and how people's needs differ.

@TylerBrinkley
Copy link
Contributor

TylerBrinkley commented Feb 23, 2017

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
+        public static StringComparer Logical { get; }
+        public static StringComparer LogicalIgnoreCase { get; }
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        Logical = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

  • Logical is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
  • LogicalIgnoreCase is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
  • Non-numeric sequences will be evaluated with the culture provided.
  • Numeric sequences will be determined by the result of Char.IsDigit.
  • All UTF-16 digits will be supported and are manually parsed using Char.GetNumericValue.
  • Only positive integral values without digit separators will be supported directly.
  • Numbers will be treated as ulongs. Logic for overflows will have to be considered.
  • The string Windows 8.1 would be considered 4 sequences. The Windows would be a string sequence, the 8 would be a numeric sequence, the . would be another string sequence, and the 1 would be another numeric sequence.
  • This API could later be expanded to include support for allowing signs, decimals, and digit separators through the use of overloads accepting a NumberStyles parameter.
  • When a numeric and string sequence are considered at the same time the numeric sequence always comes before the string sequence so when sorting the following list, "a", "7" the number 7 will be sorted before the letter a.
  • Existing methods that take a CompareOptions parameter as input will need to be updated to support the new Logical member.

Open Questions

  • Should CompareOptions.Logical be implemented as the flag option SORT_DIGITSASNUMBERS to the dwCmpFlags parameter of CompareStringEx? Using it's implementation should be more efficient but later expanding support for NumberStyles will require a re-implementation with matching behavior.

Updates

  • Added Logical and LogicalIgnoreCase properties.
  • Added support for all UTF-16 digits.
  • Added more CreateLogical overloads to match the Create method.
  • Added retrieval of the NumberFormatInfo from the StringComparer parameter when not explicitly provided and is a CultureAwareComparer.
  • Removed CreateLogical overloads that matched the Create method.
  • Switched to only supporting positive integral values without digit separators.
  • Added consideration of comparing a numeric sequence with a string sequence.
  • Added the flag member CompareOptions.Logical and changed CreateLogical to be just an overload of Create.

@tarekgh
Copy link
Member

tarekgh commented Feb 23, 2017

we are limited to what the underlying OS can give us as functionality support, this need to be investigated if both Windows and Linux can provide this functionality before we proceed here.

@TylerBrinkley
Copy link
Contributor

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

@tarekgh
Copy link
Member

tarekgh commented Feb 23, 2017

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

This is not easy to do especially we don't carry the needed collation tables needed to do this.

if it is just ASCII characters that is possible. if you have more complicated scenario that will be challenging except if you go and parse the string split them. this will not be nice performance wise.

in short, Linguistic comparisons will be challenging.

@TylerBrinkley
Copy link
Contributor

You're probably right about the performance. I'm currently using an implementation that splits the string which works for me but I understand it may not be quick enough for the framework due to the string allocations.

@tarekgh
Copy link
Member

tarekgh commented Feb 23, 2017

it is not regarding the string allocations. we can pin the strings and avoid the allocations. it is about parsing the string and then compare each part and then handle the digits comparisons. note that, such functionality has to support all digits for all languages too and not just 0~9.

@TylerBrinkley
Copy link
Contributor

TylerBrinkley commented Feb 23, 2017

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages.

After looking at my implementation I think char.IsDigit would work just fine.

@jnm2
Copy link
Contributor

jnm2 commented Feb 23, 2017

For some scenarios I'd prefer an OS-dependent logical sort, for others an invariant logical sort.
I'd really love something like this:

var comparer = StringComparer.TryGetOSLogical()
    ?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I want to match the feel of the native OS if possible.

@tarekgh
Copy link
Member

tarekgh commented Feb 23, 2017

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages. After looking at my implementation I think char.IsDigit would work just fine.

We don't have to use NumberFormatInfo.NativeDigits because this will make it work with only specific culture while digits in all languages should just work regardless of the chosen culture because digits doesn't have linguistic context (in most cases).

char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.

I am not trying to push back here but I am just explaining the complexity we'll have if we need to do it without OS help.

var comparer = StringComparer.TryGetOSLogical()
?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I hope StringComparer.TryGetOSLogical() exist and work all the time but the logic to fallback will create some other issues regarding the difference in the behavior if OS support it and if not. This may be OK but we have to understand that.

@TylerBrinkley
Copy link
Contributor

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

@tarekgh
Copy link
Member

tarekgh commented Feb 23, 2017

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

  • Int32.TryParse with NumberFormatInfo will not parse native digits.
  • also if this work, this means if you have en-US NumberFormatInfo you'll not be able to compare strings containing say Urdu digits. and that is what I think we should have all native numbers work regardless of the cultures because numbers are well identified in the Unicode.

@TylerBrinkley
Copy link
Contributor

TylerBrinkley commented Feb 24, 2017

Int32.TryParse with NumberFormatInfo will not parse native digits.

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

@tarekgh
Copy link
Member

tarekgh commented Feb 24, 2017

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

Because the functionality we are talking about is regarding the collation which support all native digits. Also there is some requests we need to support native digits in the formatting and parsing APIs but we didn't get into it yet.

In short, if we support only 0~9 digits, sooner or later we'll get request to support the native digits. so it will be good to have a good plan in our support before we jump doing it.

@rixtech
Copy link

rixtech commented Jan 8, 2021

@rixtech This repository is not related to cmd.exe.
Fair enough, but PowerShell/PowerShell#12931 issue said to vote here as the PowerShell natural sort is dependent on this issue. When PowerShell gets the fix, I would want cmd.exe to get the fix too for consistency.

@KalleOlaviNiemitalo
Copy link

@rixtech, some CMD.exe issues are tracked in https://github.com/microsoft/terminal/issues?q=is%3Aissue+label%3AProduct-Cmd.exe, but changes are usually rejected because of compatibility risks.

@madelson
Copy link
Contributor

Has the interaction with various CompareOptions flags been considered? For example, does IgnoreNonSpace/IgnoreSymbols affect the detection of numeric segments or does it only affect comparison within non-numeric segments? Are there any flags that should throw when combined with Logical?

Also, are leading zeros supposed to receive any special treatment? For example, does "a 01 b" compare equal to "a 1 b"?

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2022

@madelson when we implement this feature (with ICU), we'll use UCOL_NUMERIC_COLLATION. This will enable sorting digits only (no separator, symbols, or negative signs is supported).

UCOL_NUMERIC_COLLATION | When turned on, this attribute makes substrings of digits sort according to their numeric values.This is a way to get '100' to sort AFTER '2'. Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros). If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.A "digit" in this sense is a code point with General_Category=Nd, which does not include circled numbers, roman numerals, etc. Only a contiguous digit substring is considered, that is, non-negative integers without separators. There is no support for plus/minus signs, decimals, exponents, etc.Stable:ICU 2.8

re leading zeros supposed to receive any special treatment? For example, does "a 01 b" compare equal to "a 1 b"?

Yes, these 2 strings will be equal.

In short, we are not going to perform this numeric sorting manually, instead we'll use the underlying libraries (ICU or NLS) to do the job.

@pompushko
Copy link

voting for that

@victorgoebel
Copy link

This issue has been open since 2015 so assuming it won't get done?

I'm using NaturalStringExtensions for now which seems to work well.

image

@Gnbrkm41
Copy link
Contributor

Wow, it has been 8 years since this issue was first opened. What's the status on this?

So the current status is that we have these APIs approved:

namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
     }
}
namespace System.Globalization {
     public enum CompareOptions {
+        NumericOrdering = 0x00000020
     }
 }

and from what I see we decided that we'll just use the default available sorting available in the globalisation library (either ICS or NCU) without implementing those ourselves, at least for now.... Is there any other things that we might be missing, that prevents this from going forward?

@tannergooding
Copy link
Member

Just needs someone with time to implement it. There's been tons of other work and this hasn't bubbled up such that I've been able to do it myself yet.

@Gnbrkm41
Copy link
Contributor

I haven't touched the dotnet/runtime repo for quite a while but I'm willing to give it a go if that's the only thing blocking it from happening. I might even have something I was working on a while ago (although it was years ago and I imagine it'll need to be torn down to fit what we have now...).

@zotabee
Copy link

zotabee commented Nov 24, 2023

I hope this will be implemented one day!

This is a critical lacking feature.

@kasperk81
Copy link
Contributor

This is a critical lacking feature.

dotnet add package NaturalSort.Extension gives you StringComparison.OrdinalIgnoreCase.WithNaturalSort()
https://github.com/tompazourek/NaturalSort.Extension#usage
not everything has to come from bcl to make it count

@zotabee
Copy link

zotabee commented Nov 26, 2023

This is a critical lacking feature.

dotnet add package NaturalSort.Extension gives you StringComparison.OrdinalIgnoreCase.WithNaturalSort() https://github.com/tompazourek/NaturalSort.Extension#usage not everything has to come from bcl to make it count

Thanks, But is this usuable in PowerShell for example or only in C# ?

@kasperk81
Copy link
Contributor

kasperk81 commented Nov 26, 2023

powershell cmdlet

# TODO: optimize without copying content
# TODO: support for direct calls: Sort-Naturally "img12.png", "img10.png", "img2.png", "img1.png"
function Sort-Naturally
{
    param
    (
        [Parameter(Mandatory, ValueFromPipeline)]
        [string[]]
        $input
    )
    begin
    {
        $version = "4.3.0"
        $temp = $([System.IO.Path]::GetTempPath())
        $dllPath = Join-Path "$temp" "nsext$version/lib/net6.0/NaturalSort.Extension.dll"
        if (! (Test-Path "$dllPath"))
        {
            $dest = "$temp/nsext$version"
            Invoke-WebRequest https://www.nuget.org/api/v2/package/NaturalSort.Extension/$version -Out "$dest.zip" *> $null
            New-Item -ItemType Directory "$dest" *> $null
            Expand-Archive "$dest.zip" -DestinationPath "$dest" *> $null
        }

        [System.Reflection.Assembly]::LoadFrom("$dllPath") *> $null

        $natural=[NaturalSort.Extension.NaturalSortExtension]::WithNaturalSort([System.StringComparison]::OrdinalIgnoreCase)
        $list = [System.Collections.Generic.List[string]]@()
    }
    process
    {
        foreach ($value in $input)
        {
            $list.Add($value) *> $null
        }
    }
    end
    {
        [System.Linq.Enumerable]::OrderBy($list, [Func[string,string]]{ param($x) $x }, $natural)
    }
}

test

> "img12.png", "img10.png", "img2.png", "img1.png" | Sort-Naturally
img1.png
img2.png
img10.png
img12.png

@zotabee
Copy link

zotabee commented Nov 26, 2023

@kasperk81 Thank you very much for this!!!

@siegfriedpammer
Copy link

This is a critical lacking feature.

dotnet add package NaturalSort.Extension gives you StringComparison.OrdinalIgnoreCase.WithNaturalSort() tompazourek/NaturalSort.Extension#usage not everything has to come from bcl to make it count

@kasperk81 Unfortunately, the implementation has a Unicode bug, when compared to StrCmpLogicalW: tompazourek/NaturalSort.Extension#74

I'd argue that it would make a lot of sense to have this added to the BCL to prevent others falling into the same Unicode traps over and over again. Also, googling for this yields a lot of different approaches that all have similar bugs or are just badly implemented. If there were a comparer provided by the BCL it could be highly optimized and it would quickly become the first search result and people would just use that.

I agree with @sharwell that using StrCmpLogicalW as a reference is problematic, but my point still stands that correct Unicode support is necessary but tricky.

@kasperk81
Copy link
Contributor

StrCmpLogicalW

"os sort" and "natural sort" are not the same exact concepts. python https://github.com/SethMMorton/natsort library gives the same output as @tompazourek's dotnet library. i don't know of a sorting implementation in any language that behaves like StrCmpLogicalW by default. natsort added that special "os sort" mode to keep both features https://stackoverflow.com/a/18415320/863980. it's a feature request "add os sort mode" for dotnet library, not a bug.

@siegfriedpammer
Copy link

siegfriedpammer commented Apr 11, 2024

It's a bug when certain characters that are seen as "natural" numbers in some cultures and even recognized by the Unicode Standard are not treated as such, imho. Similarly, as a native speaker of German, I would be annoyed if Ä was sorted after Z instead of after A, just because the guy implementing the "natural" sort didn't see it as natural or didn't know about it.

@kasperk81
Copy link
Contributor

"how does it work in other languages on various platforms?" has higher precedence for dotnet than win32 behavior

@siegfriedpammer
Copy link

Yes, as I already pointed out

I agree with @sharwell that using StrCmpLogicalW as a reference is problematic, but my point still stands that correct Unicode support is necessary but tricky.

I am solely interested in getting proper Unicode support and the Win API does a better job at that than the alternatives I found. At this point for me it's only about Unicode and nothing else.

@CyrusNajmabadi
Copy link
Member

Could you submit that support to that open source library you reported the bug to?

@tompazourek
Copy link

Could you submit that support to that open source library you reported the bug to?

I think it's my library and this issue: tompazourek/NaturalSort.Extension#74

I just added the Unicode support.

@CyrusNajmabadi
Copy link
Member

@tompazourek thanks very much! :)

@iSazonov
Copy link
Contributor

@tannergooding @tarekgh ICU is now well integrated into Runtime and probably most of the work has already been done. Is there a chance to include this in .Net 9?

@IDisposable
Copy link
Contributor

IDisposable commented Jul 5, 2024

@tannergooding @tarekgh I would like to take getting this over the finish line... are the API reviews current and approved still?

@tarekgh
Copy link
Member

tarekgh commented Jul 5, 2024

This will be risky to do it now. Let's plan doing it early next cycle.

@jeffhandley
Copy link
Member

Assigning to @PranavSenthilnathan to try to get this in for .NET 10 Preview 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

No branches or pull requests