Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an Enumerable.Match method #39443

Closed
simon-curtis opened this issue Jul 16, 2020 · 67 comments
Closed

Add an Enumerable.Match method #39443

simon-curtis opened this issue Jul 16, 2020 · 67 comments
Assignees
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@simon-curtis
Copy link

simon-curtis commented Jul 16, 2020

Background and Motivation

Simple proposal to return both matched and unmatched results from a where query in a one-liner.

var chars = new char[] { 'a', 'b', 'c' };

(matched, unmatched) = chars.Partition(c => c == 'a');
for (var element in unmatched) ...

//or

IMatchResults results = chars.Partition(c => c == 'a');
for (var element in results.Unmatched) ...

For readability I would normally do the following; but requires and extra pass

var chars = new char[] { 'a', 'b', 'c' };
var matched = chars.Where(c => c == 'a');
var unmatched = chars.Where(c => c != 'a');

You can achieve this in one pass using a for loop, but goes against the point of Linq and is longer

var chars = new char[] { 'a', 'b', 'c' };
var matched = new List<char>();
var unmatched = new List<char>();
foreach (var c in chars) {
    if (c == 'a') {
        matched.Add(c);
    } else {
        unmatched.Add(c);
    }
}

Proposed API

The proposed method buffers the results. The returned type is a tuple of IReadOnlyLists to better communicate this fact.

namespace System.Linq
{
    public static class Enumerable
    {
+        public static (IReadOnlyList<TSource> Satisfied, IReadOnlyList<TSource> Falsified) Partition<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
    }
}

We should not be implementing an IQueryable overload.

@333fred 333fred transferred this issue from dotnet/csharplang Jul 16, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Linq untriaged New issue has not been triaged by the area owner labels Jul 16, 2020
@ghost
Copy link

ghost commented Jul 16, 2020

Tagging subscribers to this area: @eiriktsarpalis
Notify danmosemsft if you want to be subscribed.

@333fred
Copy link
Member

333fred commented Jul 16, 2020

This is an API request, moving to the correct repo.

@simon-curtis
Copy link
Author

@333fred oh sorry, at some point it must have sent me to the runtime repo... I will close and re-open in the correct repo

@333fred 333fred reopened this Jul 16, 2020
@333fred
Copy link
Member

333fred commented Jul 16, 2020

@simon-curtis sorry if that wasn't clear. I moved it already.

@canton7
Copy link

canton7 commented Jul 16, 2020

Just a note, there was a big discussion about this on gitter earlier this year. Turns out it's tricky.

One problem is that Linq wants to be lazy (where possible) and avoid iterating the source collection twice. It's hard to do a partition like this in a lazy way, because it depends on when you decide to enumerate matched and unmatched: you inevitably end up having to buffer.

Another problem is that IEnumerator<T> is IDisposable (and things like File.ReadLines rely on this to close a resource when enumeration finishes). So we have to make sure that the enumerator returned from chars.GetEnumerator() is appropriately disposed when you've finished iterating both matched and unmatched (and there's no motivation for the user to dispose the enumerators returned from both of them: it's easy to just leave one of them un-iterated if some condition is met)

Another is, what happens if you call matched.GetEnumerator() or unmatched.GetEnumerator() a second time? Do you get another pair of "connected" enumerators?

(You can do something ugly like this to try and overcome those, but it has its own problems.)

@simon-curtis
Copy link
Author

@simon-curtis sorry if that wasn't clear. I moved it already.

Oh I see, thanks! Sorry, still reasonably new to GitHub

@jnm2
Copy link
Contributor

jnm2 commented Jul 16, 2020

Basing it on IEnumerable is dubious as mentioned, but declaring this for concrete collection types like ImmutableArray and ImmutableList is very handy. I call my extension methods Partition like Rust's.

@Clockwork-Muse
Copy link
Contributor

GroupBy can at least get you close, as a workaround. Note this won't be two separate iterators. Or (ab)using Join for a full-lazy iterator (although results won't be grouped by partition).

@eiriktsarpalis
Copy link
Member

The F# equivalent is also called partition. In C# it might look as follows:

public static class Enumerable
{
    public static (List<TSource> Satisfied, List<TSource> Falsified) Partition<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
}

@eiriktsarpalis eiriktsarpalis added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Jul 16, 2020
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Jul 16, 2020
@slavanap
Copy link

slavanap commented Jul 18, 2020

What about this approach?

//Microsoft (R) Visual C# Compiler version 2.9.0.63208 (958f2354)

using System;
using System.Collections.Generic;
using System.Linq;

namespace Rextester
{
    public static class MatchExtension {
        public static (IEnumerable<T>, IEnumerable<T>) Match<T>(this IEnumerable<T> c, Func<T, bool> cond) {
            var dict = c.GroupBy(cond).ToDictionary (k => k.Key, v => v);
            return (dict[true], dict[false]);
        }
    }
    public class Program
    {
        public static void Main(string[] args)
        {
            var c = new [] {"a","b","a"};
            var (matched, unmatched) = c.Match(k => k=="a");
            Console.WriteLine(string.Join(",",matched));
            Console.WriteLine(string.Join(",",unmatched));
        }
    }
}

P.S. It will buffer full collection as far as I remember, which may be not desired behavior.

@jnm2
Copy link
Contributor

jnm2 commented Jul 18, 2020

.ToLookup(...) is better than .GroupBy(...).ToDictionary(...). More efficient is:

foreach (var value in source)
{
    (predicate(value) ? trueBuilder : falseBuilder).Add(value);
}

@slavanap
Copy link

slavanap commented Jul 18, 2020

@jnm2 good point!

To sum up, currently there are 2 solutions for the Match problem:

  1. copy (cache) elements into new collections (.ToLookup, 3rd code chunk in @simon-curtis message),
  2. run 2 iterators over original IEnumerable (for example, if IEnumerable created from a file) (2nd code chunk in @simon-curtis message)
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchExtension {
    public static (IEnumerable<T>, IEnumerable<T>) MatchBuffered<T>(this IEnumerable<T> c, Func<T, bool> cond) {
        var l = c.ToLookup(cond);
        return (l[true], l[false]);
    }
	
    public static (IEnumerable<T>, IEnumerable<T>) Match<T>(this IEnumerable<T> c, Func<T, bool> cond) {
        return (c.Where(cond), c.Where(v => !cond(v)));
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        var c = new [] {"a","b","a"};
        {
            var (matched, unmatched) = c.Match(k => k=="a");
            Console.WriteLine(string.Join(",",matched));
            Console.WriteLine(string.Join(",",unmatched));
        }
        {
            var (matched, unmatched) = c.MatchBuffered(k => k=="a");
            Console.WriteLine(string.Join(",",matched));
            Console.WriteLine(string.Join(",",unmatched));
        }
    }
}

The non-buffered approach can be extended for IQueryable, but would require a bit code for original Expression negation.

@simon-curtis
Copy link
Author

simon-curtis commented Jul 18, 2020

I had this thought from a project I’m doing in blazor with EFCore, where I was separating entities’ UI elements by its archived property.

foreach (var entity in matched) ...
foreach (var entity in unmatched) ...

I wouldn’t want to go back to the database on every render so would definitely want some form of buffering, in this instance I’m using DbContext. But I could see how in other cases you wouldn’t want buffering.

I did end up using GroupBy().ToDictionary(), just not as a static method so I will do that on Monday to make my life easier than doing it every time, but looked messy and that’s why I’m suggesting a one-liner

@jnm2
Copy link
Contributor

jnm2 commented Jul 18, 2020

How would the non-buffered approach work when the matched or unmatched enumerables are enumerated more than once?

Assume that source returns a different sequence on each enumeration, which enumerables are explicitly allowed to do:

var (matched, unmatched) = source.Partition(cond);

var a = matched.Select(...).ToList(); // Fully enumerate matched enumerable once
var b = unmatched.Select(...).ToList(); // Fully enumerate unmatched enumerable once

var c = matched.Select(...).ToList();  // Fully enumerate matched enumerable a second time
var d = unmatched.Select(...).ToList();  // Fully enumerate unmatched enumerable a second time

You would assume that elements in a and b are based on the same source sequence and are correlated by index. But what about c and d?

How many times total is source enumerated in the sample above? How is this accomplished?

@slavanap
Copy link

For non-buffered approach I've personally assumed that the source sequence is stable, i.e. does not change over time. For files & databases this is easily achievable. For random generators - of course, not.

@jnm2
Copy link
Contributor

jnm2 commented Jul 18, 2020

For enumerables backed by databases, enumerating source a second time means querying the database a second time. The database is usually capable of returning a different sequence at that point. I might be wrong, but I think the way to state through strong typing that a sequence is stable is IReadOnlyCollection<T>.

@slavanap
Copy link

slavanap commented Jul 18, 2020

@jnm2 , yes. Enumerating database twice will query database twice. You can lock database state with a transactions mechanism. But for databases one would need to implement approach with IQueryable to stream data effectively, because conversion to IEnumerable will trigger data stream of full results to be sent over network.

@jnm2
Copy link
Contributor

jnm2 commented Jul 18, 2020

If the method accepts IEnumerable<T>, people will certainly pass enumerables that return different sequences on each enumeration. This isn't far-fetched from the code I've seen. If the Partition method doesn't buffer, what should the returned enumerables do if they are enumerated more than once?

@slavanap
Copy link

Usually it depends on the implementation. IEnumerable is allowed to be enumerated more than once.

@jnm2
Copy link
Contributor

jnm2 commented Jul 18, 2020

@slavanap My worry is, what should that implementation be? I'm not sure how to avoid buffering and be safe. And you can't get away from all buffering. If the first half doesn't match and the second half does, and you fully enumerate the matched enumerable first, you have to buffer half the data anyway.

@slavanap
Copy link

@jnm2 you dispose data while enumerating (and with IQueryable you query each half separately without loading any data from another half).

If source IEnumerable sources data from file or network, then your IEnumerable size is not constrained by RAM capacity.

While buffered IEnumerable is simpler it has its limitations, like RAM constraints.

@terrajobst
Copy link
Member

terrajobst commented Feb 9, 2021

Video

  • Let's return the raw List<T> because it's more useful to callers (and we likely can't change the instance anyway)
  • Let's rename the tuple fields to Matched and Unmatched to make easier to understand
  • We should rename the method to Match
namespace System.Linq
{
    public partial static class Enumerable
    {
        public static (List<TSource> Matched, List<TSource> Unmatched) Match<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
    }
}

@eiriktsarpalis eiriktsarpalis changed the title Add an Enumerable.Partition method Add an Enumerable.Match method Feb 9, 2021
@eiriktsarpalis eiriktsarpalis added the help wanted [up-for-grabs] Good issue for external contributors label Feb 9, 2021
@eiriktsarpalis eiriktsarpalis modified the milestones: Future, 6.0.0 Feb 9, 2021
@MgSam
Copy link

MgSam commented Feb 10, 2021

I strongly believe you should return IReadOnlyList<T> and not List<T>. You are constraining your implementation in now until forever if you return the concrete List<T> type. The method signature should not be dictated by the implementation whenever possible.

@eiriktsarpalis
Copy link
Member

You are constraining your implementation in now until forever if you return the concrete List type. The method signature should not be dictated by the implementation whenever possible.

I don't see how this implementation would ever use anything other than List<T>, so we opted for the simple solution. There are benefits to returning a concrete type and there is certainly precedent in LINQ, e.g. the ToDictionary() method.

@MgSam
Copy link

MgSam commented Feb 10, 2021

I don't see how this implementation would ever use anything other than List...

We never know now what we'll know in the future. That's the entire point of defensive coding. The framework is littered with stuff that is now considered obsolete or bad practice that seemed like a great idea at the time.

...there is certainly precedent in LINQ, e.g. the ToDictionary() method.

ToDictionary() and ToList() are totally different. They are for the express purpose of getting those concrete types back.

No other LINQ methods return concrete types. Including GroupBy, where you are actually constructing a completely new collection but hiding it behind an interface.

There are benefits to returning a concrete type...

Like what? The only benefit that I can think of is that the user will mutate it, which is not something the BCL should be encouraging.

Returning IReadOnlyList<T> is a much safer and better practice and should be the example the BCL sets. It is ordered, it has a count; the only thing it cannot do is mutate.

Returning List<T> from any public method is a code smell and IMO it should absolutely be not baked into a BCL LINQ method.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Feb 10, 2021

I don't see how this implementation would ever use anything other than List...

We never know now what we'll know in the future. That's the entire point of defensive coding. The framework is littered with stuff that is now considered obsolete or bad practice that seemed like a great idea at the time.

While I understand the point you are trying to make in principle, I don't agree that it applies to this particular case.

ToDictionary() and ToList() are totally different. They are for the express purpose of getting those concrete types back.

No other LINQ methods return concrete types. Including GroupBy, where you are actually constructing a completely new collection but hiding it behind an interface.

Note that the new method has more similarities with ToDictionary and ToList than with GroupBy: it eagerly consumes the source enumerable and returns a materialized collection. We actually debated following the ToType naming convention but just couldn't find an appropriate name. This is also why the method doesn't come with an accompanying Queryable overload.

Returning List from any public method is a code smell and IMO it should absolutely be not baked into a BCL LINQ method.

Again, I understand the point you are making in principle, but I don't see how that translates to the particular method. What particular problems would you expect this cause for users?

@jnm2
Copy link
Contributor

jnm2 commented Feb 10, 2021

@MgSam This was discussed in the video at https://youtu.be/7bRCbwE9CYE?t=2772. Someone mentions:

We do have a principle which is basically: parameters should accept the weakest types they need and return types should be the strongest type you have.

Also, the design team saw it as a good thing for users to be able to own and modify the created instances.

@MgSam
Copy link

MgSam commented Feb 10, 2021

I think a better analog rather than ToDictionary or ToList is Regex.Matches. It's doing conceptually the same thing as this Partition method- except it works on an IEnumerable<char> (aka, a string) instead. And it returns MatchCollection, which is a read-only collection of the different partitions.

We do have a principle which is basically: parameters should accept the weakest types they need and return types should be the strongest type you have.

I totally agree with this principle- which is why I'm proposing IReadOnlyList<T> and not IEnumerable<T>. Having an ordering and a count are both super useful properties which you lose if you return IEnumerable.

What particular problems would you expect this cause for users?

There are multiple problems:

  • Encouraging the bad practice of mutating a collection that wasn't created by you and doesn't belong to you.
  • Baking an inconsistency into the heart of LINQ. Partition is a specialized case of GroupBy, but while GroupBy returns an IGrouping, Partition will return a List<T>?
  • Using a concrete type when an interface would be provide future-proofing. (Thus ensuring that if Eirik Jr in 20 years wants to change the implementation of Partition so that it instead returns some new magical type that uses a quantum cache, he can't do so).

The interface vs non-interface is the point I still really don't understand and which you haven't answered. Even if we all agreed it was a great idea to return a mutable list as the result of Partition- why wouldn't we use IList<T> rather than List<T>? At the end of the day the vast majority of the users won't care if the implementation of the result is a List<T> or a PartitionCollection or an array or an ImmutableList<T> under the hood.

@jnm2
Copy link
Contributor

jnm2 commented Feb 11, 2021

@MgSam

Encouraging the bad practice of mutating a collection that wasn't created by you and doesn't belong to you.

But since they explicitly wanted the semantics of this API to be that the collections being created do belong to you, this is a good reason to return a concrete collection type.

Baking an inconsistency into the heart of LINQ. Partition is a specialized case of GroupBy, but while GroupBy returns an IGrouping, Partition will return a List<T>?

As Eirik said above, this is a matter of perspective. This is intended to diverge from GroupBy because it's intended to serve a slightly different purpose. This usage pattern is not perfectly served by GroupBy, thus the difference and the reason for this API request.

Using a concrete type when an interface would be provide future-proofing. (Thus ensuring that if Eirik Jr in 20 years wants to change the implementation of Partition so that it instead returns some new magical type that uses a quantum cache, he can't do so).

This was addressed in the video. It's really an interesting listen.

@MgSam
Copy link

MgSam commented Feb 11, 2021

@jnm2 I did listen, and if anything the video proves why this issue deserves more thought. They barely spent about two minutes discussing it where two of the participants opinions are "I don't really care" and the third basically says "let's return List because that's what we're using internally". They do discuss the immutable vs mutable question but don't present a very compelling argument IMO. The question of why List<T> vs IList<T> isn't even addressed. Certainly no one brings up the obvious parallel to Regex.Matches.

Public methods should not be exposing implementation details. .NET should not change framework patterns arbitrarily because the developers reviewing the code can't be bothered to make an opinion about it or consider it in the wider context of the rest of the framework.

At the end of the day, it's a minor issue either way but I want to see .NET be the best platform it can be. And I certainly don't want to see LINQ proper become an everything-and-the-kitchen-sink mess of methods with inconsistent usage and documentation like Rx is.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Feb 11, 2021

The question of why List vs IList isn't even addressed.

Simply put, this is an elementary helper method for which List<T> offers the canonical, optimal solution. There are no "clever" optimizations that we would ever consider including, quantum or otherwise.

Public methods should not be exposing implementation details.

I think you might be applying a liberal definition of "implementation detail". List<T> is a public type that is fairly well understood and is already being used by Linq APIs. Perhaps using ValueTuple in the return type is leaking implementation detail too? What about Enumerable.Count() only returning 32-bit integers? I'm not being facetious, it is fairly easy to have enumerables whose count exceeds int.MaxValue.

.NET should not change framework patterns arbitrarily because the developers reviewing the code can't be bothered to make an opinion about it or consider it in the wider context of the rest of the framework.

What "framework patterns" are you referring to? Here's a quote from the framework design guidelines, also mentioned during the meeting:

DO use the most derived type for return values and the least derived type for input parameters. For example take IEnumerable as an input parameter but return Collection as the return type. Provide a clear API entry point for every scenario. Every feature area should have preferably one, but sometimes more, types that are the starting points for exploring given technology. We call such types Aggregate Components. Implementation of large majority of scenarios in given technology area should start with one of the Aggregate Components.

Ultimately, would you be able to propose a potential future concern, particular to this method, that might justify abstracting the return type? Abstraction comes with tradeoffs, so we won't be adopting it for the sake of an ambiguous sense of futureproofing.

@canton7
Copy link

canton7 commented Feb 11, 2021

I just want to make a small point as an observer -- This is Microsoft's house, and they've kindly invited us in to see how they do things. Remember that they are the owners and this is their job: we can make suggestions and give our personal opinions, but we do not have the right to tell them what to do.

I've seen first-hand the annoyance that some stranger telling you how to do your job causes, and I'm keen that we don't risk ruining the great privilege we have here :)

(I don't want to debate this message, just dropping past)

@danmoseley
Copy link
Member

@MgSam

because the developers reviewing the code can't be bothered to make an opinion about it

We welcome your interaction and input here. At the same time, please assume good intent. This is called out in our code of conduct.

@terrajobst
Copy link
Member

@MgSam

I just want to elaborate on a few points here:

  1. The reason we return a concrete type is because we practically can't change the implementation unless the concrete type implementing the interface is fully internal. The reason being that people very often (accidentally or intentionally) downcast the result from, say, IList<T> to List<T>, which would break if we suddenly return a different instance. "But that's theoretical". Sadly it's not, we learned that the hard way when we changed Enumerable.Empty<T>() from returning an empty T[] to an internal type that implements IEnumerable<T>, so we undid that change and kept returning T[]. That's why @bartonjs and me believe that if the framework returns a public type, we're better of using the concrete type.

  2. "The team only spent two minutes on it". That is fair and I'm not gonna claim that the API review team has perfect visibility on all the issues presented to us, including the design trade offs. However, that doesn't mean that nobody does. The domain expert for Linq is @eiriktsarpalis and he presented the issue to us and answered our question. His expertise, and the time he has spent on this thread alone, is far more than the two minutes you cite.

  3. We value different view points. As should be obvious to anyone watching our API reviews we ourselves rarely have a uniform view point on issues. In fact, whenever we do I'm slightly nervous that we're missing something because virtually no design is perfect. If you think it is, it usually means you're not aware of some shortcoming, not that it's free of them. I think your concerns and criticisms have merit. But as @danmoseley pointed out tone matters. Rest assured we all get frustrated and concerned when other people don't understand or don't care for our arguments. But snapping and lashing out at people isn't an option because it would meant that 90% of our discussions would become toxic wastelands as consensus never exists at the beginning, it is created through discourse and exchange of criticism.

@terrajobst
Copy link
Member

terrajobst commented Feb 11, 2021

@canton7

I just want to make a small point as an observer -- This is Microsoft's house, and they've kindly invited us in to see how they do things. Remember that they are the owners and this is their job: we can make suggestions and give our personal opinions, but we do not have the right to tell them what to do.

I've seen first-hand the annoyance that some stranger telling you how to do your job causes, and I'm keen that we don't risk ruining the great privilege we have here :)

Thanks for the kind words. I just want to point out that this is a two way street for us: we equally feel very privileged of being lucky that so many contributors take their time and help us make .NET better. That is a big reason why I care about the conduct in our discussions: it's not just how people talk to us, it's how we all talk to each other. I don't want to lose contributors or potential contributors because they see a toxic thread and decide that this isn't the community they want to be a part of.

@stephentoub
Copy link
Member

stephentoub commented Feb 11, 2021

Sadly it's not, we learned that the hard way when we changed Enumerable.Empty() from returning an empty T[] to an internal type that implements IEnumerable, so we undid that change and kept returning T[]

(@terrajobst, just FYI, we've actually been returning something other than T[] for several releases now, since .NET Core 3.0 if memory serves.)

@terrajobst
Copy link
Member

(@terrajobst, just FYI, we've actually been returning something other than T[] for several releases now, since .NET Core 3.0 if memory serves.)

Don't ruin my brilliant tale with facts 😄

@danmoseley
Copy link
Member

danmoseley commented Feb 11, 2021

IDictionary GetEnvironmentVariables() is a better example. It was changed to return a Dictionary<string, string> but that was too breaking so it had to go back to a Hashtable. This actually broke me on a previous product -- it's so tempting to downcast that IDictionary, because you know what it is, right? That isn't to say we never make changes like that, particularly in .NET Core. But there is a cost/benefit if we do -- the flexibility can be illusory.

@MgSam
Copy link

MgSam commented Feb 13, 2021

Thanks guys for the explanation. I agree there definitely can be unintended consequences even with an interface as the return type if you ever needed to change that, especially at the BCL level. That said, someone's reliance on internal implementation details is a much "easier" breaking change to swallow than changing a concrete type.

@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Mar 18, 2021
@eiriktsarpalis
Copy link
Member

Per #49819 (review) we have reached the conclusion that the proposed method, while useful for some applications, does not meet the bar for inclusion in System.Linq. I'm therefore going to close this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 17, 2021
@teo-tsirpanis teo-tsirpanis added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed api-approved API was approved in API review, it can be implemented labels Apr 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.