Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add the group_by and group_by_mut methods to slice #2477

Closed
wants to merge 1 commit into from

Conversation

Kerollmops
Copy link
Contributor

@Kerollmops Kerollmops commented Jun 15, 2018

This RFC propose to add two new methods to the slice, the group_by and group_by_mut. These two will provide a way to iterate over non-overlapping sub-slices of a base slice that are separated by the predicate given by the user (e.g. Partial::eq, |a, b| a < b).

The predicate is called on two elements following themselves, it means the predicate is called on slice[0] and slice[1] then on slice[1] and slice[2]...

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

Pending Pull Request

Work around temporary library

Rendered

@CodesInChaos
Copy link

CodesInChaos commented Jun 15, 2018

I don't like the naming. group by in SQL and C# work quite differently from what you propose, they map each item to a key and then group all items having the same key.

Something like split_between would fit the behaviour better.

@Lokathor
Copy link
Contributor

I guess, but this is exactly what groupBy does in Haskell, so the name should stay.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jun 15, 2018

@CodesInChaos the function you describe has the same behavior as GroupWith in Haskell.

@ssokolow
Copy link

ssokolow commented Jun 15, 2018

I don't like the naming. group by in SQL and C# work quite differently from what you propose, they map each item to a key and then group all items having the same key.

Something like split_between would fit the behaviour better.

I guess, but this is exactly what groupBy does in Haskell, so the name should stay.

I don't like the "My language is more important than yours" tone of this exchange.

That said, from my perspective, a name like split_between still wins out because I haven't seen it anywhere and I consider "no preconceptions" to be better for learnability than being torn between the incompatible preconceptions of two different languages, both significant.

Assuming that group_by turns out to be a non-starter for this reason, is there another name with prior art that could be proposed?

@rvolgers
Copy link

This seems pretty similar to the existing take_while and skip_while functions. But it's even more specific, so I'd want to see some clear use cases. Maybe it's obvious to Haskell programmers, but it isn't to me 😄

I can imagine some cases where you iterate over a list and you can skip some work if some predicate hasn't changed. But in that case, it seems clearer to me to write it as a single loop with that part of the behavior written out with a prev_value variable and in imperative form. It may be a couple more lines of code but it makes the shape of the code a lot more representative of what is being accomplished (a single iteration where the processing has a special fast path case).

@Centril
Copy link
Contributor

Centril commented Jun 15, 2018

@ssokolow

I don't like the "My language is more important than yours" tone of this exchange.

I don't think that's it. There's a good reason to favor a Haskell based name in the case of iterator methods, functions, and related things because Rust already uses Haskell based naming in such cases. For example: map, zip, fold, take, take_while, scan, partition, all, any, find, unzip, cycle, sum, product, repeat, splitAt. There should be some consistency in naming, and group_by is consistent.

There's also precedent from itertools in naming this operation group_by.

I also think it's a telling and descriptive name. You are grouping elements by where the predicate applies. (Yes, a split happens every time the predicate doesn't match, so the order matters, but if you sort, then it does not. The minor difference can be cleared up in documentation..)

@rvolgers

But it's even more specific, so I'd want to see some clear use cases. Maybe it's obvious to Haskell programmers, but it isn't to me 😄

Grouping is a pretty common operation; Think of GROUP BY in SQL -- a lot of those cases can be expressed as .group_by(..).

It may be a couple more lines of code but it makes the shape of the code a lot more representative of what is being accomplished (a single iteration where the processing has a special fast path case).

Generally speaking, at least to me, iterator style composition makes data flow clearer :)

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jun 15, 2018
@leonardo-m
Copy link

A related by different function in D language:
https://dlang.org/phobos/std_algorithm_iteration.html#group

And Python:
https://docs.python.org/2/library/itertools.html#itertools.groupby

So what semantics do we prefer?

@Centril
Copy link
Contributor

Centril commented Jun 15, 2018

@leonardo-m It seems to me that the semantics of D and Python can be recovered from the proposed semantics in this RFC (which is the same as the semantics in the itertools crate) so I would say that the proposed solution is more general.

@Ixrec
Copy link
Contributor

Ixrec commented Jun 15, 2018

To me, it's not even sufficiently clear what the RFC is proposing. The name group_by makes me expect an array like [1, 1, 2, 1] to get turned into the groups [1, 1, 1] and [2]. But the fact that this is an iterator method makes me expect a more linear-time-y behavior like [1, 1], [2] then [1]. The RFC's summary sentence does use the term "runs", which I am just barely familiar with so that also seems to imply [1, 1], [2] then [1]. But the RFC doesn't contain a single example of expected inputs and outputs. And in the ~200 lines of tests in the linked implementation, not one test covers this distinction!

The reference implementation of next() seems like it would in fact produce [1, 1], [2] then [1], so I assume that was the author's intent. But still, that should be made utterly and idiot-proof-ly obvious in the RFC, without having to read an implementation.

Another problem with providing an implementation and no specification is that I have no idea what the guarantees/requirements are for the predicate. Maybe this is a silly question, but am I guaranteed that it gets invoked with (a[0], a[1]) then (a[1], a[2]) and so on in that order exactly once each, or is the implementation allowed to do things like (a[0], a[1]) then (a[0], a[2]) and so on? I assume everyone's assuming the former, but since there isn't a single example or test of a predicate other than equality (panic_param_ord doesn't count), it's not super clear what more interesting predicates I should or should not be able to pass to this algorithm.

Which is all a really long way of saying I don't think we can properly bikeshed the name quite yet. In particular, I think we need to see some examples of more interesting predicates (I can't come up with any) to figure out whether "groups", "runs" or "splitting" is the least misleading term for what's going on.

@hadronized
Copy link
Contributor

hadronized commented Jun 15, 2018

Assuming that group_by turns out to be a non-starter for this reason, is there another name with prior art that could be proposed?

regroup_by? contiguous_group_by? 💃

But the RFC doesn't contain a single example of expected inputs and outputs. And in the ~200 lines of tests in the linked implementation, not one test covers this distinction!

Have a look at the few tests I wrote in my allocation version, here.

@Centril
Copy link
Contributor

Centril commented Jun 15, 2018

@Ixrec I would expect the behavior to be equivalent to https://docs.rs/itertools/0.7.8/itertools/trait.Itertools.html#method.group_by except that you get a slice as the element of the returned iterator instead of an IntoIterator.

@phaazon What's the prior art for regroup_by?

@Kerollmops Small library additions such as the one proposed in this RFC have historically been accepted with a PR against rust-lang/rust, so I would recommend filing a PR against that repo so that we can see the concrete implementation, tests, etc. and then we can have a much clearer idea.

@hadronized
Copy link
Contributor

@Centril About regroup_by, none, and I think:

  • regroup_by semantics would sort upfront, which is not the case of this suggested implementation.
  • group_by is fine enough but if people don’t like it, it shouldn’t be very hard to find a proper name based on the current Iterator other methods.

@hadronized
Copy link
Contributor

continuous_chunks and continuous_chunks_by?

@Ixrec
Copy link
Contributor

Ixrec commented Jun 15, 2018

@Centril Interesting, the itertools method expects an element -> key callback and strongly implies the keys will be compared via PartialEq to determine the groups.

What other use cases does this proposal hope to support by generalizing to a (a, b) -> bool callback? If equality and key equality are all we know of, I'd rather have the more obvious itertools signature.

@hadronized
Copy link
Contributor

(Also, not related to the actual discussion, this is more meta about the commit: @Kerollmops, you seem to have pushed this commit with a company email address. I think you should check twice and possibly rebase with your OSS identity – your commit is also not verified).

@Centril
Copy link
Contributor

Centril commented Jun 15, 2018

@Ixrec I would provide the following hierarchy:

  • The general operation group_by using (T, T) -> bool
  • The half-general operation group_by_key or group_key using T -> Key, Key: PartialEq
  • The specialized operation group using T: PartialEq

The two latter ones can be implemented in terms of the first one. I would not use the name group_by for T -> Key, Key: PartialEq to leave room for the most general version even if we don't add it now.

EDIT: The use cases for the most general version is for when you have some different notion of equality than the natural one ((Partial)Eq) and it is not field based; for example: you want to group all alpha-equivalent terms and you have some function is_alpha_equivalent : (&Term, &Term) -> bool. There are many other interesting equivalences in this space.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jun 15, 2018

It could probably be named split_by.
@Ixrec I will update the RFC to add output examples.
@Centril Ok I will create a PR against rust-lang/rust.

@Centril
Copy link
Contributor

Centril commented Jun 15, 2018

I think it would be confusing to have split and then split_by with entirely different semantics.

@Lokathor
Copy link
Contributor

Yeah, split is absolutely not the verb, even if group is also not the verb (though i think group is the best verb here)

@uberjay
Copy link

uberjay commented Jun 15, 2018

Ruby calls this chunk — https://ruby-doc.org/core-2.5.1/Enumerable.html#method-i-chunk

Of course, that’s already taken. Would it be too confusing if it were chunk_by? Perhaps...

group_by makes me think it’ll behave like the (obviously more expensive) ruby method of the same name, which enumerates and groups/collects (up front) into a hash.

Now... that’s probably not a huge issue, because (in rust :) nobody will accidentally treat the result of group_by as a HashMap?

@Ixrec
Copy link
Contributor

Ixrec commented Jun 15, 2018

It seems like I'm the only one that finds it unintuitive for group to only produce contiguous runs of equivalent elements rather than gathering up all equivalent elements. But just in case I'm not alone: runs, runs_by_key and runs_by seem like acceptable names to me.

@uberjay
Copy link

uberjay commented Jun 15, 2018

@lxrec - I also find group_by unintuitive. It's possible my view is colored by ruby use, but my feeling is that group doesn't convey the sequential nature of the operation.

For fun, I asked someone nearby who isn't familiar with any of the languages in question, and they felt pretty strongly that chunk was a far better word than group, for this operation. Totally un-scientific survey of one person! 😬

Now, there's already chunks, which is for fixed length chunks -- maybe chunks_by wouldn't be that odd? seq_group_by?

@ssokolow
Copy link

ssokolow commented Jun 15, 2018

I'd go with the runs-based naming because "run" is existing jargon (See, for example, Run-length encoding) and, if you're familiar with that jargon, it's intuitive for runs_by to have a similar meaning to passing a key-derivation function into a sorting routine.

(That is, assuming the "contiguous runs" interpretation which would preserve lazy evaluation of the iterator. Otherwise, perhaps something in the vein of collect_by to draw attention to the eager way it must necessarily operate to account for lists not sorted by the derived key first.)

@Lokathor
Copy link
Contributor

Actually group_by will produce all equivalent elements in each output batch if (and only if) the slice is already sorted and the closure used is an equality test. Otherwise you'll see some other sort of result.

Remember that group_by takes &self (for some [T]), so you can't re-arrange any of the values (since you can't mutate the slice's memory backing), and you also can't move the values into temporary memory and sort there (since you don't bound on T: Copy). Thus, the only possible output is a series of sub-slices of the starting slice that don't have any order changes.

The only question left is one of the ones you already said: assuming the first slice starts with (a[0], a[1]), do we keep checking the next element against the start of the slice or against the latest element of the slice?

@uberjay
Copy link

uberjay commented Jun 16, 2018

Ok, a few interesting alternatives from the thesaurus:

segment_by - this actually reads quite nicely to me.
snip_by - ok, but a little weird, and probably too similar to split.

@ssokolow
Copy link

ssokolow commented Jun 16, 2018

@uberjay

I'd go with segment_by. snip_by has a "too clever" feel to it.

Aside from having no precedent I'm aware of, it's too focused on process rather than effect. "Snipping" is what you do to accomplish a goal like "trimming" or "splitting" rather than a goal in itself and says nothing about which snipped pieces, if any, will be retained or discarded. (ie. Are you "snipping [something] off" or "snipping [things] apart"?)

Therefore, it's neither as intuitive as is ideal when looking through a list of methods with a goal in mind, nor obvious about what it does when looking at a use of it in code.

It also generally has a "too informal to fit in with the other terminology" feel to it.

@scottmcm
Copy link
Member

The motivation here reads to me like "this makes it easier to do what this does", which I don't find persuasive. Why is it common to have this problem, to have the input in exactly the form needed for this proposed implementation, and to not just want the eager HashMap-producing version?

@Ixrec You're definitely not alone; I would absolutely expect "group by" to have the relational algebra meaning as well.

@Lokathor
Copy link
Contributor

well this works in a no_alloc situation, for one

@burdges
Copy link

burdges commented Jun 16, 2018

I suppose split and split_mut excluding the matched element was a dubious design decision, given that one can shorten slices easily enough. Anyways you can implement a version that gives slightly short slices like:

let mut b = None;
slice.split(|a| { let c = b; b = Some(a);  b != c })

@burdges
Copy link

burdges commented Jun 17, 2018

Arguably, these should be called split and split_mut and the existing split and split_mut should be renamed to lines_by or separated or whatever, but presumably that's impossible now.

You can almost implement split with this, assuming you capture the closure's lifetimes.

pub fn split<'a F>(&'a self, pred: F) -> impl Iterator<Item=&'a [T]>
  where F: FnMut(&T) -> bool + 'a {
    self.group_by(|a,_| pred(a)).map(|x| x.split_last()).1
}
pub fn split_mut<'a,F>(&'a self, pred: F) -> impl Iterator<Item=&'a mut [T]>
  where F: FnMut(&T) -> bool + 'a {
    self.group_by_mut(|a,_| pred(a)).map(|x| x.split_last_mut()).1
}

It doesn't quite work however since pred(self[self.len()-1]) never gets evaluates and pred cannot be used in the map. #[async] can express this easily though I guess.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jun 18, 2018

I think something like this can correctly emulated the current slice::split method.

pub fn split<F>(&self, pred: F) -> impl Iterator<Item=&[T]>
where F: FnMut(&T) -> bool
{
    self.group_by(|a, _| pred(a))
        .enumerate()
        .map(|(i, slice)| {
            match i {
                0 => slice,
                _ => slice[1..],
            }
        })
}

EDIT: this code ^ is wrong !

But this is not really the subject here, split will be kept as you say, the name for the method I propose will be something between group_by and segment_by.

For the real discussion here is: do we really want to add this method to the standard library ?
@mark-i-m, @Centril and me are likely to merge it in because we faced use cases.

And I did not understand what you mean with the #[async].

@burdges
Copy link

burdges commented Jun 18, 2018

I think our discussion of deriving split from group_by reflects positively on including it in the standard library, well group_by sounds more fundamental.

As I said, it's unfortunate split looks soo similar to split_at while behaving so radically different. :(

Afaik, you'd need the lifetime so the return can borrow both self and pred, but you'd also need pred to appear in both closures, so you must wrap pred in a RefCell or Arc making it look more like:

pub fn split<'a F>(&'a self, pred: F) -> impl Iterator<Item=&'a [T]>
  where F: FnMut(&T) -> bool + 'a {
    let p = RefCell::new(pred);
    self.group_by(|a,_| p.borrow_mut()(a)).map(|x| {
        { let (h,t) = x.split_last();  if pred(*h) { return t; } }
        x
    })
}
pub fn split_mut<'a,F>(&'a self, pred: F) -> impl Iterator<Item=&'a mut [T]>
  where F: FnMut(&T) -> bool + 'a {
    let p = RefCell::new(pred);
    self.group_by_mut(|a,_| p.borrow_mut()(a)).map(|x| {
        { let (h,t) = x.split_last_mut();  if pred(*h) { return t; } }
        x
    })
}

In principle, the new #[async] generator function would permit borrowing pred across a yield point, but of course that's not enough to pass it to two separate closures, so made a mistake there.

@uberjay
Copy link

uberjay commented Jun 18, 2018

Apologies about the immediate bike-shedding.

In terms of the actual functionality, I've found the corresponding functionality pretty useful in Ruby-land for segmenting (or chunking, or grouping, if you will :) line-based output into logical chunks more easily. It's not a great day when you're stuck parsing the output of some random command line tool, but Enumerable#chunk made it a bit more pleasant.

So, it's been useful for me in situations where sorting beforehand doesn't make sense.

@mark-i-m
Copy link
Member

I think the immediate bikeshedding is a sign that people don't have any larger issues with the proposal...

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jun 22, 2018

Ok so the only blocking thing about this RFC seems to be the name chosen (group_by), as @Lokathor says this is the function name that is used in Haskell, and the Rust standard library is mostly inspired by it.

@uberjay propose segment_by and I think nobody is opposed to this name.

It seems that the only way I can make this RFC move forward is by changing the name from group_by to segment_by.

I am not convinced about this renaming, as @mark-i-m says:

I think the immediate bikeshedding is a sign that people don't have any larger issues with the proposal...

So I propose to tag this RFC as proposed-final-comment-period, publish it in this-week-in-rust to make it more visible and see other oppositions than naming.

@Kerollmops

This comment has been minimized.

@uberjay
Copy link

uberjay commented Aug 26, 2018

Totally agree -- it'd be great to have this, regardless of what it's named! I've run across a couple times where it would have come in handy over the past couple months, even!

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Sep 8, 2018

As the RFC seems to get stuck, I will ping someone from the Library team to take a decision.
I choose @Kimundi and @sfackler as he has already been assigned to the "final" pull request.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Sep 8, 2018

I propose to rename the method group and the returned iterator Group, this way it will follow the same rules as the split methods familly follows. The slice::split method returns a Split struct. Same pattern for rsplit, splitn, rsplitn and all the mutable versions.

Note that the Split struct is generic on the function so the Group struct must not be based on partial equality. In other word, no GroupBy or GroupByKey structs, only the Group one which is invoqued by the slice::group method with a generic function.

The same pattern can be found on other types like str::matches which returns a Matches struct that is generic on the predicate, no MatchesBy struct here.

@joshtriplett
Copy link
Member

Nominating for discussion in the next libs team meeting, to get the process un-stuck after the various bikeshed-painting. :)

@Kerollmops
Copy link
Contributor Author

@joshtriplett When does the next libs team meeting occurs ?

@Centril Centril added A-slice Slice related proposals & ideas A-types-libstd Proposals & ideas introducing new types to the standard library. labels Nov 22, 2018
@Kerollmops
Copy link
Contributor Author

Kerollmops commented Dec 27, 2018

For those interrested, I have made a temporary library that I will not publish to crates.io, providing a temporary workaround to this RFC/PR that has not been merged. It compiles on stable rust, the version that will be merged will use unstable functions to improve performance and clarity (i.e. offset_from).

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = GroupBy::new(slice, |a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

@KodrAus KodrAus added the Libs-Tracked Libs issues that are tracked on the team's project board. label Jul 29, 2020
@KodrAus
Copy link
Contributor

KodrAus commented Dec 9, 2020

Sorry we dropped the ball on this one. We've actually just recently restarted Libs meetings, so now, two years later we can say that for API additions like this we actually don't typically need an RFC and will happily land an unstable implementation. That wasn't widely understood by the wider Rust org at the time though.

If anybody would like to resubmit rust-lang/rust#51606 we'll be happy to land this API as unstable.

Sorry for leaving this for so long and thank you for putting all this together @Kerollmops 🙇

@Kerollmops
Copy link
Contributor Author

Thank you @KodrAus for your time, I understand that this is very hard to follow all of these PRs and RFCs so don't worry, I created my own library to support much more than just sequential groups (i.e. linear, binary and, exponential group of search).

I opened a new PR for this addition.

bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 31, 2020
The return of the GroupBy and GroupByMut iterators on slice

According to rust-lang/rfcs#2477 (comment), I am opening this PR again, this time I implemented it in safe Rust only, it is therefore much easier to read and is completely safe.

This PR proposes to add two new methods to the slice, the `group_by` and `group_by_mut`. These two methods provide a way to iterate over non-overlapping sub-slices of a base slice that are separated by the predicate given by the user (e.g. `Partial::eq`, `|a, b| a.abs() < b.abs()`).

```rust
let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);
assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);
```

[An RFC](rust-lang/rfcs#2477) was open 2 years ago but wasn't necessary.
@CodesInChaos
Copy link

CodesInChaos commented Jan 4, 2021

It looks like people are discussing two related but different functions in this issue:

  1. A function which takes a element->key mapping and compares adjacent keys with PartialEq. This is what itertools calls group_by
  2. A function which takes a splitting predicate (element, element) -> bool which decides if it should split between those elements. This is what the OP proposes. This function is more general, but also more annoying to use in the common case. I do not think this function should be called group_by.

@RustyYato
Copy link

Would split_if be a better name?

@KodrAus
Copy link
Contributor

KodrAus commented Jan 14, 2021

This is now being tracked in rust-lang/rust#80552

@CodesInChaos I've captured your concern about naming in that issue. We can continue discussion over there!

Thanks everybody who contributed here.

@KodrAus KodrAus closed this Jan 14, 2021
@Kerollmops Kerollmops deleted the group-by branch January 14, 2021 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-slice Slice related proposals & ideas A-types-libstd Proposals & ideas introducing new types to the standard library. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.