add few missing methods for set-like operations #23528

rfourquet · 2017-08-31T17:18:40Z

For intersect, union, setdiff and symdiff, and their mutating variants for Set and IntSet, there is now support for arbitrary many arguments consistently. What breaks is for one-argument versions: for example we had s=Set(); union(s) !== s but symdiff(s) === s. I chose here to always return a copy (this should be the only regression in terms of performance). Also, I realize that having setdiff accepting more or less than two arguments may be controversial; I find it natural, but can remove this bit if requested.
This should not get merged until nanosoldier runs the corresponding benchmarks.

PS: nice to have those improvements with a net decrease of LOC! (in particular when there are 28 more lines for tests)

EDIT: also, this takes care of improving the complexity of some operations mentioned here.
EDIT: this also adds filter/filter! for IntSet returning an IntSet (like for Set and Dict).

rfourquet · 2017-09-02T18:31:16Z

I now added mutating operations also for AbstractVector. Morevover:

intersect was the only operation among the 4 that was preserving the multiplicity; I thought at first it was a bug, until I discovered the specification in the docstring (but it was not tested in the test folder). I allowed myself to remove this inconsistent special case as it's seems arbitrary and error-prone: all set-like operations now return unique'd vectors.
It became clear that for performance, setdiff! must accept arbitrary number of arguments: a Set is constructed in the process, and foldl(v, setdiff!, itrs) would create unnecessarily too many of them.
the semantics of symdiff for IntSet and arrays were not consistent regarding multiplicity in the arguments:

julia> symdiff(IntSet([]), [1, 1])
IntSet([])

julia> symdiff([], [1, 1])
1-element Array{Any,1}:
 1

I harmonized the behaviors in favor of the first, which is the most natural to implement, and the more general as it's possible to call unique on the arguments to get the second behavior.

rfourquet · 2017-09-05T11:47:28Z

Someone against this new behavior? I will otherwise implement the deprecations for the breaking parts.

TotalVerb · 2017-09-05T15:31:25Z

Does the increased generality of non-uniquing operations apply in all cases though, and argue for vector union and intersect to treat them as multisets?

rfourquet · 2017-09-05T15:56:27Z

Does the increased generality of non-uniquing operations apply in all cases though, and argue for vector union and intersect to treat them as multisets?

Very good remark! it could be true, but I guess that in order to implement these semantics efficiently, it would necessitate to implement a real MultiSet type... Moreover it's already non-negligible work to implement just set semantics for vectors while keeping order; I wonder how difficult/easy it is to mix this order property with multisets semantics.

So in short, what I did come up with is the most pragmatic solution I could find: treating vectors as sets as much as possible, while keeping consistency (when possible) with operations on real sets, and if possible keeping the implementation straightforward (if the implementation has to become too complex for those functionalities, it may be a sign that it belongs to a package). But there may be a better compromise.

EDIT: BTW, the master's version of intersect on vectors, which maintains "multiplicity of the first argument", does still not implement what would be a mathematical intersection of multisets.

JeffBezanson · 2017-12-14T10:34:40Z

Sounds good but doesn't need to be on the milestone I think?

rfourquet · 2017-12-14T11:14:56Z

but doesn't need to be on the milestone I think?

It's a breaking change for intersect and symdiff, see my post above (I hope not many people relied on the old behavior), so I would say it needs a decision before feature freaze.

rfourquet · 2017-12-14T11:44:31Z

Also, if we make this change: do we first deprecate the relevant methods in 0.7 and re-introduce the new in 1.0, or do we directly make the breaking change with a NEWS.md item? (I favor the latter)

StefanKarpinski · 2017-12-14T16:01:01Z

I think we can directly make the change with a news item. Seems unlikely to cause that much trouble. (Famous last words)

StefanKarpinski · 2017-12-14T19:47:34Z

From triage, please do rebase this and merge once CI passes.

rfourquet · 2017-12-14T19:57:16Z

I can't run benchmarks here yet (BaseBenchmarks PR not merged yet), but if I remember correctly, this was also giving nice performance improvements!

rfourquet · 2017-12-19T20:18:52Z

@nanosoldier runbenchmarks("collection", vs=":master")

rfourquet · 2017-12-19T20:24:08Z

After 4 months, I'm finally able to run Nanosoldier :)
Few "regressions" are expected:

symdiff(::Union{BitSet,Set}) makes a copy, when it was returning the argument (no-op)
symdiff(::Vector) makes a copy and some work, when it was returning the argument (no-op)
intersect(::Vector, ...) : does not do the same work: before it was roughly filtering the vector on a predicate checking the appartenance to the set of all args but the first; now does a more set-like operation.

Let's see if I missed something. Should be good to go otherwise.

rfourquet · 2017-12-19T22:07:54Z

@nanosoldier runbenchmarks("collection", vs=":master")

rfourquet · 2017-12-19T23:36:05Z

@nanosoldier runbenchmarks("collection", vs=":master")

nanosoldier · 2017-12-20T00:45:48Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

StefanKarpinski · 2017-12-20T05:34:16Z

Some of those slowdowns are pretty large – it sounds like that might be correct and expected. If so, please go for it, but I'll let you judge if they're expected or not, @rfourquet.

rfourquet · 2017-12-20T09:43:25Z

So no bad surprise, all those slowdowns are expected. Besides the no-op case, only intersect on a Vector is slower: but before, you would get a sort of multiset, ie. with non-uniqued values in the set, which was really making intersect(::Vector) the odd one out; to get a set-like vector, you would have to call unique! on it. Moreover, I made a "mistake" while writing those benchmarks, if my goal was (and it was!) to support this PR: indeed consider intersect(v::Vector, itr): the old code would test, for each element x of v, whether x is element of itr. Unless itr is a set-like object (ot at least has an efficient in), this will be linear complexity for each element of v ! whereas in the new code you use an intermediate Set and do a single pass over itr (and itr2 etc. if more arguments). But the benchmark intersects the vector only with small collections, where the linear behavior doesn't matter so much, on the contrary. So even when calling unique!, old code will be faster in some (corner) cases, but for the general case, the new code has a much better algorithmic complexity. My benchmark fails unfortunatly to exhibit this. For example, for something like a = [1:10000;]; intersect(a, a, a) is about 30 times faster in the PR. But to be honest, there are still cases where I could make the code more clever depending on the arguments, to try to close the gap where the old code was faster.

So let's not let this shadow all the other amazing perf improvements from this benchmark report and enjoy the new extended functionality!

StefanKarpinski · 2017-12-20T10:46:31Z

Excellent. Thank you!

rfourquet added the collections Data structures holding multiple items, e.g. sets label Aug 31, 2017

rfourquet mentioned this pull request Sep 1, 2017

implement union!, intersect!, setdiff! for non-sets #23524

Closed

rfourquet added breaking This change will break code potential benchmark Could make a good benchmark in BaseBenchmarks needs news A NEWS entry is required for this change labels Sep 2, 2017

rfourquet added the triage This should be discussed on a triage call label Dec 13, 2017

StefanKarpinski added this to the 1.0 milestone Dec 14, 2017

StefanKarpinski removed the triage This should be discussed on a triage call label Dec 14, 2017

rfourquet added 2 commits December 19, 2017 13:34

add few missing methods for set-like operations

35d280d

fix perf regression in filter/filter!

de8774a

rfourquet force-pushed the rf/few-set-methods branch 2 times, most recently from ca4bdf2 to a74110c Compare December 19, 2017 20:12

add mutating set-like operations for AbstractVector

6bf6973

rfourquet force-pushed the rf/few-set-methods branch from a74110c to 6bf6973 Compare December 19, 2017 21:43

rfourquet merged commit 7b1c06a into master Dec 20, 2017

rfourquet deleted the rf/few-set-methods branch December 20, 2017 09:43

rfourquet mentioned this pull request Dec 29, 2017

setdiff!(a,b) slower than setdiff(a,b) when b is large (and viceversa when a is large) #25317

Closed

ajkeller34 mentioned this pull request Dec 31, 2017

union of tuples is broken #25338

Closed

ararslan mentioned this pull request Jan 29, 2018

intersect of empty tuple is broken #25801

Closed

ararslan mentioned this pull request Mar 22, 2018

Fixed keys() and findfirst() JuliaCollections/DataStructures.jl#374

Merged

guilhermeleobas mentioned this pull request Jun 8, 2018

[RFC] Specialize setdiff! for sets #27486

Closed

vtjnash mentioned this pull request Jul 7, 2018

Benchmarks vs 0.6 in prep for 0.7 release [do not merge] #27030

Closed

KristofferC removed the needs news A NEWS entry is required for this change label Nov 13, 2018

rfourquet mentioned this pull request Feb 5, 2020

make setdiff not call setdiff! ? #34675

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add few missing methods for set-like operations #23528

add few missing methods for set-like operations #23528

rfourquet commented Aug 31, 2017 •

edited

Loading

rfourquet commented Sep 2, 2017 •

edited

Loading

rfourquet commented Sep 5, 2017

TotalVerb commented Sep 5, 2017

rfourquet commented Sep 5, 2017 •

edited

Loading

JeffBezanson commented Dec 14, 2017

rfourquet commented Dec 14, 2017 •

edited

Loading

rfourquet commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

rfourquet commented Dec 14, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

nanosoldier commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017

rfourquet commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017

add few missing methods for set-like operations #23528

add few missing methods for set-like operations #23528

Conversation

rfourquet commented Aug 31, 2017 • edited Loading

rfourquet commented Sep 2, 2017 • edited Loading

rfourquet commented Sep 5, 2017

TotalVerb commented Sep 5, 2017

rfourquet commented Sep 5, 2017 • edited Loading

JeffBezanson commented Dec 14, 2017

rfourquet commented Dec 14, 2017 • edited Loading

rfourquet commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

rfourquet commented Dec 14, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

rfourquet commented Dec 19, 2017

nanosoldier commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017

rfourquet commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017

rfourquet commented Aug 31, 2017 •

edited

Loading

rfourquet commented Sep 2, 2017 •

edited

Loading

rfourquet commented Sep 5, 2017 •

edited

Loading

rfourquet commented Dec 14, 2017 •

edited

Loading