
add few missing methods for set-like operations #23528

Merged: 3 commits merged into master on Dec 20, 2017

Conversation

rfourquet
Member

@rfourquet rfourquet commented Aug 31, 2017

For intersect, union, setdiff and symdiff, and their mutating variants for Set and IntSet, there is now consistent support for arbitrarily many arguments. What breaks is the one-argument versions: for example, we had s = Set(); union(s) !== s but symdiff(s) === s. I chose here to always return a copy (this should be the only regression in terms of performance). Also, I realize that having setdiff accept more or fewer than two arguments may be controversial; I find it natural, but can remove this bit if requested.
This should not get merged until nanosoldier runs the corresponding benchmarks.

PS: it's nice to have these improvements with a net decrease in LOC! (especially since there are 28 more lines of tests)

EDIT: also, this takes care of improving the complexity of some operations mentioned here.
EDIT: this also adds filter/filter! for IntSet returning an IntSet (like for Set and Dict).
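A quick sketch of the harmonized one-argument behavior described above, assuming a Julia version that includes this change: the single-argument forms now always return a copy rather than sometimes aliasing the argument.

```julia
s = Set([1, 2, 3])

u = union(s)    # one-argument union returns a copy, not s itself
d = symdiff(s)  # and one-argument symdiff now does too (it used to return s)

# Both results have the same elements as s but are distinct objects,
# so mutating them cannot affect s:
(u == s, u === s)  # (true, false)
(d == s, d === s)  # (true, false)
```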

@rfourquet rfourquet added the collections Data structures holding multiple items, e.g. sets label Aug 31, 2017
@rfourquet
Member Author

rfourquet commented Sep 2, 2017

I now added mutating operations for AbstractVector as well. Moreover:

  1. intersect was the only operation among the four that preserved multiplicity; I thought at first it was a bug, until I discovered the specification in the docstring (it was not exercised in the test folder, though). I allowed myself to remove this inconsistent special case, as it seems arbitrary and error-prone: all set-like operations now return unique'd vectors.
  2. It became clear that, for performance, setdiff! must accept an arbitrary number of arguments: a Set is constructed in the process, and foldl(v, setdiff!, itrs) would unnecessarily create too many of them.
  3. the semantics of symdiff for IntSet and arrays were not consistent regarding multiplicity in the arguments:
julia> symdiff(IntSet([]), [1, 1])
IntSet([])

julia> symdiff([], [1, 1])
1-element Array{Any,1}:
 1

I harmonized the behaviors in favor of the first, which is the most natural to implement, and the more general: it is always possible to call unique on the arguments to get the second behavior.
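Concretely, assuming a Julia version that includes this change, the harmonized vector semantics from the list above look like:

```julia
# All set-like operations on vectors now return unique'd vectors,
# preserving the order of first appearance:
union([1, 1, 2], [2, 3])            # [1, 2, 3]
intersect([1, 2, 2, 3], [2, 3, 4])  # [2, 3]  (previously [2, 2, 3]: first argument's multiplicity was kept)
symdiff([1, 2], [2, 3])             # [1, 3]

# Point 3: multiplicity in the arguments toggles membership for symdiff,
# so an element occurring an even number of times cancels out:
symdiff(Int[], [1, 1])              # Int[]  (now matches the IntSet/BitSet behavior)
```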

@rfourquet rfourquet added breaking This change will break code potential benchmark Could make a good benchmark in BaseBenchmarks needs news A NEWS entry is required for this change labels Sep 2, 2017
@rfourquet
Member Author

Is anyone against this new behavior? Otherwise I will implement the deprecations for the breaking parts.

@TotalVerb
Contributor

Does the increased generality of non-uniquing operations apply in all cases though, and argue for vector union and intersect to treat them as multisets?

@rfourquet
Member Author

rfourquet commented Sep 5, 2017

Does the increased generality of non-uniquing operations apply in all cases though, and argue for vector union and intersect to treat them as multisets?

Very good remark! It could be true, but I guess that implementing these semantics efficiently would require implementing a real MultiSet type... Moreover, it's already non-negligible work to implement just set semantics for vectors while keeping order; I wonder how difficult or easy it would be to mix this order-preserving property with multiset semantics.

So in short, what I came up with is the most pragmatic solution I could find: treat vectors as sets as much as possible, while keeping consistency (when possible) with operations on real sets, and where possible keeping the implementation straightforward (if the implementation has to become too complex for this functionality, it may be a sign that it belongs in a package). But there may be a better compromise.

EDIT: BTW, the master version of intersect on vectors, which maintains "multiplicity of the first argument", still does not implement what would be a mathematical intersection of multisets.
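To illustrate the EDIT above: a true multiset intersection takes the minimum of each element's multiplicities across the arguments, which is not what keeping "the multiplicity of the first argument" does. A hypothetical helper (not part of Base; the name is mine) sketching the mathematical definition:

```julia
# Multiset intersection: each element appears min(count in a, count in b) times.
function multiset_intersect(a::AbstractVector, b)
    # Count the multiplicity of each element of b.
    counts = Dict{eltype(b),Int}()
    for x in b
        counts[x] = get(counts, x, 0) + 1
    end
    # Keep each element of a only while b still has a matching occurrence.
    out = similar(a, 0)
    for x in a
        if get(counts, x, 0) > 0
            push!(out, x)
            counts[x] -= 1
        end
    end
    return out
end
```

With the old "multiplicity of the first argument" behavior, intersect([1, 1], [1]) returned [1, 1]; the multiset definition gives [1].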

@rfourquet rfourquet added the triage This should be discussed on a triage call label Dec 13, 2017
@JeffBezanson
Member

Sounds good but doesn't need to be on the milestone I think?

@rfourquet
Member Author

rfourquet commented Dec 14, 2017

but doesn't need to be on the milestone I think?

It's a breaking change for intersect and symdiff, see my post above (I hope not many people relied on the old behavior), so I would say it needs a decision before the feature freeze.

@rfourquet
Member Author

Also, if we make this change: do we first deprecate the relevant methods in 0.7 and re-introduce the new ones in 1.0, or do we directly make the breaking change with a NEWS.md item? (I favor the latter.)

@StefanKarpinski
Member

I think we can directly make the change with a news item. Seems unlikely to cause that much trouble. (Famous last words)

@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Dec 14, 2017
@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Dec 14, 2017
@StefanKarpinski
Member

From triage, please do rebase this and merge once CI passes.

@rfourquet
Member Author

I can't run benchmarks here yet (the BaseBenchmarks PR is not merged yet), but if I remember correctly, this was also giving nice performance improvements!

@rfourquet rfourquet force-pushed the rf/few-set-methods branch 2 times, most recently from ca4bdf2 to a74110c on December 19, 2017 at 20:12
@rfourquet
Member Author

@nanosoldier runbenchmarks("collection", vs=":master")

@rfourquet
Member Author

After 4 months, I'm finally able to run Nanosoldier :)
A few "regressions" are expected:

  • symdiff(::Union{BitSet,Set}) makes a copy, where it used to return the argument (a no-op)
  • symdiff(::Vector) makes a copy and does some work, where it used to return the argument (a no-op)
  • intersect(::Vector, ...) does different work: before, it roughly filtered the vector with a predicate checking membership in the set of all arguments but the first; now it performs a more set-like operation.

Let's see if I missed something. Should be good to go otherwise.

@rfourquet
Member Author

@nanosoldier runbenchmarks("collection", vs=":master")


@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@StefanKarpinski
Member

Some of those slowdowns are pretty large – it sounds like that might be correct and expected. If so, please go for it, but I'll let you judge if they're expected or not, @rfourquet.

@rfourquet
Member Author

So no bad surprises: all these slowdowns are expected. Besides the no-op cases, only intersect on a Vector is slower. But before, you would get a sort of multiset, i.e. with non-unique values in the result, which really made intersect(::Vector) the odd one out; to get a set-like vector, you would have to call unique! on it.

Moreover, I made a "mistake" while writing those benchmarks, if my goal was (and it was!) to support this PR. Consider intersect(v::Vector, itr): the old code would test, for each element x of v, whether x is an element of itr. Unless itr is a set-like object (or at least has an efficient in), this is linear complexity for each element of v! The new code instead uses an intermediate Set and does a single pass over itr (and itr2 etc. if there are more arguments). But the benchmark intersects the vector only with small collections, where the linear behavior doesn't matter much, quite the contrary. So even when calling unique!, the old code will be faster in some (corner) cases, but in the general case the new code has much better algorithmic complexity; my benchmark unfortunately fails to exhibit this. For example, for something like a = [1:10000;], intersect(a, a, a) is about 30 times faster with this PR. To be honest, there are still cases where I could make the code cleverer depending on the arguments, to close the gap where the old code was faster.

So let's not let this overshadow all the other amazing perf improvements from this benchmark report, and let's enjoy the new extended functionality!
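The complexity argument above can be sketched as follows; these are simplified stand-ins for the old and new strategies, not the actual Base implementations:

```julia
# Old-style intersect: for each element of v, scan itr for membership.
# If itr has no fast `in`, this is O(length(v) * length(itr)),
# and the result keeps v's multiplicities.
intersect_old(v, itr) = filter(x -> x in itr, v)

# New-style intersect: build a Set once, so membership tests are O(1),
# giving roughly O(length(v) + length(itr)); the result is also unique'd.
function intersect_new(v::AbstractVector, itr)
    s = Set(itr)              # single pass over itr
    seen = Set{eltype(v)}()   # to unique the output
    out = eltype(v)[]
    for x in v
        if x in s && !(x in seen)
            push!(out, x)
            push!(seen, x)
        end
    end
    return out
end
```

For small itr the old version can win (there is no Set to build), which matches the benchmark results above; for large arguments like a = [1:10000;], the single-pass version is dramatically faster.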

@rfourquet rfourquet merged commit 7b1c06a into master Dec 20, 2017
@rfourquet rfourquet deleted the rf/few-set-methods branch December 20, 2017 09:43
@StefanKarpinski
Member

Excellent. Thank you!
