Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added unique!(f, itr) function #30141

Merged
merged 5 commits into from
Dec 5, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 35 additions & 6 deletions base/set.jl
Original file line number Diff line number Diff line change
Expand Up @@ -178,16 +178,41 @@ function unique(f, C)
out
end

# If A is not grouped, then we will need to keep track of all of the elements that we have
# seen so far.
function _unique!(A::AbstractVector)
seen = Set{eltype(A)}()
"""
unique!(f, A::AbstractVector)

Selects one value from `A` for each unique value produced by `f` applied to
elements of `A` , then return the modified A.

# Examples
```jldoctest
julia> unique!(x -> x^2, [1, -1, 3, -3, 4])
3-element Array{Int64,1}:
1
3
4

julia> unique!(n -> n%3, [5, 1, 8, 9, 3, 4, 10, 7, 2, 6])
3-element Array{Int64,1}:
5
1
9

julia> unique!(iseven, [2, 3, 5, 7, 9])
2-element Array{Int64,1}:
2
3
```
"""
function unique!(f, A::AbstractVector)
seen = Set()
Copy link
Member

@andyferris andyferris Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this won't be efficient because this means Set{Any}().

Since relying on inference is naughty, what should probably happen here is a widening pattern. Start with a Set{typeof(first(A))}() and then keep pushing to this set until the element type isn't wide enough. Someone else might know better, but I think the most efficient pattern is then to dispatch again to more-or-less the same function. This is, unfortunately, a little bit of work to implement, sorry.

You'll want an inner function like _unique!(f, A, seen, i) where seen is the current Set and i is the current index The outer unique! function is creates the initial Set and dispatches to the inner function.

i.e. something like

function unique!(f, A::AbstractArray)
    if length(A) <= 1
        return A
    end

    i = firstindex(A)
    t = @inbounds A[i]
    seen = Set{typeof(x)}()
    push!(seen, t)
    return _unique!(f, A, seen, 1, i+1)
end

and the other funciton approximately

function _unique!(f, A::AbstractArray, seen::Set, count::Integer, i::Integer)
    while i <= lastindex(A)
        t = @inbounds A[i]
        if t  seen
            count += 1
            @inbounds A[count] = t
            if typeof(t) <: eltype(seen)
                push!(seen, t)
            else
                seen2 = convert(Set{promote_typejoin(eltype(seen), typeof(t))}, seen)
                push!(seen2, t)
                return _unique!(f, A, seen2, count, i+1)
            end
        end
        i += 1
    end
    return resize!(A, count)
end

I haven't tested the above at all, but I hope it demonstrates the pattern at least. The compiler will elide all the complexity in the standard case that f(first(A)) is inferrable. Also note that I've assumed all AbstractVectors have AbstractUnitRange indices in the above to avoid the need to use iterate, but the other pattern is fine too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree,
I also think that can be in a seperate PR.
Right now, the existing unique(f, iter) (no bang),
does the same thing.
And can be optimised in the same way.

This PR adds a naive implementation that can be improved later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this could be implemented in a different PR as both with and without bang version of unique(f, iter) uses naive method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyferris I can try to implement the inner function _unique within this PR if this is what is needed to merge it but will need some help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - I didn’t mean to block this - iterating later is of course fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyferris you might want to dismiss your review,
However that is done,
(possibly by rereviewing?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyferris Please re review this PR.

idxs = eachindex(A)
y = iterate(idxs)
count = 0
for x in A
if x ∉ seen
push!(seen, x)
t = f(x)
if t ∉ seen
push!(seen,t)
count += 1
A[y[1]] = x
y = iterate(idxs, y[2])
Expand All @@ -196,6 +221,10 @@ function _unique!(A::AbstractVector)
resize!(A, count)
end

# If A is not grouped, then we will need to keep track of all of the elements that we have
# seen so far.
_unique!(A::AbstractVector) = unique!(identity, A::AbstractVector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the call to unique! here will now use a Set{Any} this will decrease regular unique!(f,A)s performance, and I think Andy's objection should be fixed in this PR, since this PR not only adding a new feature that can be optimized later, it makes performance of existing code worse. It would at least be good if you could present some benchmarks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will try to build the inner _unique function as suggested by @andyferris.In the meantime I will benchmark the existing solution and will present it here.

Copy link
Contributor Author

@raghav9-97 raghav9-97 Dec 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fredrikekre When I run @btime unique!(x->2x, 1:10_000); it says
setindex! not defined for UnitRange{Int64}
I couldn’t understand what the error message says?
Same problem occurs with the @benchmark macro.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because 1:10_000 is an immutable array, try instead @btime unique!(x->2x, $([1:10_000;]))

Copy link
Contributor Author

@raghav9-97 raghav9-97 Dec 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some benchmarking results for unique!(f, iter) and unique!(iter) are stated:
@benchmark unique!(x->x^2, $([1:10_000;]))

  memory estimate:  786.97 KiB
  allocs estimate:  37981
  --------------
  minimum time:     995.317 μs (0.00% GC)
  median time:      1.050 ms (0.00% GC)
  mean time:        1.159 ms (3.08% GC)
  maximum time:     38.007 ms (96.42% GC)
  --------------
  samples:          4305
  evals/sample:     1

@benchmark unique!($([1:10_000;]))

  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.053 μs (0.00% GC)
  median time:      38.081 μs (0.00% GC)
  mean time:        41.537 μs (0.00% GC)
  maximum time:     10.190 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Copy link
Contributor

@oxinabox oxinabox Dec 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I lead you down the wrong path with my coimment on #30141 (comment)

Replacing _unique!(iter) with unique!(identity, iter) would only have (certainly) been optimised away if the Set{Any} had not been needed.

Can you also report benchmarks for julia master for the old performance of unique!($([1:10_000;])), without this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a branch, ajf/faster_unique_f. Completely untested, but I'll finish it up.

@fredrikekre I was thinking it would be better to have small PRs with seperate concerns, as Lyndon suggests. Let's simply merge this feature PR from @raghav9-97 (his first Julia PR if I understand) and I'll follow up straight away with the performance PR that addresses both unique and unique!. Are you OK with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah, this particular line does leave us in an bad intermediate state... oh well.


# If A is grouped, so that each unique element is in a contiguous group, then we only
# need to keep track of one element at a time. We replace the elements of A with the
# unique elements that we see in the order that we see them. Once we have iterated
Expand Down
4 changes: 4 additions & 0 deletions test/sets.jl
Original file line number Diff line number Diff line change
Expand Up @@ -398,6 +398,10 @@ end
unique!(u)
@test u == [5,"w","we","r"]
u = [1,2,5,1,3,2]
@test unique!(x -> x ^ 2, [1, -1, 3, -3, 5, -5]) == [1, 3, 5]
@test unique!(n -> n % 3, [5, 1, 8, 9, 3, 4, 10, 7, 2, 6]) == [5, 1, 9]
@test unique!(iseven, [2, 3, 5, 7, 9]) == [2, 3]
@test unique!(x -> x % 2 == 0 ? :even : :odd, [1, 2, 3, 4, 2, 2, 1]) == [1, 2]
end

@testset "allunique" begin
Expand Down