
Hint on the number of tags in tagmap in Julia #387

Merged · 8 commits merged into jinyus:main on Nov 1, 2023
Conversation

@Tortar (Contributor) commented Oct 29, 2023

Also removed the T, to be a bit more explicit.

(the rules state that there are at most 100 tags)

@Moelf (Contributor) commented Oct 29, 2023

Hmm, I don't like that you're hard-coding UInt32 so many times; why not just keep T?

@Moelf (Contributor) commented Oct 29, 2023

Also, sizehint! on a dictionary doesn't work the way you want it to.

@Tortar (Contributor, Author) commented Oct 29, 2023

I think having T when it is always fixed to UInt32 is not useful; it is like writing T = Int, in my opinion. I don't see much point in keeping T since there is no better type we could use anyway, so we will probably never change it.

@Tortar (Contributor, Author) commented Oct 29, 2023

I have used sizehint! on dictionaries before and it did improve performance; in this case, though, it is not very useful since there are only 100 tags.
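
For reference, a minimal sketch (not from this PR) of what such a hint looks like for the tag map, assuming the ~100-tag bound from the rules:

    # Pre-reserve capacity for the tag map so it doesn't rehash while filling.
    tagmap = Dict{String,Vector{UInt32}}()
    sizehint!(tagmap, 100)  # hint: at most ~100 distinct tags per the rules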

@Tortar (Contributor, Author) commented Oct 29, 2023

Also (this is incredible, is it a bug??), if I use T = UInt32 in the concurrent version:

Concurrent version with T=UInt32
module RelatedCon

using JSON3, Dates, StaticArrays, ChunkSplitters
using Base.Threads: @threads, nthreads

export main

const topn = 5

struct PostData
    _id::String
    title::String
    tags::Vector{String}
end

struct RelatedPost
    _id::String
    tags::Vector{String}
    related::SVector{topn, PostData}
end


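# Maintain the `topn` largest values of `xs` in `maxv` (kept in ascending order)
# together with their indices in `maxn`; `maxn` is reversed at the end so the
# highest count comes first.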
function fastmaxindex!(xs::Vector{T}, topn, maxn, maxv) where {T}
    maxn .= one(T)
    maxv .= zero(T)
    top = maxv[1]
    for (i, x) in enumerate(xs)
        if x > top
            maxn[1], maxv[1] = i, x
            for j in 2:topn
                if maxv[j-1] > maxv[j]
                    maxv[j-1], maxv[j] = maxv[j], maxv[j-1]
                    maxn[j-1], maxn[j] = maxn[j], maxn[j-1]
                end
            end
            top = maxv[1]
        end
    end
    reverse!(maxn)
    return
end

function related(posts)
    T = UInt32
    # key is every possible "tag" used in all posts
    # value is the indices of all posts that use this tag
    tagmap = Dict{String,Vector{T}}()
    for (idx, post) in enumerate(posts)
        for tag in post.tags
            tags = get!(() -> T[], tagmap, tag)
            push!(tags, idx)
        end
    end

    relatedposts = Vector{RelatedPost}(undef, length(posts))

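    # Split the posts into one contiguous chunk per thread; each task gets its
    # own scratch buffers below.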
    @threads for (postsrange, _) in chunks(posts, nthreads())
        topn = 5
        maxn = MVector{topn,T}(undef)
        maxv = MVector{topn,T}(undef)
        taggedpostcount = Vector{T}(undef, length(posts))

        for i in postsrange
            post = posts[i]

            taggedpostcount .= zero(T)
            # for each post (`i`-th)
            # and every tag used in the `i`-th post
            # give all related post +1 in `taggedpostcount` shadow vector
            for tag in post.tags
                for idx in tagmap[tag]
                    taggedpostcount[idx] += one(T)
                end
            end
            # don't self count
            taggedpostcount[i] = zero(T)

            fastmaxindex!(taggedpostcount, topn, maxn, maxv)

            relatedpost = RelatedPost(post._id, post.tags, SVector{topn}(@view posts[maxn]))
            relatedposts[i] = relatedpost
        end
    end
    return relatedposts
end

function main()
    json_string = read(@__DIR__()*"/../../../posts.json", String)
    posts = JSON3.read(json_string, Vector{PostData})
    fake_posts = fill(posts[1], length(posts))
    related(fake_posts) #warmup

    start = now()
    all_related_posts = related(posts)
    println("Processing time (w/o IO): $(now() - start)")

    open(@__DIR__()*"/../../../related_posts_julia_con.json", "w") do f
        JSON3.write(f, all_related_posts)
    end
end

end # module RelatedCon

Julia becomes 200x slower D:

@Moelf (Contributor) commented Oct 29, 2023

If we use T, at least you don't have to change the code in 10 different places when we need to change the type.
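
(As an illustration only, not what the PR does: one way to keep a single point of change without hard-coding UInt32 everywhere is a module-level const alias; the name below is hypothetical.)

    # Hypothetical alias: the index type is declared in one place only,
    # and, being a const global, it is not captured or boxed by closures.
    const TagIdx = UInt32

    tagmap = Dict{String,Vector{TagIdx}}()
    counts = TagIdx[]  # element type follows the alias automatically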

@Tortar (Contributor, Author) commented Oct 29, 2023

Yes, indeed this is a good reason to keep it, I will restore it then. But why is that concurrent version so slow? Does it happen on your computer as well?

@Moelf (Contributor) commented Oct 29, 2023

I don't know, maybe I'll get an Azure VM tomorrow...

@Tortar (Contributor, Author) commented Oct 29, 2023

But could you please confirm that you also see that slowdown on your machine? It seems like something worth turning into an MWE to report.

My specs:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 12 on 12 virtual cores

@Tortar (Contributor, Author) commented Oct 29, 2023

> I don't know, maybe I'll get an Azure VM tomorrow...

This would probably be helpful. I have to say that I see a lot of variance on my computer as well, e.g. in the last @jinyus benchmark at #386 (comment):

    Benchmark 1: julia -O3 --project=Related -e "using Related; main()"
    Processing time (w/o IO): 30 milliseconds
    Processing time (w/o IO): 23 milliseconds
    Processing time (w/o IO): 22 milliseconds
    Processing time (w/o IO): 30 milliseconds
    Processing time (w/o IO): 24 milliseconds
    Processing time (w/o IO): 30 milliseconds
    Processing time (w/o IO): 30 milliseconds
    Processing time (w/o IO): 30 milliseconds # why so many oscillations?
    Processing time (w/o IO): 23 milliseconds # why so many oscillations?
    Processing time (w/o IO): 23 milliseconds
    Processing time (w/o IO): 23 milliseconds
    Processing time (w/o IO): 23 milliseconds
    Processing time (w/o IO): 23 milliseconds
      Time (mean ± σ):      3.357 s ±  0.026 s    [User: 3.208 s, System: 0.253 s]
      Range (min … max):    3.321 s …  3.393 s    10 runs

@Tortar (Contributor, Author) commented Oct 29, 2023

But I see the same oscillations on my computer as well, and interestingly the oscillations happen much more between different Julia sessions (e.g. if I restart Julia and then repeat the benchmark I can get somewhat different results).

@Tortar (Contributor, Author) commented Oct 29, 2023

Now the PR only adds -O3 + sizehint!, which should be okay.

@Tortar (Contributor, Author) commented Oct 29, 2023

Anyway, that problem with T = UInt32 is probably an instance of JuliaLang/julia#15276.
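
For context, a minimal sketch (not from this PR) of the captured-variable pattern that issue describes, applied to the tag-map loop above; the function names are made up:

    # Sketch of the pattern JuliaLang/julia#15276 is about: `T` is a
    # function-local variable captured by the `() -> T[]` closure passed to
    # `get!`. Depending on how the surrounding code is lowered, such captures
    # can end up boxed and cost type stability in the hot loop.
    function build_tagmap_captured(posts)
        T = UInt32
        tagmap = Dict{String,Vector{T}}()
        for (idx, post) in enumerate(posts)
            for tag in post.tags
                push!(get!(() -> T[], tagmap, tag), idx)  # closure captures `T`
            end
        end
        return tagmap
    end

    # One possible workaround: avoid capturing a local at all by passing the
    # default container directly (or by hard-coding / aliasing the type).
    function build_tagmap_plain(posts)
        tagmap = Dict{String,Vector{UInt32}}()
        for (idx, post) in enumerate(posts)
            for tag in post.tags
                push!(get!(tagmap, tag, UInt32[]), idx)
            end
        end
        return tagmap
    end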

@Moelf (Contributor) commented Oct 30, 2023

This is known, see #182.

@Tortar (Contributor, Author) commented Oct 30, 2023

Let me know if you think this is worth merging, @Moelf.

@Moelf (Contributor) commented Oct 30, 2023

I'm neutral; honestly, I kind of want to remove -O3.

@Tortar (Contributor, Author) commented Oct 30, 2023

The hint seems worth keeping to me, so I think this is ready to be merged, @jinyus.

@Tortar (Contributor, Author) commented Oct 30, 2023

(I also removed -O3 in Related.jl.)

@jinyus (Owner) commented Oct 30, 2023

main:
Julia | 28.31 ms | 337.00 ms | 2.92 s

This PR:

Julia:

        Benchmark 1: julia --project=Related -e "using Related; main()"
        Processing time (w/o IO): 23 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 24 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 25 milliseconds
        Processing time (w/o IO): 24 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 31 milliseconds
        Processing time (w/o IO): 23 milliseconds
        Processing time (w/o IO): 30 milliseconds
          Time (mean ± σ):      3.230 s ±  0.026 s    [User: 3.090 s, System: 0.243 s]
          Range (min … max):    3.194 s …  3.268 s    10 runs
         
Julia:

        Benchmark 1: julia --project=Related -e "using Related; main()"
        Processing time (w/o IO): 481 milliseconds
        Processing time (w/o IO): 363 milliseconds
        Processing time (w/o IO): 363 milliseconds
          Time (mean ± σ):      3.727 s ±  0.025 s    [User: 3.516 s, System: 0.312 s]
          Range (min … max):    3.709 s …  3.744 s    2 runs
         
Julia:

        Benchmark 1: julia --project=Related -e "using Related; main()"
        Processing time (w/o IO): 4221 milliseconds
        Processing time (w/o IO): 3162 milliseconds
        Processing time (w/o IO): 3164 milliseconds
          Time (mean ± σ):      6.839 s ±  0.001 s    [User: 6.605 s, System: 0.329 s]
          Range (min … max):    6.838 s …  6.840 s    2 runs

@jinyus jinyus marked this pull request as draft October 30, 2023 16:02
@Moelf (Contributor) commented Oct 31, 2023

For me, without sizehint!() it is actually faster (Azure VM + Docker), but I'm sure it's just fluctuation. I still support removing -O3.

@Moelf (Contributor) commented Oct 31, 2023

@jinyus maybe it's time to consider using the minimum instead of the mean as the proxy metric, because virtual CPUs are so unstable.
https://blog.kevmod.com/2016/06/10/benchmarking-minimum-vs-average/

In particular, when running on a vCPU we are very likely to get unlucky in a big way, but we can never be too lucky since the program is deterministic, so the minimum should be a better proxy.
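
(An illustrative sketch, not from the thread: with one-sided noise on top of a deterministic runtime, the minimum tracks the true cost while the mean is pulled upward.)

    using Statistics, Random

    Random.seed!(1)                       # arbitrary seed, for reproducibility
    true_time = 3.0                       # "real" deterministic runtime, seconds
    noise     = 0.3 .* abs.(randn(10))    # interference can only slow a run down
    runs      = true_time .+ noise

    println("mean    = ", mean(runs))     # biased upward by unlucky runs
    println("minimum = ", minimum(runs))  # closest to the deterministic cost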

@Tortar (Contributor, Author) commented Nov 1, 2023

I have to say, @Moelf, that I disagree: this seems like a problem only with Julia and, to a much lesser extent, NodeJS. Look at the raw results of the last run at https://github.com/jinyus/related_post_gen/pull/393/files#diff-2a2b2d627a2e6ead1faf8f7a575d8662c2dfbc73505e094767591470c213c01b: Julia performed badly because its results varied so much. The only other language with this kind of pattern is NodeJS, and there the oscillations are smaller. The cause is unclear to me. The only thing I can think of is that starting the process again and again takes time in Julia while in other languages it doesn't, so there is a bigger delay between one run and the next. But I don't really see what that would imply, actually.

@Moelf (Contributor) commented Nov 1, 2023

Can we merge this, since it seems like -O3 did harm? See #393.

@Tortar Tortar marked this pull request as ready for review November 1, 2023 12:26
@Tortar (Contributor, Author) commented Nov 1, 2023

I would try that too, but @jinyus showed the same strange timing patterns for this PR as well. Let's try, though.

@jinyus jinyus merged commit 59d93e4 into jinyus:main Nov 1, 2023