JLD slow/inefficient for Graphs.jl graphs #208

Closed
sbromberger opened this issue Jan 12, 2015 · 5 comments

@sbromberger

Inspired by discussion at JuliaAttic/OldGraphs.jl#37 (comment) et seq., I'd like to help troubleshoot the following (but don't know where to begin to collect data that would be useful):

julia> using GraphCentrality

julia> g = GraphCentrality.readgraph("$(Pkg.dir("GraphCentrality"))/test/testdata/graph-500-50000.csv")
Directed Graph (500 vertices, 50000 edges)

julia> using HDF5, JLD

julia> @time save("mygraph500-50k-c.jld", "g", g, compress=true)
elapsed time: 2.460565192 seconds (125599788 bytes allocated, 3.18% gc time)

shell> ls -l mygraph500-50k-c.jld
-rw-r--r--  1 seth  staff  2982936 Jan 12 09:06 mygraph500-50k-c.jld

vs

julia> @time GraphCentrality.randgraph(500,50000, compress=true)
elapsed time: 0.07615032 seconds (28392044 bytes allocated)
(500,50000)

shell> ls -l graph-500-50000.csv.gz
-rw-r--r--  1 seth  staff  161623 Jan 12 09:08 graph-500-50000.csv.gz

Note that randgraph both generates AND saves the graph structure to disk.

@simonster
Member

There's basically no efficient general way to save objects with references in HDF5, at least using libhdf5. See my comment here.

@sbromberger
Author

Ah, ok - thanks very much for the quick response. I'm assuming this means that a graph-specific persistence mechanism is probably the best way of doing this right now? (I confess to knowing next to nothing about HDF5.)
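One shape a graph-specific persistence mechanism could take is to flatten the adjacency structure into plain integer arrays, so that no object references need to be stored. The sketch below is only an illustration under assumptions: the adjacency-list representation `adj` and the helper names `flatten` and `unflatten` are hypothetical and are not taken from Graphs.jl, GraphCentrality, or JLD.

```julia
# Hypothetical adjacency-list graph: adj[v] holds the out-neighbors of vertex v.
adj = [[2, 3], [3], Int[], [1]]

# Flatten into two plain integer arrays (a CSR-style layout):
# the neighbors of vertex v are targets[offsets[v]:(offsets[v + 1] - 1)].
function flatten(adj)
    offsets = Vector{Int}(undef, length(adj) + 1)
    offsets[1] = 1
    for (v, nbrs) in enumerate(adj)
        offsets[v + 1] = offsets[v] + length(nbrs)
    end
    targets = reduce(vcat, adj; init = Int[])
    return offsets, targets
end

# Rebuild the adjacency lists from the two flat arrays.
function unflatten(offsets, targets)
    return [targets[offsets[v]:(offsets[v + 1] - 1)] for v in 1:(length(offsets) - 1)]
end

offsets, targets = flatten(adj)
```

Two flat integer arrays like these are the kind of data that array-oriented formats generally store compactly, which is the motivation for flattening before saving.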

@timholy
Member

timholy commented Jan 12, 2015

@sbromberger, some clues about a possible approach might be gleaned by reading through #191. I'll put this on my list to finish that up (addressing @simonster's excellent suggestions, which are a better place to start than my code) and write the relevant docs.

@simonster
Member

I went back and looked at this as part of my work on JLD2. It looks like most of the time above for JLD is actually compilation time. After warmup, I get:

julia> g = @time GraphCentrality.readgraph("graph-500-50000.csv");
  65.953 milliseconds (706 k allocations: 44485 KB, 13.10% gc time)

julia> @time save("mygraph500-50k-c.jld", "g", g, compress=true)
  51.167 milliseconds (21240 allocations: 4629 KB)

julia> @time @load "mygraph500-50k-c.jld" g;
  46.615 milliseconds (22681 allocations: 8126 KB, 19.19% gc time)
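The warmup effect can be reproduced in isolation by timing the same call twice in a fresh session. This is a minimal sketch using only the `Serialization` standard library; the `edges` vector is a hypothetical stand-in for a graph and `roundtrip` is an illustrative helper, neither taken from the thread.

```julia
using Serialization

# Hypothetical stand-in for a graph: 50000 random (src, dst) pairs over 500 vertices.
edges = [(rand(1:500), rand(1:500)) for _ in 1:50_000]

# Write x to path, then read it back.
function roundtrip(path, x)
    open(io -> serialize(io, x), path, "w")
    return open(deserialize, path)
end

path = tempname()
t_first = @elapsed roundtrip(path, edges)    # first call: includes JIT compilation
t_second = @elapsed roundtrip(path, edges)   # later call: mostly steady-state run time
println("first call: ", t_first, " s, second call: ", t_second, " s")
```

In a fresh session the first measurement would typically be the larger one, since it includes compiling the methods involved; only the later measurements reflect steady-state I/O cost.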

So after warmup, it seems that JLD can beat readgraph. But JLD2 (which is still a WIP and is neither fully optimized nor guaranteed to work for everything) is substantially faster:

julia> @time (f = JLD2.jldopen("test.jld", "w"); write(f, "a", g); close(f))
   7.232 milliseconds (4635 allocations: 210 KB)

julia> @time (f = JLD2.jldopen("test.jld", "r"); read(f, "a"); close(f))
  11.700 milliseconds (21015 allocations: 4590 KB)

If I compare to serialize/deserialize:

julia> @time (f = open("test.jls", "w"); serialize(f, g); close(f))
   5.734 milliseconds (4096 allocations: 401 KB)

julia> @time (f = open("test.jls", "r"); deserialize(f); close(f))
   6.129 milliseconds (7135 allocations: 3924 KB)

So JLD2 is within a factor of 2 of serialize/deserialize, at least. The file sizes are comparable for JLD2 and serialize, but both are much larger than the csv:

➜  ~ ls -l test.jld
-rw-r--r--  1 simon  staff  3671119 Jul 27 10:41 test.jld
➜  ~ ls -l test.jls
-rw-r--r--  1 simon  staff  3635354 Jul 27 10:50 test.jls
➜  ~ ls -la graph-500-50000.csv 
-rw-r--r--  1 simon  staff  428321 Jul 27 10:33 graph-500-50000.csv

With additional optimization JLD2 can probably be even faster, but if you want to bring the file size down, then custom serialization will probably be the only way.
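As a rough illustration of what custom serialization could look like, the sketch below writes an edge list in a compact binary layout. Everything here is an assumption made for the example: the layout (two `UInt32` counts followed by `UInt16` sources and destinations), the names `save_edges` and `load_edges`, and the random `edges` data are hypothetical and are not the format used by `readgraph` or `randgraph`.

```julia
# Hypothetical stand-in for a directed graph: 500 vertices, 50000 random edges.
nv = 500
edges = [(rand(1:nv), rand(1:nv)) for _ in 1:50_000]

# Assumed layout: vertex count and edge count as UInt32, then all sources
# followed by all destinations as UInt16 (so at most 65535 vertices).
# Values are written in the host's native byte order.
function save_edges(path, nv, edges)
    open(path, "w") do io
        write(io, UInt32(nv), UInt32(length(edges)))
        write(io, UInt16[e[1] for e in edges])
        write(io, UInt16[e[2] for e in edges])
    end
end

function load_edges(path)
    open(path) do io
        nv = Int(read(io, UInt32))
        ne = Int(read(io, UInt32))
        src = Vector{UInt16}(undef, ne)
        dst = Vector{UInt16}(undef, ne)
        read!(io, src)
        read!(io, dst)
        return nv, [(Int(s), Int(d)) for (s, d) in zip(src, dst)]
    end
end

path = tempname()
save_edges(path, nv, edges)
println(filesize(path), " bytes")
```

With this layout the file takes 8 header bytes plus 4 bytes per edge, so 200008 bytes for 50000 edges, before any compression. The trade-off is that the format is tied to one graph representation and has to be versioned and maintained by hand.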

@kleinhenz
Contributor

Closing, since JLD is no longer part of HDF5.jl.
