Incredibly bad performance when saving this object. #65

Closed
Skylion007 opened this issue May 5, 2016 · 7 comments

@Skylion007

Skylion007 commented May 5, 2016

So I found this useful repository that lets you compute a Simon Funk SVD in Julia. I built a massive one from 170 million user rankings, and it worked beautifully, taking about three hours to converge. To save time in future runs, I thought it would be a great idea to export the "model" object returned by the train function to the JLD format. So I did. An hour and a half ago... So far JLD has allocated a file of approximately 800MB but has only written 750MB of it, and it appears to be slowing down, taking longer and longer to fill the file. Also, my resource monitor (Windows 10) is reporting 300MB/s of SSD access from Julia, even though at that rate the file would have been written in about 3 seconds. I am a little confused about why the performance is so bad. I mean, I suffered slow performance when reading the CSVs from my dataset, but WOW. That was 5GB of data and it took 10 minutes. By that measure, it should take significantly under an hour to save 800MB. Any suggestions on what could be causing the incredibly bad performance? I just installed Julia today and am running the latest version on 64-bit Windows 10.

@timholy
Member

timholy commented May 5, 2016

Probably JuliaIO/HDF5.jl#170. See if changing the HDF5 format helps; if not, tell us more about the structure of what you're trying to save.

@Skylion007
Author

Skylion007 commented May 5, 2016

I see. Is there any way I can use JLD to save in a better format, then? I unfortunately have very little experience with HDF5. How would I go about saving it more efficiently? The definition for the item I am trying to save can be found here. I'll also post it at the bottom of this message for good measure.

Also, this seems like a serious bug with JLD. If the performance scales so badly, why not have JLD save it in the newer format? JLD can read the old format just fine, but it should also be able to read the new format, so not having a way to save data using the new format in JLD is a terrible proposition.


```julia
# An SVD model.
type RatingsModel
    # An index between human-readable user names and an interval of integer ids.
    user_to_index::Dict{AbstractString, Int32}
    # An index between human-readable item names and an interval of integer ids.
    item_to_index::Dict{AbstractString, Int32}
    # U, S, and V form the SVD decomposition of the original matrix, so
    # U * diagm(S) * V' will yield the model's approximation of the original
    # matrix. U is a (number of users) x (number of features) matrix, S is
    # a list of (number of features) singular values, and V is a (number of
    # items) x (number of features) matrix.
    U::Array{Float32,2}
    S::Array{Float32,1}
    V::Array{Float32,2}
end
```

@Skylion007
Author

Skylion007 commented May 5, 2016

@timholy I am unable to see how I can change the format to store matrices in HDF5 or in JLD? How would I serialize the objects otherwise? I'd also like to confirm whether this is a "won't fix" issue, meaning I will have to use HDF5 directly to save the file.

@timholy
Member

timholy commented May 5, 2016

> I see. Is there any way I can use JLD to save in a better format, then?

Did you read this comment? JuliaIO/HDF5.jl#170 (comment).

> Also, this seems like a serious bug with JLD. If the performance scales so badly, why not have JLD save it in the newer format? JLD can read the old format just fine, but it should also be able to read the new format, so not having a way to save data using the new format in JLD is a terrible proposition.

First, these are all issues with the underlying C library, not JLD. The question is whether JLD can work around the problems in the library. Second, I've never been affected by the bug, and for those who have, no one has ever bothered to report back about whether this fixes it. Consequently, it's rather hard to know whether to apply the fix. Third, we could just do it anyway, but there's a downside: suppose I have two machines, one with libhdf5 1.6 and the other with libhdf5 1.10. The one with the new lib will be able to read both, but the older machine won't be able to read files created by the newer machine unless the older format was used.

So, rather than complain, you have the opportunity to help out 😄: try it and let us know if it works. If it does, it's a powerful argument to make the change despite the fact that it's 100% certain to annoy some people.

> I am unable to see how I can change the format to store matrices in HDF5 or in JLD?

That issue also linked to https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#custom-serialization. The likely problem is that you're storing a bunch of small objects, and that's the source of libhdf5's performance problem. Instead, assuming you're saving a Vector{RatingsModel}, upon saving/loading change the format and pack all the U arrays into one giant Array{Float32,3} (so the kth ratings model is Ubig[:,:,k]), and similarly for all the S and V. I do this frequently, and it often saves two orders of magnitude on the IO time.
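The packing step described above can be sketched roughly as follows. This is only an illustration in current Julia syntax (`struct` rather than the `type` keyword used earlier in the thread), using plain Base arrays and no JLD or HDF5 calls. `SmallModel`, `pack_U`, and `unpack_U` are hypothetical names invented for the example, and the sketch assumes every `U` matrix has the same size.

```julia
# Hypothetical stand-in for RatingsModel, trimmed to its array fields.
struct SmallModel
    U::Matrix{Float32}
    S::Vector{Float32}
    V::Matrix{Float32}
end

# Pack the U matrix of every model into one 3-D array, so that
# Ubig[:, :, k] is the U of the k-th model. Assumes all U share one size.
function pack_U(models::Vector{SmallModel})
    m, n = size(models[1].U)
    Ubig = Array{Float32,3}(undef, m, n, length(models))
    for (k, model) in enumerate(models)
        size(model.U) == (m, n) || error("all U matrices must have the same size")
        Ubig[:, :, k] = model.U
    end
    return Ubig
end

# Inverse step for loading: slice the big array back into per-model matrices.
unpack_U(Ubig::Array{Float32,3}) = [Ubig[:, :, k] for k in 1:size(Ubig, 3)]

# Five small models with 4x2 U, length-2 S, and 3x2 V.
models = [SmallModel(rand(Float32, 4, 2), rand(Float32, 2), rand(Float32, 3, 2)) for _ in 1:5]
Ubig = pack_U(models)
```

The same pattern would apply to `S` (packed into a matrix) and `V`. The point is that the file then holds a handful of large datasets instead of many small ones, which is the case the comment above identifies as slow in libhdf5.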

@timholy
Member

timholy commented May 9, 2016

Did setting libver_bounds fix the problems?

@Skylion007
Author

I tried a newer version of the library and the issue seems to have been fixed. Thanks for all your help.

@timholy
Member

timholy commented Jul 7, 2016

Yay, thanks for testing and letting us know!
