Very slow reading of long strings in v0.14 #742

Closed
chrisbrahms opened this issue Nov 18, 2020 · 7 comments

Comments

@chrisbrahms

chrisbrahms commented Nov 18, 2020

I have some HDF5 files which contain very long strings (>300k characters). With HDF5 v0.13.6, I can read these no problem:

import Random
import HDF5

file = HDF5.h5open("string.h5", "cw")  # "cw" = create/read-write; the default "r" is read-only
file["string"] = Random.randstring(300000)
close(file)

using BenchmarkTools
@btime HDF5.h5open("string.h5") do file
    read(file["string"])
end
593.301 μs (12 allocations: 586.38 KiB)

After updating to v0.14.1, however, even a single read of this same dataset takes so long that I haven't been able to measure it. For shorter strings (10k characters) I see a roughly 3-fold slowdown (700 μs vs 220 μs), but nothing so egregious.

All of this is on Windows 10, Julia 1.5.1

Any ideas?

@jmert
Contributor

jmert commented Nov 18, 2020

I can confirm that on Linux with Julia master, I've had to kill the process after a bare read(file["string"]).

@jmert
Contributor

jmert commented Nov 18, 2020

I think this boils down to a (probably bad) use of huge tuples. A read on a fixed string essentially tries to allocate the following buffer:

julia> struct FixedArray{T,L}
           data::NTuple{L,T}
       end

julia> Array{FixedArray{Float64,30000}}(undef, 1)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT

Edit: If you leave the following running long enough, it'll eventually abort. (An error message with a stack trace scrolls by for a long time, so I'm not sure what the actual error was, but I'd guess stack overflow?)

julia> FixedArray{Float64, 30000}(ntuple(zero, 30000))

@jmert
Contributor

jmert commented Nov 18, 2020

@kleinhenz It looks like the old read implementation never tried to realize the FixedArray type, just using it for its type information.

HDF5.jl/src/HDF5.jl

Lines 1341 to 1364 in f4faf71

function read(obj::DatasetOrAttribute, ::Type{Array{A}}) where {A<:FixedArray}
    T = eltype(A)
    if !(T <: HDF5Scalar)
        error("Sorry, not yet supported")
    end
    sz = size(A)
    dims = size(obj)
    data = Array{T}(undef, sz..., dims...)
    nd = length(sz)
    hsz = Hsize[sz[nd-i+1] for i = 1:nd]
    memtype_id = h5t_array_create(hdf5_type_id(T), length(sz), hsz)
    try
        h5d_read(obj.id, memtype_id, H5S_ALL, H5S_ALL, obj.xfer.id, data)
    finally
        h5t_close(memtype_id)
    end
    ret = Array{Array{T}}(undef, dims)
    # Because of garbage-collection concerns, it's best to copy the data
    L = prod(sz)
    for i = 1:prod(dims)
        ret[i] = reshape(data[(i-1)*L+1:i*L], sz)
    end
    ret
end

It looks like that kind of behavior probably needs to be restored. Would a special case within the generic read implementation be a reasonable way to do that?

@kleinhenz
Contributor

Yeah, the problem is definitely the giant tuples. We can probably just add a special case to read like we have for opaque datatypes; that would fix the current case, where you are just reading a fixed-string dataset. You basically need the tuple approach to support fixed strings/arrays in compound datatypes, though, which is where this originally came from. It would be nice to have something like NTuple that didn't destroy the compiler, but as far as I know there isn't another solution.
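A minimal illustration (hypothetical type, not from HDF5.jl) of why fixed-length members of compound datatypes push you toward NTuple fields:

```julia
# A fixed-size string member inside a compound datatype must occupy
# inline, fixed-offset storage; in Julia that means an isbits NTuple field.
struct Compound
    x::Float64
    name::NTuple{8,UInt8}   # inline 8-byte fixed-length string member
end

isbitstype(Compound)   # true: the layout is flat, matching an HDF5 compound
sizeof(Compound)       # 16 bytes: 8 for x + 8 inline bytes for name
```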

@kleinhenz
Contributor

Really, what we want is a mechanism that can sometimes allocate a Vector{UInt8} to read into and defer constructing the final type.
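A runnable sketch of that idea (the function name is illustrative and the actual HDF5 read is mocked with a fill!):

```julia
# Hypothetical sketch: read into a plain byte buffer, then construct the
# final String afterwards, so no NTuple{N,UInt8} value is ever materialized.
function read_deferred(rawread!::Function, n::Integer)
    buf = Vector{UInt8}(undef, n)   # cheap to allocate at any length
    rawread!(buf)                   # stands in for h5d_read filling `buf`
    return String(buf)              # build the final type at the very end
end

s = read_deferred(b -> fill!(b, UInt8('x')), 300_000)
```

Note that String(buf) takes over the buffer's storage, so the conversion at the end is cheap as well.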

@jmert
Contributor

jmert commented Nov 19, 2020

Allocating the working buffer can be successfully done by reinterpreting a UInt8 array. Using Julia v1.6's reinterpret(reshape, ...) feature:

julia> struct FS{S}
           data::NTuple{S, UInt8}
       end

julia> a = reinterpret(reshape, FS{30000}, Array{UInt8}(undef, 30000));

julia> eltype(a)
FS{30000}

julia> length(a)
1

There are still major issues, though: trying to print the array a hangs the REPL.

In general, the large tuple appears to be a problem for inference/type dispatch, so you have to try really hard to hide the type. I made some progress in getting the string to read and normalize by rewriting normalize_types to use Refs instead of bare types, but that progress ends up breaking the generic compound-data reading. (I either end up with broken dispatch or junk values, presumably because I'm missing GC.@preserve and/or copies in necessary places...)
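The Ref trick can be sketched roughly like this (illustrative only; the names are hypothetical and this is not the actual normalize_types rewrite):

```julia
# Hiding a huge-tuple type behind a Ref{DataType}: callers' inferred types
# only ever see DataType, so the 300000-element NTuple type stays out of
# dispatch until it is explicitly unwrapped.
struct FS{N}
    data::NTuple{N,UInt8}
end

hidden = Ref{DataType}(FS{300_000})  # deliberately type-unstable wrapper
unwrap(r::Ref{DataType}) = r[]       # inference only sees DataType here
```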

jmert added a commit to jmert/HDF5.jl that referenced this issue Nov 19, 2020
@kleinhenz
Contributor

See JuliaLang/julia#35619. I'm not sure if there is any prospect of this just getting fixed on the julia side.

jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 4, 2020
jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 5, 2020
jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 7, 2020
@jmert jmert closed this as completed in 862d4f7 Dec 7, 2020