Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grab keys via callback iteration rather than many absolute calls #787

Merged
merged 3 commits into from
Dec 28, 2020

Conversation

jmert
Copy link
Contributor

@jmert jmert commented Dec 23, 2020

It appears there's significant overhead in directly obtaining the name
of every sub-element using a loop of h5?_get_name_by_idx calls.
For instance, a v7.3-formatted Matlab save file I have access to has
deeply nested and large struct which are serialized into a huge number
of datasets within a #refs# group.

julia> Int(length(fid["#refs#"]))
36236

Before this change, enumerating the keys takes almost a minute:

julia> @time keys(fid["#refs#"]);
 54.193914 seconds (72.47 k allocations: 3.594 MiB)

but with this change, that's reduced to a fraction of a second:

julia> @time keys(fid["#refs#"]);
  0.013353 seconds (36.24 k allocations: 1.383 MiB)

This is relevant in the context of #786, though it's not enough on its own.

It appears there's significant overhead in directly obtaining the name
of every sub-element using a loop of `h5?_get_name_by_idx` calls.
For instance, a v7.3-formatted Matlab save file I have access to has
deeply nested and large struct which are serialized into a huge number
of datasets within a `#refs#` group.

```julia
julia> Int(length(fid["#refs#"]))
36236
```

Before this change, enumerating the keys takes almost a minute:
```julia
julia> @time keys(fid["#refs#"]);
 54.193914 seconds (72.47 k allocations: 3.594 MiB)
```
but with this change, that's reduced to a fraction of a second:
```julia
julia> @time keys(fid["#refs#"]);
  0.013353 seconds (36.24 k allocations: 1.383 MiB)
```
@musm
Copy link
Member

musm commented Dec 23, 2020

LGTM.

Trying to figure out the v1.12 renaming issue is going to be tricky.

gen/bind_generator.jl Outdated Show resolved Hide resolved
docs/src/api_bindings.md Outdated Show resolved Hide resolved
gen/api_defs.jl Outdated Show resolved Hide resolved
gen/bind_generator.jl Outdated Show resolved Hide resolved
@musm
Copy link
Member

musm commented Dec 24, 2020

diff --git a/gen/gen_wrappers.jl b/gen/gen_wrappers.jl
index 8a53685..89ce993 100644
--- a/gen/gen_wrappers.jl
+++ b/gen/gen_wrappers.jl
@@ -52,9 +52,11 @@ for (mod, desc, urltail) in (
     ("H5TB", "Table Interface", "Tables"),
     )
     global apidocs
-    funcs = join(sort!(bound_api[mod]), "\n")
+    funcs = sort!(unique(bound_api[mod])) # unique removes duplicate methods (occurs when we have multiply defined methods due to hdf5 compatibility)
+    funcs = join(funcs, "\n```\n\n```julia\n")
     apidocs *= """
         ## [`$mod`](https://portal.hdfgroup.org/display/HDF5/$urltail) — $desc
+
         ```julia
         $funcs
         ```

I think it is better to give each method its own code block, which mimics how method docstrings are typically printed out.

@musm
Copy link
Member

musm commented Dec 24, 2020

Also these changes should make it easy to implement an efficient iteration protocol over keys.

gen/bind_generator.jl Outdated Show resolved Hide resolved
@musm
Copy link
Member

musm commented Dec 24, 2020

Alternative to calling unique in #787 (comment)
We might just want to add the filtering in the @bind macro generator phase

diff --git a/gen/bind_generator.jl b/gen/bind_generator.jl
index 0404b30..c27134e 100644
--- a/gen/bind_generator.jl
+++ b/gen/bind_generator.jl
@@ -147,7 +147,7 @@ function _bind(__module__, __source__, sig::Expr, err::Union{String,Expr,Nothing

     # Store the function prototype in HDF5-module specific lists:
     funclist = get!(bound_api, uppercase(prefix), Vector{String}(undef, 0))
-    push!(funclist, string(funcsig))
+    string(funcsig) in funclist || push!(funclist, string(funcsig))  # duplicate method may occur when we have multiply defined methods due to hdf5 compatibility bounds

     # Determine the underlying C library to call
     lib = startswith(string(cfuncname), r"H5(DO|DS|LT|TB)") ? :libhdf5_hl : :libhdf5

@musm
Copy link
Member

musm commented Dec 24, 2020

LGTM I'd just squash and rearange the commits that deal with the binding process

@musm
Copy link
Member

musm commented Dec 24, 2020

The third commit is what actually does the version-specific hacking. I'm not convinced yet that there isn't a smarter way to go about this, but this is what came to mind and I could make work.

Yeah, I'm sure we could make it more elegant or use a different format to parse the compatibility bounds, but it's simple enough and I couldn't think of anything better.

@jmert
Copy link
Contributor Author

jmert commented Dec 27, 2020

LGTM I'd just squash and rearange the commits that deal with the binding process

I've rebased and squashed to just three commits — first changes the key iteration implementation, second gets rid of most of the binding exceptions by having the generator automatically handle version-number suffixes, and the third adds the ability to specify that a particular binding is libhdf5-version-dependent.

Is that the level of squashing you were thinking of?

@musm musm merged commit a73b5c6 into JuliaIO:master Dec 28, 2020
@musm
Copy link
Member

musm commented Dec 28, 2020

Thanks! This is indeed a much needed change, and I'm glad we now have a way to address different library verisons.

@jmert jmert deleted the iterate_keys branch December 28, 2020 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants