Managing array memory #104
I'm not sure what exactly you have in mind. If we really want to go the manual memory management route, I'd look at `Libc`, which provides `malloc`, `realloc`, and `free`. If you have specific use cases in mind, then making your own allocator would be more efficient than a general-purpose one.
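For concreteness, here is a minimal sketch of what the manual route through `Libc` looks like; `manual_sum` is an illustrative name, not part of any package:

```julia
# Hedged sketch: manual allocation via Julia's Libc wrappers
# (Libc.malloc / Libc.free). The buffer is NOT tracked by the GC,
# so the caller owns it and must free it.
function manual_sum(n)
    p = Libc.malloc(n * sizeof(Float64))   # raw Ptr{Cvoid}
    fp = convert(Ptr{Float64}, p)
    for i in 1:n
        unsafe_store!(fp, Float64(i), i)   # writes with no bounds checks
    end
    s = 0.0
    for i in 1:n
        s += unsafe_load(fp, i)
    end
    Libc.free(p)                           # caller is responsible for freeing
    return s
end
```

This is exactly the kind of bookkeeping (ownership, sizing, freeing) that a custom allocator type would wrap up.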
I think JuliaLang/julia#24909 nicely illustrates the problem I'd like to be able to solve. I guess the long term solution would be having a memory buffer type like:

```julia
mutable struct MutableInt <: Integer
    x::Int
end

function unsafe_grow_end!(x::OptionallyStaticUnitRange{Int,MutableInt}, n::UInt)
    x.stop.x += n
end

struct CPUBuffer{T,D<:Ref{T},B,E}
    data::D
    indices::OptionallyStaticRange{B,E}
    flag::UInt16
end

function unsafe_grow_end!(x::CPUBuffer{T,D,B,E}, n::UInt) where {T,D,B,E<:MutableInt}
    @assert x.flag == 0  # flag that x owns data (not shared)
    unsafe_grow_end!(x.indices, n)
    Libc.realloc(x.data, length(x.indices))
    return nothing
end
```

But I'm assuming getting basic garbage collection working for this would be a nightmare, and I was hoping to meet halfway for now.
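A self-contained toy version of that grow-by-`realloc` pattern (all names here are hypothetical, and the element type is fixed to `Float64` to keep it short):

```julia
# Hedged sketch: a buffer that tracks its own length in a mutable field
# and grows with Libc.realloc. Nothing here is GC-tracked.
mutable struct ToyBuffer
    ptr::Ptr{Float64}
    len::Int
end

ToyBuffer(n::Integer) =
    ToyBuffer(convert(Ptr{Float64}, Libc.malloc(n * sizeof(Float64))), n)

function unsafe_grow_end!(b::ToyBuffer, n::Integer)
    b.len += n
    # realloc may move the allocation, so the pointer must be updated
    b.ptr = convert(Ptr{Float64}, Libc.realloc(b.ptr, b.len * sizeof(Float64)))
    return b
end

free!(b::ToyBuffer) = Libc.free(b.ptr)
```

The ownership flag and static/dynamic index types from the sketch above would layer on top of something like this.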
Interesting, I hadn't seen that issue before. I assume you saw the short term solution: https://github.com/tpapp/PushVectors.jl

From the issue, it sounds like the really long term solution for faster pushing is to move more of `Array` into Julia itself. Re garbage collection, making something like this cooperate with the GC is the hard part. Still, it'd be interesting to benchmark it vs base to see just how slow the current approach is.
TBH, I hadn't given code gen a thought. However, according to vtjnash, I'm starting to think we need a better way of talking to the GC and device memory before we can really solve this problem, because the information necessary to make related methods possible is usually the same until we get down to the hardware. How nice would it be if developers had a way to easily interact with this. I'm sure we'll continue to see new places we can improve performance. I know that's sort of hand wavy, but it's difficult to put together. That sort of organic growth and optimization is what makes Julia great. There are a lot of things like this that come up where it should be possible to optimize something a bit more, but we either come up with hacks or leave it alone.
One of the biggest reasons I want this is b/c it would make it possible for AxisIndices.jl to just provide per dimension traits on top of a memory buffer with something like:

```julia
struct NDArray{T,N,B,Axes<:Tuple{Vararg{Any,N}}} <: AbstractArray{T,N}
    buffer::B
    axes::Axes
end
```
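A runnable toy version of that design, assuming a plain `Vector` as the buffer and `Base.OneTo` axes (all names illustrative):

```julia
# Hedged sketch: a dense buffer plus a tuple of per-dimension axes.
struct ToyNDArray{T,N,B<:AbstractVector{T},Axes<:Tuple{Vararg{AbstractUnitRange,N}}} <: AbstractArray{T,N}
    buffer::B
    axes::Axes
end

# Explicit constructor so T and N are inferred from the arguments.
ToyNDArray(buffer::AbstractVector{T}, axes::Tuple{Vararg{AbstractUnitRange,N}}) where {T,N} =
    ToyNDArray{T,N,typeof(buffer),typeof(axes)}(buffer, axes)

Base.axes(A::ToyNDArray) = A.axes
Base.size(A::ToyNDArray) = map(length, A.axes)

# Indexing maps Cartesian indices through the axes onto the linear buffer,
# so all the memory handling stays in the buffer type.
Base.getindex(A::ToyNDArray{T,N}, I::Vararg{Int,N}) where {T,N} =
    A.buffer[LinearIndices(A.axes)[I...]]
```

The point is that the axes carry the per-dimension semantics while the buffer type alone decides how memory is allocated and grown.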
@chriselrod: does the type info provide any more information for how you generate code for vectorization beyond its size?
Which type info? Just the element type, or everything, including size and strides?

```julia
julia> using LoopVectorization, StaticArrays

julia> x = @MVector rand(10);

julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)

julia> @cn sumavx(x)
	.section	__TEXT,__text,regular,pure_instructions
	ldp	q0, q1, [x0]
	ldp	q2, q3, [x0, #32]
	ldr	q4, [x0, #64]
	fadd	v0.2d, v0.2d, v2.2d
	fadd	v1.2d, v1.2d, v3.2d
	fadd	v0.2d, v0.2d, v4.2d
	fadd	v0.2d, v1.2d, v0.2d
	faddp	d0, v0.2d
	ret
```

The asm is completely unrolled. The ARM chip has 128 bit (2 x double) NEON registers. On an AVX-512 x86 CPU:

```asm
	.text
	mov	al, 3
	kmovd	k1, eax
	vmovupd	zmm0 {k1} {z}, zmmword ptr [rdi + 64]
	vmovapd	xmm0, xmm0
	vaddpd	zmm0, zmm0, zmmword ptr [rdi]
	vextractf64x4	ymm1, zmm0, 1
	vaddpd	zmm0, zmm0, zmm1
	vextractf128	xmm1, ymm0, 1
	vaddpd	xmm0, xmm0, xmm1
	vpermilpd	xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
	vaddsd	xmm0, xmm0, xmm1
	vzeroupper
	ret
	nop	dword ptr [rax]
```

This would probably be faster if LoopVectorization used smaller vectors. Perhaps that's an optimization I should implement.

```julia
julia> bitstring(0x03)
"00000011"
```

meaning the first two lanes are on and the remaining vector lanes are off. That is, load only the first two elements of the vector, i.e. just elements 9 and 10.

So in both cases it used the exact size (in the latter case, to create a bit mask). It also needs to know the stride between elements.

```julia
julia> A = @MMatrix rand(2,10);

julia> a = view(A,1,:);

julia> strides(a)
(2,)

julia> ArrayInterface.strides(a)
(static(1),)
```

Because of this, …
I was just thinking of a linear set of elements. I'm trying to address this issue by creating something along those lines. For example, if we didn't need the physical buffer (just size info) to transform indices to an optimized state, then we could do stuff like this...

```julia
function sum(x)
    b, p = buffer_pointer(x)
    inds = to_index(x, :)
    @gc_preserve b out = sum_pointer(p, inds)
    return out
end

function sum_pointer(p, inds)
    out = zero(eltype(p))
    @inbounds for i in inds
        out += p[i]
    end
    return out
end

function unsafe_get_collection(A, inds::AbstractRange)
    axis = to_axis(A, inds)
    bsrc, psrc = buffer_pointer(A)
    bdst, pdst = allocate_memory(A, axis)
    @gc_preserve bsrc bdst copyto!(pdst, indices(axis), psrc, to_index(A, inds))
    return initialize(bdst, axis)
end
```

Then we could take advantage of some of the stride info here for transforming the indices and punt a bunch of the memory stuff off to the packages providing the buffer and pointer types, which would define their own methods for these.

Of course, all of this is moot if it would make what you're doing impossible.
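The `@gc_preserve` above is hypothetical, but Base's `GC.@preserve` already covers the keep-the-owner-rooted-while-using-its-pointer part of the sketch. A runnable analogue, with the buffer/pointer pair played by a `Vector` and its `pointer`:

```julia
# Hedged sketch: sum through a raw pointer while GC.@preserve keeps the
# owning Vector alive. pointer_sum is an illustrative name.
function sum_pointer(p::Ptr{T}, inds) where {T}
    out = zero(T)
    for i in inds
        out += unsafe_load(p, i)   # raw loads; the caller must root the owner
    end
    return out
end

function pointer_sum(x::Vector{Float64})
    p = pointer(x)
    GC.@preserve x begin           # x cannot be collected inside this block
        sum_pointer(p, eachindex(x))
    end
end
```

In the proposed interface, `buffer_pointer` would return such a pair for arbitrary buffer types rather than just `Vector`.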
@chriselrod, correct me if I'm wrong here, but it seems like the path forward is to have ArrayInterface depend on ManualMemory.jl (which I anticipate remaining a very light-weight dependency).
The formal interface for working with an array's memory is currently very limited. We mostly just have ways of accessing preallocated memory (e.g., `unsafe_wrap` and `unsafe_pointer_to_objref`). The rest of the time we need to rely on methods like `push!`/`pop!`/`append!`/`resize!` to mutate a `Vector` down the line. Likewise, if we want to allocate memory for a new array, we typically need to wrap a new instance of `Array`.

A couple reasons this would be good to have here:

- `resize!` has to ensure that we don't create a situation where we are out of bounds. This probably only has minimal overhead in most cases, but if we are doing something complicated like merging and/or sorting two vectors, then we may be calling `resize!` a lot. I recall this sort of thing coming up a lot when I was trying to work with graph algorithms last year that do a lot of insertion and deletion.
- `Vector` directly uses methods that allocate and write to memory (e.g., `Base._growend!` and `Base.arrayset`), while abstract types typically hope that `resize!` works and then use `setindex!`.

From where I'm standing it seems like this would be great to have here. The reason this is an issue instead of a PR is mainly because I'm unsure if the implementation would be too general or involved for this package. We could always define several methods here, like `unsafe_grow_end!` and `unsafe_shrink_end!`, but these aren't super helpful without implementations that can interact with pointers and references.

@chriselrod, do you think growing/shrinking/allocating memory in an efficient way would require a bunch of LLVM magic, or could we do it pretty simply here?
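To make the `Vector` point concrete: `Base._growend!` is the internal (non-public, so this is illustrative only) function that `push!`-style growth bottoms out in. It extends the array without touching the new slots, which is exactly the behavior a generic `unsafe_grow_end!` would formalize:

```julia
# Hedged sketch using Julia internals: grow a Vector without initializing
# the new elements. Base._growend! is not public API and may change.
v = [1, 2, 3]
Base._growend!(v, 2)   # length goes from 3 to 5; v[4] and v[5] are undefined
v[4] = 40              # write the new slots before reading them
v[5] = 50
```

Contrast with `resize!`, which layers bounds bookkeeping on top of this same primitive.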