Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What would the memory layout of Array{Union{T,Null},N} look like? #10

Closed
davidanthoff opened this issue May 17, 2017 · 17 comments
Closed

Comments

@davidanthoff
Copy link

I guess this is mostly a question for @vtjnash. I've read somewhere that things would have a similar memory layout as NullableArray does right now. So I assume the values would be stored in essentially something that looks like Array{T,N}? What about the null mask? I've read something about essentially an array of type tags? Would these type tags be some kind of pointer? If yes, I assume it would take the same amount of storage as an Int? Or would this type tag array use one byte per array element, like NullableArray? Or maybe even use one bit per element to indicate whether there is a value or not? The latter doesn't sound like a type tag array to me, but who knows :)

Would the actual values in the type tag correspond to true and false in some sense? I.e. a similar bit pattern as Bool? I guess if the type tags are more like pointers to types, things would be less straightforward, right?

In any case, any information would be great!

@ararslan
Copy link
Member

Jameson of course will know more given that he's the mad scientist behind the optimizations, but my understanding is that the memory layout will be like two arrays, one with the raw data and one of UInt8s that corresponds to the positions in the Union. (Using UInt8 allows selecting from among a larger Union than Bool would.) That's all on the C side though; AFAIK, on the Julia side it will just be an array with a non-concrete element type.

@nalimilan
Copy link
Member

Also, I guess the UInt8 type tag can be equal to 0 for the first type of the Union, and 1 for the second one, so that for Array{Union{Null, T}} the tag is false for nulls and true for non-nulls. Or maybe another rule for sorting types should be adopted so that order of appearance doesn't matter (e.g. singletons first). Having nulls appear as false would be better than the reverse since this convention is used by other software and binary formats, see JuliaLang/julia#18507.

@davidanthoff
Copy link
Author

I just looked around a bit more how other folks are handling this. Pandas 2.0 seems to move to a bit mask (see here), I think most SQL databases also use a bit mask and Apache Arrow also seems to use bit masks. So all of these use 1 bit, not 1 byte, for their mask.

I did find some very old comment in the NullableArrays issues/PRs that using a BitArray for the mask was faster for some operations but slower for others than Vector{Bool}, but no further discussion what kind of operations these were. I also saw that there were same changes to BitArray in the meantime, so I guess there is even a potential that the performance problems are less severe now?

Does anyone have more insight into that question in general? In particular, do we have some evidence that using a byte instead of a bit for the mask per value will allow us to eventually get similar performance to all the other packages on different platforms that use bits for their masks?

@quinnj
Copy link
Member

quinnj commented Jun 20, 2017

I'm not sure the performance was ever that big of a deal. I think @johnmyleswhite did some benchmarking across various settings and put all the code/results here.

One reason for doing a byte is that it allows us to generalize Arrays of isbits Unions to beyond just Vector{Union{T, Void}}, it can handle any number of isbits union types and performance will be similar.

@nalimilan
Copy link
Member

I think what John found was that (at the time at least) BitArray was somewhat slower, which is why NullableArray uses Vector{Bool} contrary to DataArray. Maybe there are cases where BitArray could be faster than Vector{Bool}, but in general Vector{Bool} should be quite efficient; its main drawback is the higher memory use IMO (though it's still reasonable, with a 12,5% overhead for a Float64 variable).

If BitArray really turns out to be more efficient, it might be possible to use it for Unions of two types. Would need to ask @vtjnash.

@davidanthoff
Copy link
Author

davidanthoff commented Jul 6, 2017

I talked with @vtjnash about this at juliacon, and I think his take was that a BitArray might be a little bit faster for some operations and a little bit slower for others. My sense was that he didn't think it would make much of a difference.

One thing I've been looking at lately is the whole Apache Arrow initiative. My sense is that they essentially are defining an in memory layout for tabular data and that they hope to have a whole set of libraries/tools/ecosystem that can operate on the same in-memory representation of tabular data, so that one can interop in this whole ecosystem with zero array copy operations. I'm not sure how important this is for the julia ecosystem, but there seems a lot of momentum behind that effort, and if the missingness mask in julia was a bit array we could probably interop with the whole apach array ecosystem without ever making copies of the arrays that hold the data in a table like structure. I guess my sense from reading up on their stuff is that the effort is a little bit like creating a BLAS/LAPACK for tabular data, i.e. something where there is a standard memory layout for tabular data, and then algorithms on these data structures could be shared between different programming environments. @quinnj, you looked at arrow as well, right? Does what I describe here square with your understanding of that effort?

@davidanthoff
Copy link
Author

Another consideration that might be relevant here is AVX-512, right? I don't know much about it, but aren't there new instructions that take a bitmask and enable one to e.g. vectorize conditional code based on that? I'm more or less copy-pasting buzzwords from here, I certainly haven't looked into this in depth. But at least at this superficial level it seems worth to investigate if these AVX-512 instructions that take bitmasks might be useful in speeding up some algorithms on vectors with missing values.

@nalimilan
Copy link
Member

I don't know about AVX512, but for most basic arithmetic operations (which are the ones which can benefit the most from vectorization), we can just apply the operation to all values (both present and missing), as long as they never throw an error, and only use the ones we want in the end. This is what null_safe_op was used for.

@quinnj
Copy link
Member

quinnj commented Sep 27, 2017

So to wrap this issue up, the representation in Base will use a bytemask for isbits Union arrays and type fields, which generalizes better for Unions of > 2 Union elements. It's pretty trivial to expand/compress between bytemask-bitmask, so the interop w/ Apache Arrow will be seamless.

More specifically, Arrow indeed aims to define a common in-memory layout for tabular data, while also defining a "transfer protocol", i.e. how those in-memory structures should serialize/deserialize over the wire so that processes, remote or IPC, can leverage the common layouts. In short, in a Julia Arrow.jl package, we could just use native Vector{Union{T, Null}} arrays, and in the serialize/deserialize steps, do the byte-to-bit transformations.

@quinnj quinnj closed this as completed Sep 27, 2017
@davidanthoff
Copy link
Author

Isn't the whole idea of Arrow that things don't get serialized/deserialized? My sense is that Wes' design is all about a zero-copy world, which we won't have with the Vector{Union{T,Null}} story. Now, it might well be that the step of just copying the missing mask is not a biggy, I just don't know, but the rest of the Arrow ecosystem seems to very much go in a way that doesn't require any copies at all.

@quinnj
Copy link
Member

quinnj commented Sep 28, 2017

There's always going to be a serialization/deserialization step, because you can't have objects exactly represent Arrow memory, at least in a way that you could do the equivalent of memcpy(pointer_from_obj_ref(obj), sizeof(obj)), though it's notable that you could get pretty close in Julia. I tried to express that Arrow also includes an explicit messaging protocol where you essentially define your own Base.write(io::IO, ::ArrowTable) that writes the exact Arrow bytes to the wire according to the spec. Obviously it's to your advantage to make your object format as close and as zero-copy as possible to the Arrow spec, but I wouldn't be surprised if other implementations have even more "tweaks" they have to make than a simple byte->bitmask translation.

@nalimilan
Copy link
Member

If we really want complete layout compatibility with Arrow, we could see whether it would make sense for Julia to store Array{Union{...}} objects with a BitArray the type tag instead of an Array{UInt8} when the union contains only two (concrete) types? That kind of optimization would make sense (though there are tradeoffs regarding memory use vs. speed, and depending on operations...). For Julia itself, the advantage of using a bitmap would be that it would be 8× faster to fill an array with nulls, which would be useful it that's the recommended constructor (cf. discussion at JuliaLang/julia#23721). For the data ecosystem, it would also make sense because not only Arrow/Feather, but also e.g. PostgreSQL and Pandas2 use a bitmap to store missing values.

However I agree it's not the end of the world if Julia keeps using an Array{UInt8}. We just need to wonder whether this is part of the official API, i.e. whether it can be modified after 1.0 if we want to apply this optimization. If it will be set in stone in 1.0, better consider it seriously. We could also make it explicit that the exact implementation might change, and provide high-level functions to access the underlying data structures without depending on their exact format: for example, if you create an Arrow-compatible bitmap using convert(BitArray, typetag(x::Array{Union{T, Null}})), the type of the type tag array can change in a future Julia version and convert will handle it transparently.

@davidanthoff
Copy link
Author

Yeah, I think that all sounds really good. I'm certainly not convinced that using bitmaps is the way to go, I'm just genuinely unsure what the right thing to do is on that front, so keeping our options sounds like an important thing.

@vtjnash
Copy link

vtjnash commented Sep 28, 2017

I looked a bit into Arrow compatibility, but their format is designed to be immutable. That's a non-starter for Julia's built-in array type.

advantage of using a bitmap would be that it would be 8× faster to fill an array with nulls

I haven't benchmarked this, but this sounds flat wrong to me. The main item I take issue with is that filling the array is not 8x faster and nor are you aren't saving 8x memory. At most, you're saving a bit less than 2x memory (for Nullable{UInt8}), and usually you're saving a bit less than 10% (7 bits out of 72 for Nullable{Float64}). This might even be OK if all you ever wanted to do with data was run opaque SIMD kernels on it (where the newest chips have some new instructions to handle this format). But for any other algorithm (random access, ARM or older AMD64 hardware, or even just writing to the array), you'll pay a hefty performance penalty to swizzle the bits.

@nalimilan
Copy link
Member

With 8×, I just referred to the fact that to fill an uninitialized Array{Union{T, Null}} with null, zeroing the type tag array is enough, so making it 8× smaller is 8× more efficient. Therefore the overhead of using an array of nulls rather than an uninitialized array is lower when using a bitmask. I don't disagree the gain is much less noticeable in terms of total memory use, nor that performance can even be lower in general.

@vtjnash
Copy link

vtjnash commented Sep 28, 2017

that to fill an uninitialized Array{Union{T, Null}} with null, zeroing the type tag array is enough, so making it 8× smaller is 8× more efficient

Still not entirely true, since this implies the Array is subsequently unused.

@nalimilan
Copy link
Member

Still not entirely true, since this implies the Array is subsequently unused.

I was thinking in the context of the overhead involved by forcing one to fill an array with nulls on construction rather than allocating an uninitialized array (JuliaLang/julia#23721). But let's not divert this thread too much as it's not really related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants