What would the memory layout of Array{Union{T,Null},N} look like? #10

davidanthoff · 2017-05-17T23:43:32Z

I guess this is mostly a question for @vtjnash. I've read somewhere that things would have a similar memory layout as NullableArray does right now. So I assume the values would be stored in essentially something that looks like Array{T,N}? What about the null mask? I've read something about essentially an array of type tags? Would these type tags be some kind of pointer? If yes, I assume it would take the same amount of storage as an Int? Or would this type tag array use one byte per array element, like NullableArray? Or maybe even use one bit per element to indicate whether there is a value or not? The latter doesn't sound like a type tag array to me, but who knows :)

Would the actual values in the type tag correspond to true and false in some sense? I.e. a similar bit pattern as Bool? I guess if the type tags are more like pointers to types, things would be less straightforward, right?

In any case, any information would be great!

The text was updated successfully, but these errors were encountered:

ararslan · 2017-05-17T23:52:46Z

Jameson of course will know more given that he's the mad scientist behind the optimizations, but my understanding is that the memory layout will be like two arrays, one with the raw data and one of UInt8s that corresponds to the positions in the Union. (Using UInt8 allows selecting from among a larger Union than Bool would.) That's all on the C side though; AFAIK, on the Julia side it will just be an array with a non-concrete element type.

nalimilan · 2017-05-18T11:52:22Z

Also, I guess the UInt8 type tag can be equal to 0 for the first type of the Union, and 1 for the second one, so that for Array{Union{Null, T}} the tag is false for nulls and true for non-nulls. Or maybe another rule for sorting types should be adopted so that order of appearance doesn't matter (e.g. singletons first). Having nulls appear as false would be better than the reverse since this convention is used by other software and binary formats, see JuliaLang/julia#18507.

davidanthoff · 2017-06-20T17:07:30Z

I just looked around a bit more how other folks are handling this. Pandas 2.0 seems to move to a bit mask (see here), I think most SQL databases also use a bit mask and Apache Arrow also seems to use bit masks. So all of these use 1 bit, not 1 byte, for their mask.

I did find some very old comment in the NullableArrays issues/PRs that using a BitArray for the mask was faster for some operations but slower for others than Vector{Bool}, but no further discussion what kind of operations these were. I also saw that there were same changes to BitArray in the meantime, so I guess there is even a potential that the performance problems are less severe now?

Does anyone have more insight into that question in general? In particular, do we have some evidence that using a byte instead of a bit for the mask per value will allow us to eventually get similar performance to all the other packages on different platforms that use bits for their masks?

quinnj · 2017-06-20T17:32:53Z

I'm not sure the performance was ever that big of a deal. I think @johnmyleswhite did some benchmarking across various settings and put all the code/results here.

One reason for doing a byte is that it allows us to generalize Arrays of isbits Unions to beyond just Vector{Union{T, Void}}, it can handle any number of isbits union types and performance will be similar.

nalimilan · 2017-06-20T18:30:02Z

I think what John found was that (at the time at least) BitArray was somewhat slower, which is why NullableArray uses Vector{Bool} contrary to DataArray. Maybe there are cases where BitArray could be faster than Vector{Bool}, but in general Vector{Bool} should be quite efficient; its main drawback is the higher memory use IMO (though it's still reasonable, with a 12,5% overhead for a Float64 variable).

If BitArray really turns out to be more efficient, it might be possible to use it for Unions of two types. Would need to ask @vtjnash.

davidanthoff · 2017-07-06T14:48:14Z

I talked with @vtjnash about this at juliacon, and I think his take was that a BitArray might be a little bit faster for some operations and a little bit slower for others. My sense was that he didn't think it would make much of a difference.

One thing I've been looking at lately is the whole Apache Arrow initiative. My sense is that they essentially are defining an in memory layout for tabular data and that they hope to have a whole set of libraries/tools/ecosystem that can operate on the same in-memory representation of tabular data, so that one can interop in this whole ecosystem with zero array copy operations. I'm not sure how important this is for the julia ecosystem, but there seems a lot of momentum behind that effort, and if the missingness mask in julia was a bit array we could probably interop with the whole apach array ecosystem without ever making copies of the arrays that hold the data in a table like structure. I guess my sense from reading up on their stuff is that the effort is a little bit like creating a BLAS/LAPACK for tabular data, i.e. something where there is a standard memory layout for tabular data, and then algorithms on these data structures could be shared between different programming environments. @quinnj, you looked at arrow as well, right? Does what I describe here square with your understanding of that effort?

davidanthoff · 2017-07-12T08:07:03Z

Another consideration that might be relevant here is AVX-512, right? I don't know much about it, but aren't there new instructions that take a bitmask and enable one to e.g. vectorize conditional code based on that? I'm more or less copy-pasting buzzwords from here, I certainly haven't looked into this in depth. But at least at this superficial level it seems worth to investigate if these AVX-512 instructions that take bitmasks might be useful in speeding up some algorithms on vectors with missing values.

nalimilan · 2017-07-12T13:45:50Z

I don't know about AVX512, but for most basic arithmetic operations (which are the ones which can benefit the most from vectorization), we can just apply the operation to all values (both present and missing), as long as they never throw an error, and only use the ones we want in the end. This is what null_safe_op was used for.

quinnj · 2017-09-27T11:33:14Z

So to wrap this issue up, the representation in Base will use a bytemask for isbits Union arrays and type fields, which generalizes better for Unions of > 2 Union elements. It's pretty trivial to expand/compress between bytemask-bitmask, so the interop w/ Apache Arrow will be seamless.

More specifically, Arrow indeed aims to define a common in-memory layout for tabular data, while also defining a "transfer protocol", i.e. how those in-memory structures should serialize/deserialize over the wire so that processes, remote or IPC, can leverage the common layouts. In short, in a Julia Arrow.jl package, we could just use native Vector{Union{T, Null}} arrays, and in the serialize/deserialize steps, do the byte-to-bit transformations.

davidanthoff · 2017-09-28T00:09:18Z

Isn't the whole idea of Arrow that things don't get serialized/deserialized? My sense is that Wes' design is all about a zero-copy world, which we won't have with the Vector{Union{T,Null}} story. Now, it might well be that the step of just copying the missing mask is not a biggy, I just don't know, but the rest of the Arrow ecosystem seems to very much go in a way that doesn't require any copies at all.

quinnj · 2017-09-28T00:16:06Z

There's always going to be a serialization/deserialization step, because you can't have objects exactly represent Arrow memory, at least in a way that you could do the equivalent of memcpy(pointer_from_obj_ref(obj), sizeof(obj)), though it's notable that you could get pretty close in Julia. I tried to express that Arrow also includes an explicit messaging protocol where you essentially define your own Base.write(io::IO, ::ArrowTable) that writes the exact Arrow bytes to the wire according to the spec. Obviously it's to your advantage to make your object format as close and as zero-copy as possible to the Arrow spec, but I wouldn't be surprised if other implementations have even more "tweaks" they have to make than a simple byte->bitmask translation.

nalimilan · 2017-09-28T08:08:31Z

If we really want complete layout compatibility with Arrow, we could see whether it would make sense for Julia to store Array{Union{...}} objects with a BitArray the type tag instead of an Array{UInt8} when the union contains only two (concrete) types? That kind of optimization would make sense (though there are tradeoffs regarding memory use vs. speed, and depending on operations...). For Julia itself, the advantage of using a bitmap would be that it would be 8× faster to fill an array with nulls, which would be useful it that's the recommended constructor (cf. discussion at JuliaLang/julia#23721). For the data ecosystem, it would also make sense because not only Arrow/Feather, but also e.g. PostgreSQL and Pandas2 use a bitmap to store missing values.

However I agree it's not the end of the world if Julia keeps using an Array{UInt8}. We just need to wonder whether this is part of the official API, i.e. whether it can be modified after 1.0 if we want to apply this optimization. If it will be set in stone in 1.0, better consider it seriously. We could also make it explicit that the exact implementation might change, and provide high-level functions to access the underlying data structures without depending on their exact format: for example, if you create an Arrow-compatible bitmap using convert(BitArray, typetag(x::Array{Union{T, Null}})), the type of the type tag array can change in a future Julia version and convert will handle it transparently.

davidanthoff · 2017-09-28T16:15:31Z

Yeah, I think that all sounds really good. I'm certainly not convinced that using bitmaps is the way to go, I'm just genuinely unsure what the right thing to do is on that front, so keeping our options sounds like an important thing.

vtjnash · 2017-09-28T17:41:38Z

I looked a bit into Arrow compatibility, but their format is designed to be immutable. That's a non-starter for Julia's built-in array type.

advantage of using a bitmap would be that it would be 8× faster to fill an array with nulls

I haven't benchmarked this, but this sounds flat wrong to me. The main item I take issue with is that filling the array is not 8x faster and nor are you aren't saving 8x memory. At most, you're saving a bit less than 2x memory (for Nullable{UInt8}), and usually you're saving a bit less than 10% (7 bits out of 72 for Nullable{Float64}). This might even be OK if all you ever wanted to do with data was run opaque SIMD kernels on it (where the newest chips have some new instructions to handle this format). But for any other algorithm (random access, ARM or older AMD64 hardware, or even just writing to the array), you'll pay a hefty performance penalty to swizzle the bits.

nalimilan · 2017-09-28T18:22:38Z

With 8×, I just referred to the fact that to fill an uninitialized Array{Union{T, Null}} with null, zeroing the type tag array is enough, so making it 8× smaller is 8× more efficient. Therefore the overhead of using an array of nulls rather than an uninitialized array is lower when using a bitmask. I don't disagree the gain is much less noticeable in terms of total memory use, nor that performance can even be lower in general.

vtjnash · 2017-09-28T19:26:21Z

that to fill an uninitialized Array{Union{T, Null}} with null, zeroing the type tag array is enough, so making it 8× smaller is 8× more efficient

Still not entirely true, since this implies the Array is subsequently unused.

nalimilan · 2017-09-28T19:44:07Z

Still not entirely true, since this implies the Array is subsequently unused.

I was thinking in the context of the overhead involved by forcing one to fill an array with nulls on construction rather than allocating an uninitialized array (JuliaLang/julia#23721). But let's not divert this thread too much as it's not really related.

quinnj closed this as completed Sep 27, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What would the memory layout of Array{Union{T,Null},N} look like? #10

What would the memory layout of Array{Union{T,Null},N} look like? #10

davidanthoff commented May 17, 2017

ararslan commented May 17, 2017

nalimilan commented May 18, 2017

davidanthoff commented Jun 20, 2017

quinnj commented Jun 20, 2017

nalimilan commented Jun 20, 2017

davidanthoff commented Jul 6, 2017 •

edited

Loading

davidanthoff commented Jul 12, 2017

nalimilan commented Jul 12, 2017

quinnj commented Sep 27, 2017

davidanthoff commented Sep 28, 2017

quinnj commented Sep 28, 2017

nalimilan commented Sep 28, 2017

davidanthoff commented Sep 28, 2017

vtjnash commented Sep 28, 2017 •

edited

Loading

nalimilan commented Sep 28, 2017

vtjnash commented Sep 28, 2017

nalimilan commented Sep 28, 2017

What would the memory layout of Array{Union{T,Null},N} look like? #10

What would the memory layout of Array{Union{T,Null},N} look like? #10

Comments

davidanthoff commented May 17, 2017

ararslan commented May 17, 2017

nalimilan commented May 18, 2017

davidanthoff commented Jun 20, 2017

quinnj commented Jun 20, 2017

nalimilan commented Jun 20, 2017

davidanthoff commented Jul 6, 2017 • edited Loading

davidanthoff commented Jul 12, 2017

nalimilan commented Jul 12, 2017

quinnj commented Sep 27, 2017

davidanthoff commented Sep 28, 2017

quinnj commented Sep 28, 2017

nalimilan commented Sep 28, 2017

davidanthoff commented Sep 28, 2017

vtjnash commented Sep 28, 2017 • edited Loading

nalimilan commented Sep 28, 2017

vtjnash commented Sep 28, 2017

nalimilan commented Sep 28, 2017

davidanthoff commented Jul 6, 2017 •

edited

Loading

vtjnash commented Sep 28, 2017 •

edited

Loading