What would the memory layout of Array{Union{T,Null},N} look like? #10
Jameson of course will know more given that he's the mad scientist behind the optimizations, but my understanding is that the memory layout will be like two arrays, one with the raw data and one of |
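To make the "two arrays" picture concrete, here is a minimal sketch of that kind of layout. It is an illustration only: the `MaskedVector` type, the `getvalue` helper, and the tag encoding (`0x00` for a present value, `0x01` for a missing one) are all hypothetical and are not how Base implements this.

```julia
# Hypothetical illustration only; not the implementation in Base.
struct MaskedVector{T}
    values::Vector{T}    # raw data, one slot per element (contents are arbitrary where missing)
    tags::Vector{UInt8}  # one byte per element; assumed encoding: 0x00 = present, 0x01 = missing
end

# Return the stored value, or `nothing` when the tag marks the slot as missing.
getvalue(v::MaskedVector, i::Integer) = v.tags[i] == 0x00 ? v.values[i] : nothing

v = MaskedVector([10, 0, 30], UInt8[0x00, 0x01, 0x00])
getvalue(v, 1)  # the stored Int
getvalue(v, 2)  # nothing, because the tag says "missing"
```

The point of the sketch is only that the data stays densely packed as a plain vector of `T`, with the per-element tag stored separately.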
Also, I guess the |
I just looked around a bit more how other folks are handling this. Pandas 2.0 seems to move to a bit mask (see here), I think most SQL databases also use a bit mask and Apache Arrow also seems to use bit masks. So all of these use 1 bit, not 1 byte, for their mask. I did find some very old comment in the NullableArrays issues/PRs that using a Does anyone have more insight into that question in general? In particular, do we have some evidence that using a byte instead of a bit for the mask per value will allow us to eventually get similar performance to all the other packages on different platforms that use bits for their masks? |
I'm not sure the performance was ever that big of a deal. I think @johnmyleswhite did some benchmarking across various settings and put all the code/results here. One reason for doing a byte is that it allows us to generalize Arrays of isbits Unions to beyond just |
I think what John found was that (at the time at least) If |
I talked with @vtjnash about this at JuliaCon, and I think his take was that a One thing I've been looking at lately is the whole Apache Arrow initiative. My sense is that they essentially are defining an in-memory layout for tabular data and that they hope to have a whole set of libraries/tools/ecosystem that can operate on the same in-memory representation of tabular data, so that one can interop in this whole ecosystem with zero array copy operations. I'm not sure how important this is for the Julia ecosystem, but there seems to be a lot of momentum behind that effort, and if the missingness mask in Julia were a bit array we could probably interop with the whole Apache Arrow ecosystem without ever making copies of the arrays that hold the data in a table-like structure. I guess my sense from reading up on their stuff is that the effort is a little bit like creating a BLAS/LAPACK for tabular data, i.e. something where there is a standard memory layout for tabular data, and then algorithms on these data structures could be shared between different programming environments. @quinnj, you looked at Arrow as well, right? Does what I describe here square with your understanding of that effort? |
Another consideration that might be relevant here is AVX-512, right? I don't know much about it, but aren't there new instructions that take a bitmask and enable one to e.g. vectorize conditional code based on that? I'm more or less copy-pasting buzzwords from here, I certainly haven't looked into this in depth. But at least at this superficial level it seems worth investigating whether these AVX-512 instructions that take bitmasks might be useful in speeding up some algorithms on vectors with missing values. |
I don't know about AVX512, but for most basic arithmetic operations (which are the ones which can benefit the most from vectorization), we can just apply the operation to all values (both present and missing), as long as they never throw an error, and only use the ones we want in the end. This is what |
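A minimal sketch of the "compute on every slot, then keep only what you want" idea, assuming the values and the missingness mask live in two separate vectors. The function name `masked_add` and the convention that `true` in the mask means "missing" are assumptions made for this example.

```julia
# Hypothetical helper: add two masked vectors elementwise without branching.
# Mask convention assumed here: `true` means the element is missing.
function masked_add(a::Vector{Float64}, amask::Vector{Bool},
                    b::Vector{Float64}, bmask::Vector{Bool})
    vals = a .+ b          # computed for every slot, including missing ones
    mask = amask .| bmask  # the result is missing if either input is missing
    return vals, mask
end

vals, mask = masked_add([1.0, 2.0, 3.0], [false, true, false],
                        [10.0, 20.0, 30.0], [false, false, true])
```

Since floating-point addition never throws, the values computed in the missing slots are harmless; they are simply ignored wherever `mask` is `true`.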
So to wrap this issue up, the representation in Base will use a bytemask for isbits Union arrays and type fields, which generalizes better for Unions of > 2 Union elements. It's pretty trivial to expand/compress between bytemask-bitmask, so the interop w/ Apache Arrow will be seamless. More specifically, Arrow indeed aims to define a common in-memory layout for tabular data, while also defining a "transfer protocol", i.e. how those in-memory structures should serialize/deserialize over the wire so that processes, remote or IPC, can leverage the common layouts. In short, in a Julia Arrow.jl package, we could just use native |
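As a rough sketch of the expand/compress step between a bytemask and a bitmask: the function names `pack_bitmask` and `unpack_bitmask` are hypothetical, and the example assumes a byte mask where a nonzero byte means "valid". The bit order (least-significant bit first within each byte, 1 = valid) follows Arrow's validity bitmap convention.

```julia
# Compress a byte mask (one UInt8 per element, nonzero = valid; an assumed
# convention) into a bitmap, least-significant bit first within each byte.
function pack_bitmask(bytes::Vector{UInt8})
    n = length(bytes)
    bits = zeros(UInt8, cld(n, 8))
    for i in 1:n
        if bytes[i] != 0x00
            bits[((i - 1) >> 3) + 1] |= 0x01 << ((i - 1) & 7)
        end
    end
    return bits
end

# Expand a bitmap back into a byte mask of length n (0x01 = valid, 0x00 = not).
function unpack_bitmask(bits::Vector{UInt8}, n::Integer)
    bytes = Vector{UInt8}(undef, n)
    for i in 1:n
        bytes[i] = (bits[((i - 1) >> 3) + 1] >> ((i - 1) & 7)) & 0x01
    end
    return bytes
end

bytes  = UInt8[1, 0, 1, 1, 0, 0, 0, 0, 1]
packed = pack_bitmask(bytes)
```

Both directions are a single pass over the mask, which is why the conversion is cheap compared with copying the data itself, although it is still a copy of the mask.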
Isn't the whole idea of Arrow that things don't get serialized/deserialized? My sense is that Wes' design is all about a zero-copy world, which we won't have with the |
There's always going to be a serialization/deserialization step, because you can't have objects exactly represent Arrow memory, at least in a way that you could do the equivalent of |
If we really want complete layout compatibility with Arrow, we could see whether it would make sense for Julia to store However I agree it's not the end of the world if Julia keeps using an |
Yeah, I think that all sounds really good. I'm certainly not convinced that using bitmaps is the way to go, I'm just genuinely unsure what the right thing to do is on that front, so keeping our options open sounds like an important thing. |
I looked a bit into Arrow compatibility, but their format is designed to be immutable. That's a non-starter for Julia's built-in array type.
I haven't benchmarked this, but this sounds flat wrong to me. The main item I take issue with is that filling the array is not 8x faster, nor are you saving 8x memory. At most, you're saving a bit less than 2x memory (for |
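The "a bit less than 2x" figure can be checked with simple arithmetic on total storage (data plus mask). This is a sketch with hypothetical helper names; it assumes 1-byte elements as the best case for a bitmask, and ignores array headers and alignment padding.

```julia
# Total bytes for n elements of s bytes each, plus the mask.
bytemask_bytes(n, s) = n * s + n          # one mask byte per element
bitmask_bytes(n, s)  = n * s + cld(n, 8)  # one mask bit per element, rounded up to bytes

n = 1024
ratio_1byte = bytemask_bytes(n, 1) / bitmask_bytes(n, 1)  # 1-byte elements: 2048 / 1152
ratio_8byte = bytemask_bytes(n, 8) / bitmask_bytes(n, 8)  # 8-byte elements: 9216 / 8320
```

For 1-byte elements the ratio is 16/9, roughly 1.78, and it shrinks toward 1 as the element size grows, since the mask becomes a smaller share of the total.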
With 8×, I just referred to the fact that to fill an uninitialized |
Still not entirely true, since this implies the Array is subsequently unused. |
I was thinking in the context of the overhead involved by forcing one to fill an array with nulls on construction rather than allocating an uninitialized array (JuliaLang/julia#23721). But let's not divert this thread too much as it's not really related. |
I guess this is mostly a question for @vtjnash. I've read somewhere that things would have a similar memory layout as NullableArray does right now. So I assume the values would be stored in essentially something that looks like Array{T,N}? What about the null mask? I've read something about essentially an array of type tags? Would these type tags be some kind of pointer? If yes, I assume it would take the same amount of storage as an Int? Or would this type tag array use one byte per array element, like NullableArray? Or maybe even use one bit per element to indicate whether there is a value or not? The latter doesn't sound like a type tag array to me, but who knows :)

Would the actual values in the type tag correspond to true and false in some sense? I.e. a similar bit pattern as Bool? I guess if the type tags are more like pointers to types, things would be less straightforward, right?

In any case, any information would be great!