Member
I think switching to 80 is OK.
Member
Sounds reasonable, but I'd double-check the alignment issues, as AFAICT an `(Int, Int16)` tuple takes up as much space per element as an `(Int, Int)` tuple:

    julia> sizeof(fill((1, Int16(1)), 1000))/1000
    16.0

    julia> sizeof(fill((1, 1), 1000))/1000
    16.0

Also FWIW it seems that arrays are 64-bit aligned when they get large.
Member
Author
Yeah, upon further investigation, there are some very foundational alignment requirements, even for primitive types, that mean an 80-bit `PosLen` will always take up 128 bits. Hmmmm, I'll have to think about this a little more; unfortunately, 128 bits means we effectively "waste" quite a lot of bits for what is, IMO, a rare use case. Perhaps there's an alternative solution in CSV.jl where we can avoid using `PosLen` to parse large strings at the user's request.
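For anyone who wants to reproduce the padding effect discussed in this thread, here is a minimal sketch, assuming a 64-bit platform. `Payload80` is a hypothetical stand-in that carries 80 bits of data in two fields; it is not the actual `PosLen` definition from Parsers.jl.

```julia
# Hypothetical stand-in for an 80-bit payload; NOT the actual PosLen definition.
struct Payload80
    lo::UInt64   # 64 bits
    hi::UInt16   # 16 bits, for 80 bits of payload in total
end

# Bytes of actual data carried by the two fields.
payload_bytes = sizeof(UInt64) + sizeof(UInt16)

# Bytes the struct occupies once alignment padding is included.
stored_bytes = sizeof(Payload80)

# Bytes used per element when the struct is stored inline in a Vector.
per_element = sizeof(Vector{Payload80}(undef, 1000)) ÷ 1000

println("payload: ", payload_bytes, " bytes, stored: ", stored_bytes,
        " bytes, per array element: ", per_element, " bytes")
```

On a typical 64-bit machine the stored size should match the 16 bytes per element seen in the REPL output above, i.e. the same footprint as a `UInt128`.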
cc: @nickrobinson251 @bkamins @nalimilan
I'm exploring solutions to JuliaData/CSV.jl#935. If you look at the first commit ("code"), I have 2 `PosLen` types: the normal `PosLen` with 64 bits, and `PosLen80` with 80 bits. As I started to look into how we would use that in CSV.jl, it got real messy real fast. We'd have to allow switching based on the `PosLen` type and pass it through everywhere, and it would be a mess to ensure performance holds up throughout. I'm not saying it's impossible, just some real non-trivial work. There are additional complications because `PosLenString`/`PosLenStringVector` are hard-coded for `PosLen` right now, so we'd either have to make them parameterized on an `AbstractPosLen` subtype, or not allow `stringtype=PosLenString` to avoid the mess there.

Alternatively, in the 2nd commit, I redefine the existing `PosLen` to have 80 bits, which results in a maximum `pos` of ~70TB and a maximum single cell length of ~4GB. There are only 2-3 failing tests in Parsers.jl with this change, mainly from the removal of constants that are referenced in tests. I also ran CSV.jl's tests, and with some minor changes to the use of internal Parsers.jl consts, the tests also pass there. So the question is: are we ok w/ making the `PosLen` size go from 64 bits to 80 bits everywhere? Obviously that will result in more memory in the `stringtype=PosLenString` case, but surprisingly the impact would be pretty minimal otherwise. We use `PosLen` in several places during the "detection" code, but that's only looking at a small sample of rows here and there or parsing column names, so it's unlikely to make a noticeable difference in memory usage.

Why 80 bits? It seemed like the smallest increase that gives us the largest boost in `pos`/`len` values. Note that the extra 16 bits vs. the current 64-bit `PosLen` are allocated with 12 bits to `len` and only 4 to `pos`. `pos` already had a max size of around 4TB, which seems like a pretty reasonable maximum already. With 12 extra bits for the `len`, we go from a max single cell size of ~1MB to ~4GB, which also seems like a pretty generous maximum. 80 bits also seems pretty reasonable from an alignment standpoint, since it's at least a multiple of 16, which I believe is the default alignment value for Julia arrays/strings.

Anyway, I'm going to let this simmer in my mind for a bit and mull things over, but I'm leaning towards going forward with it.
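To make the bit budget above concrete, here is a rough sketch of the arithmetic and of packing `pos`/`len` into one integer. The field widths are assumptions inferred from the ~4TB and ~1MB limits quoted above (42 and 20 bits), not read from the Parsers.jl source, and the `pack`/`unpack_*` helpers are hypothetical names used only for illustration. `UInt128` is just a convenient carrier here; any remaining bits of the 80 are left untouched.

```julia
# Assumed widths, inferred from the limits quoted in the description.
const OLD_POS_BITS = 42                   # 2^42 bytes is roughly 4TB
const OLD_LEN_BITS = 20                   # 2^20 bytes is roughly 1MB
const POS_BITS = OLD_POS_BITS + 4         # 4 of the 16 extra bits go to pos
const LEN_BITS = OLD_LEN_BITS + 12        # 12 of the 16 extra bits go to len

const POS_MASK = (UInt128(1) << POS_BITS) - 1
const LEN_MASK = (UInt128(1) << LEN_BITS) - 1

# Hypothetical helper: store `len` in the low bits and `pos` directly above it.
function pack(pos::Integer, len::Integer)
    0 <= pos <= POS_MASK || throw(ArgumentError("pos out of range"))
    0 <= len <= LEN_MASK || throw(ArgumentError("len out of range"))
    return (UInt128(pos) << LEN_BITS) | UInt128(len)
end

# Hypothetical helpers: recover each field with a shift and a mask.
unpack_pos(x::UInt128) = Int64((x >> LEN_BITS) & POS_MASK)
unpack_len(x::UInt128) = Int64(x & LEN_MASK)

max_pos = Int64(POS_MASK)   # largest representable position
max_len = Int64(LEN_MASK)   # largest representable cell length

println("max pos ≈ ", round(max_pos / 1e12, digits=1), " TB")
println("max len ≈ ", round(max_len / 1e9, digits=1), " GB")

x = pack(123_456_789_012, 3_000_000_000)
println(unpack_pos(x), " ", unpack_len(x))
```

Under these assumed widths the two limits work out to roughly 70TB and 4GB, in line with the numbers in the description.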