Bump PosLen capacity up to 80 bits by quinnj · Pull Request #98 · JuliaData/Parsers.jl

quinnj · 2021-10-22T06:12:01Z

cc: @nickrobinson251 @bkamins @nalimilan

I'm exploring solutions to JuliaData/CSV.jl#935. If you look at the first commit ("code"), I have 2 PosLen types, normal PosLen with 64-bits, and PosLen80, with 80-bits. As I started to look into how we would use that in CSV.jl; it got real messy real fast. We'd have to allow switching based on PosLen type and pass it through everywhere and it would be a mess to ensure performance stays correct throughout. I'm not saying it's impossible, just some real non-trivial work. There are additional complications because PosLenString/PosLenStringVector are hard-coded for PosLen right now, so we'd either have to make them parameterized on AbstractPosLen subtype, or not allow stringtype=PosLenString to avoid the mess there.

Alternatively, in the 2nd commit, I redefine the existing PosLen to have 80 bits, which results in maximum pos of ~70TB, and maximum single cell length of ~4GB. There are only 2-3 failing tests in Parsers.jl with this change, mainly from the removal of constants that are referenced in tests. I also ran CSV.jl's tests and with some minor changes to use of internal Parsers.jl consts, the tests also pass there. So the question is: are we ok w/ making the PosLen size go from 64-bits to 80-bits everywhere? Obviously that will result in more memory in the stringtype=PosLenString case, but surprisingly the impact would be pretty minimal otherwise. We use PosLen in several places during the "detection" code, but that's only looking at a small sample of rows here and there or parsing column names, so unlikely to make a noticeable difference in memory usage.

Why 80 bits? It seemed like the smallest increase that gives us the largest boost in pos/len values. Note the extra 16 bits vs. current 64-bit PosLen are allocated with 12 bits to len and only 4 to pos. pos already had a max size of around ~4TB, which seems like a pretty reasonable maximum already. With 12 extra bits for the len, we go from the max single cell size of ~1MB to ~4GB, which also seems like a pretty generous maximum. 80 bits also seems pretty reasonable from an alignment standpoint, since it's modulo 16 at least, which I believe is the default alignment value for Julia arrays/strings.

Anyway, I'm going to let this simmer in my mind for a bit and mull things over, but I'm leaning towards going forward with it.

bkamins · 2021-10-22T07:30:42Z

I think switching to 80 is OK.

nalimilan · 2021-10-23T13:32:48Z

Sounds reasonable, but I'd double-check the alignment issues, as AFAICT a Vector of 80-bit values takes as much space as a Vector of 128-bit values:

julia> sizeof(fill((1, Int16(1)), 1000))/1000
16.0

julia> sizeof(fill((1, 1), 1000))/1000
16.0

Also FWIW it seems that arrays are 64-bit aligned when they get large.

quinnj · 2021-10-26T03:36:47Z

Yeah, upon further investigation, there are some very foundational alignment requirements, even for primitive types, that mean that PosLen will always take up 128-bits. Hmmmm, I'll have to think about this a little more; unfortunately, 128-bits means we effectively "waste" quite a lot of bits for a, IMO, rare use-case. Perhaps there's an alternative solution in CSV.jl where we can avoid using PosLen to parse large strings at the user's request.

quinnj added 2 commits October 21, 2021 23:18

code

3b5cf62

Single PosLen w/ 80-bit capacity

0da4a2d

quinnj mentioned this pull request Oct 22, 2021

Unable to parse really long CSV cell (breaks Parsers.jl) JuliaData/CSV.jl#935

Open

nickrobinson251 mentioned this pull request Jun 20, 2022

Parsing fails with long strings JuliaData/CSV.jl#1009

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump PosLen capacity up to 80 bits#98

Bump PosLen capacity up to 80 bits#98
quinnj wants to merge 2 commits intomainfrom
jq/poslen80

quinnj commented Oct 22, 2021

Uh oh!

bkamins commented Oct 22, 2021

Uh oh!

nalimilan commented Oct 23, 2021

Uh oh!

quinnj commented Oct 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

quinnj commented Oct 22, 2021

Uh oh!

bkamins commented Oct 22, 2021

Uh oh!

nalimilan commented Oct 23, 2021

Uh oh!

quinnj commented Oct 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants