Conversation

@siddharthteotia (Contributor) commented Dec 21, 2017

The current implementation of setInitialCapacity() applies a factor of 5 at every level as we descend into a list.

So if the schema is LIST (LIST (LIST (LIST (LIST (LIST (LIST (BIGINT))))))) and we start with an initial capacity of 128, we end up throwing OversizedAllocationException from the BigIntVector: the capacity is multiplied by 5 at every level, so by the time we reach the inner scalar vector that actually stores the data, we are well over the max size limit per vector (1MB).

We saw this problem downstream when we failed to read deeply nested JSON data.

The potential fix is to apply the factor of 5 only when we are down to the leaf vector; while we are still descending through complex/list vectors, the capacity is passed down unchanged.
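
As an illustration of the arithmetic (a plain-Java sketch, not the Arrow API; the class name and constants just mirror the example above):

```java
// Illustration only: how the leaf capacity grows with nesting depth
// under the old scheme vs. the proposed one.
public class CapacityGrowthSketch {
  public static void main(String[] args) {
    final int depth = 7;      // seven LIST levels around the BIGINT leaf
    final long initial = 128; // starting capacity from the example above

    // Old scheme: multiply by 5 at every list level.
    long oldLeafCapacity = initial;
    for (int level = 0; level < depth; level++) {
      oldLeafCapacity *= 5;
    }
    // 128 * 5^7 = 10,000,000 values; at 8 bytes per BIGINT that is
    // ~80 MB for the leaf data buffer -- far past a 1MB per-vector limit.
    System.out.println("old leaf capacity:   " + oldLeafCapacity);

    // Proposed scheme: pass the capacity down unchanged and apply the
    // factor of 5 only once, at the leaf.
    long newLeafCapacity = initial * 5; // 640 values
    System.out.println("fixed leaf capacity: " + newLeafCapacity);
  }
}
```

Applying the factor once at the leaf keeps the initial allocation proportional to what a single record can reasonably hold, regardless of nesting depth.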

cc @jacques-n , @BryanCutler , @icexelloss

@siddharthteotia changed the title ARROW-1943: handle setInitialCapacity for deeply nested lists → ARROW-1943: [JAVA] handle setInitialCapacity for deeply nested lists Dec 21, 2017
@siddharthteotia (Contributor, Author) commented Dec 21, 2017

I thought more about this, and the implemented solution is debatable even though it fixes the OversizedAllocationException problem.

For example, consider the newly added unit test for the schema LIST (LIST (INT)).

Since each position in the top-level vector holds 1 or more lists, the number of offsets in the inner list will always be at least as large as the number of offsets in its parent. This implies there is some factor of increase in capacity as we go down the tree. In the context of the unit test:

The value capacity of the outer list is 2 and that of the inner list is 4, because each position of the outer list holds 2 inner lists; beneath them is an int vector with value capacity 10 that holds the data across all the inner lists.

So doing list.setInitialCapacity(2) -> innerList.setInitialCapacity(2) -> intVector.setInitialCapacity(2 * 5) will require expansion of the offset (and validity) buffers of the inner list.

The question really is whether there is a reasonable way to increase the multiplier as we go down the nested lists. The current patch keeps the capacity the same until we get down to the scalars and then directly applies a multiplier of 5 (see the sketch below).

However, this will potentially require re-allocation of the internal buffers of each inner list vector as the user application writes data into deeply nested lists.
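
For concreteness, here is a minimal sketch of the propagation rule as the patch describes it, using simplified stand-in classes (SketchVector, SketchListVector, SketchIntVector, and REPEAT_FACTOR are hypothetical names, not the real Arrow types):

```java
// Minimal sketch with stand-in classes -- NOT the real Arrow vector types.
abstract class SketchVector {
  abstract void setInitialCapacity(int numRecords);
}

class SketchIntVector extends SketchVector {
  int valueCapacity;

  @Override
  void setInitialCapacity(int numRecords) {
    valueCapacity = numRecords; // leaf stores whatever is handed down
  }
}

class SketchListVector extends SketchVector {
  static final int REPEAT_FACTOR = 5; // hypothetical name for the factor of 5
  SketchVector dataVector;            // inner list or leaf scalar
  int offsetCapacity;

  @Override
  void setInitialCapacity(int numRecords) {
    offsetCapacity = numRecords + 1; // offset buffer needs count + 1 entries
    if (dataVector instanceof SketchListVector) {
      // Still descending through nested lists: pass the capacity down unchanged.
      dataVector.setInitialCapacity(numRecords);
    } else {
      // Reached the leaf scalar: apply the factor of 5 exactly once.
      dataVector.setInitialCapacity(numRecords * REPEAT_FACTOR);
    }
  }
}

class SketchDemo {
  public static void main(String[] args) {
    // Build LIST (LIST (INT)) as in the unit test discussed above.
    SketchIntVector ints = new SketchIntVector();
    SketchListVector inner = new SketchListVector();
    inner.dataVector = ints;
    SketchListVector outer = new SketchListVector();
    outer.dataVector = inner;

    outer.setInitialCapacity(2);
    // inner was handed 2 (unchanged) and ints got 2 * 5 = 10, but the test
    // data actually needs inner capacity 4 -- so the inner list's offset
    // (and validity) buffers still get reallocated during writes.
    System.out.println("int valueCapacity = " + ints.valueCapacity); // 10
  }
}
```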

@siddharthteotia (Contributor, Author) commented

Ping.

@jacques-n (Contributor) commented Dec 22, 2017

I'm +1 on this approach. It may not be perfect, but it is definitely far better than the old approach.

siddharthteotia added a commit to siddharthteotia/arrow that referenced this pull request Dec 22, 2017

Author: siddharth <[email protected]>

Closes apache#1439 from siddharthteotia/ARROW-1943 and squashes the following commits:

d0adbad [siddharth] unit tests
e2f21a8 [siddharth] fix imports
d103436 [siddharth] ARROW-1943: handle setInitialCapacity for deeply nested lists
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025