Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Sep 29, 2020

When a dictionary changes from the previous batch, it is emitted again in the IPC stream.
If this happens when writing the IPC file format, an error is returned.

@pitrou
Copy link
Member Author

pitrou commented Sep 29, 2020

@wesm This is a first draft. It is functional (i.e. fixes the underlying issue) but doesn't try to detect deltas.

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will populate last_dictionaries_ entries even when the length of the dictionary is zero? I recall there was some discussion about empty dictionaries (and whether they need to be written at all) at the start of a stream

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we can skip zero-length dictionaries explicitly, but won't it confuse the IPC reader?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow up issue

@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from c414eec to cd759d0 Compare September 30, 2020 14:06
@pitrou
Copy link
Member Author

pitrou commented Sep 30, 2020

I still need to add tests for this (I'll do that once #8309 is merged). Detecting and writing delta dictionaries is not implemented, but that's a feature, not a bug, so it may also wait for another PR.
(also, it will have to be disabled by default, for compatibility with other IPC readers)

@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from cd759d0 to 3e5e6df Compare October 1, 2020 14:51
@pitrou pitrou marked this pull request as ready for review October 1, 2020 14:51
@pitrou
Copy link
Member Author

pitrou commented Oct 5, 2020

@wesm Do you want to review this again? Otherwise perhaps @bkietz ?

pitrou added 3 commits October 5, 2020 11:29
When a dictionary changes from the previous batch, we should re-emit it.
Move tensor tests in separate file
@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from 3e5e6df to 4e5dac1 Compare October 5, 2020 09:32
@wesm
Copy link
Member

wesm commented Oct 5, 2020

I can take a final look this morning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow up issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants