Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support Apache Arrow as a normalized data representation #2115

Merged
merged 22 commits into from
Aug 5, 2024
Merged

Conversation

mbostock
Copy link
Member

@mbostock mbostock commented Jul 28, 2024

This allows an Apache Arrow Table to be used directly as the normalized representation of mark data (computed in mark.initialize), rather than requiring conversion to an array. This requires patching a few places to support both types, and then finally in valueof we materialize the columns as arrays using table.getChild and vector.toArray, which should be as efficient as possible.

It’s a bit tricky without type checking to make sure we haven’t missed things, but it doesn’t seem too bad, and we should be able to find the places with tests.

This also uncovered a number of preexisting problems with BigInt coercion (since Arrow represents dates as BigInt).

Fixes #191.
Fixes observablehq/framework#1376.
Supersedes #2030.
Supersedes #2096.

@mbostock mbostock requested a review from Fil July 28, 2024 13:54
src/mark.d.ts Show resolved Hide resolved
@@ -70,7 +92,7 @@ export function percentile(reduce) {

// If the values are specified as a typed array, no coercion is required.
export function coerceNumbers(values) {
return values instanceof TypedArray ? values : map(values, coerceNumber, Float64Array);
return isNumberArray(values) ? values : map(values, coerceNumber, Float64Array);
Copy link
Member Author

@mbostock mbostock Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This fixes coerceNumbers for BigInt arrays which is important because Apache Arrow uses BigInt64Array for dates… somewhat unnecessarily in my opinion when Float64Array would work better in practice.)

@mbostock mbostock marked this pull request as ready for review July 28, 2024 17:28
@Fil
Copy link
Contributor

Fil commented Aug 5, 2024

I think this works! I will close #2030 because the description and discussion here is much better.

@Fil Fil changed the base branch from fil/arrow to main August 5, 2024 02:32
Fil added 2 commits August 5, 2024 04:34
@Fil
Copy link
Contributor

Fil commented Aug 5, 2024

If we convert on the fly all the eligible data to Arrow tables with tableFromJSON, and run the tests, we see a few things:

  • most tests pass
  • a few tests crash when they try to call Date methods directly on the values (which are not dates anymore but numbers)
  • a few tests result in different charts — mostly because invalid values are converted to 0 by tableFromJSON
  • textOverflowClip &c crash because they use a custom transform which calls data.flatMap.

Nothing to see there.

More interestingly, we see crashes with:

  • differenceY &c, and findArrow: because Plot.find in group.js reads data[i] directly
  • musicRevenueCustomOrder crashes because its custom order function gets passed undefined values (error coming from orderComparator in stack.js reading data[i] directly)

… and that is all!

So I guess these are two places that need an additional arrayify. [FIXED]

code for the global conversion

in options.js, replace dataify by this version:

import {tableFromJSON} from "apache-arrow";

// Like arrayify, but also allows data to be an Apache Arrow Table.
export function dataify(data) {
  return isArrowTable(data)
    ? data
    : Array.isArray((data = arrayify(data))) && data.length > 0 && isObject(data[0]) && !data[0].type
    ? tableFromJSON(data)
    : data;
}

@Fil
Copy link
Contributor

Fil commented Aug 5, 2024

We'll have to review all the remaining occurrences of data[:

@domoritz
Copy link
Contributor

domoritz commented Aug 5, 2024

a few tests crash when they try to call Date methods directly on the values (which are not dates anymore but numbers)

It would probably be a good to have some kind of compatibility mode for arrow that returns dates instead of numbers. Jeff suggested this before.

src/transforms/group.js Outdated Show resolved Hide resolved
src/transforms/stack.js Outdated Show resolved Hide resolved
@mbostock
Copy link
Member Author

mbostock commented Aug 5, 2024

Great job testing and finding these remaining incompatibilities, @Fil! Thank you.

@mbostock
Copy link
Member Author

mbostock commented Aug 5, 2024

I reviewed the other places you identified and optimized the handling of Arrow tables. PTAL!

@mbostock mbostock enabled auto-merge (squash) August 5, 2024 14:13
@Fil
Copy link
Contributor

Fil commented Aug 5, 2024

I've added a (convoluted) test for the code in tree.js, and a fix. I've also added tests for some of the other places we identified. Note that the comparator sort was not tested at all (with arrays nor arrow) until now!

Fil and others added 2 commits August 5, 2024 11:12
@mbostock
Copy link
Member Author

mbostock commented Aug 5, 2024

Looks good to me! Please approve if you agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Display Apache Arrow dates in Inputs.table (and Plot) Optimize field shorthand for columnar data?
3 participants