ARROW-2828: [JS] Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies #3290
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3290      +/-   ##
==========================================
- Coverage   88.57%   88.46%    -0.11%
==========================================
  Files         540      631       +91
  Lines       73106    79286     +6180
  Branches        0     1069     +1069
==========================================
+ Hits        64753    70140     +5387
- Misses       8250     9031      +781
- Partials      103      115       +12
Continue to review full report at Codecov.
@trxcllnt This seems to include some unwanted commits. Can you rebase on latest master?
@xhochy I've tried doing …
@trxcllnt Also had a look at this and I don't know what's going wrong here. Really puzzled.
I can look at fixing this tomorrow. How much do you care about maintaining the commit history?
The problem is that this patch started before the Parquet monorepo merge. Something about the merged repo history confuses rebase when the base commit is before the merge.
@wesm we can drop the commit history, I have it all in my fork if I need something. It went through a bunch of iterations, so a lot of the code in those commits has been factored out at this point.
Done. Please …
Let me know when this is merge-ready and I'll take a glance through as I'm able. I figure a lot of feedback / improvements will be handled in follow-up PRs.
Thanks @wesm, I pulled and it all looks good. We can do feedback in follow-up PRs or discussions on the mailing list. Let me know if I can help point out where things are, or if you have any questions about what's going on. Here's a checklist to build and test locally: …
While there are always things to improve, I feel good this is ready to merge. I've been updating the Graphistry codebase to use the new APIs with no issues, and I've started work (tweet) on libraries to integrate with other tools in the node ecosystem.
Thanks for this Paul! Some notable changes (master -> refactor): … Interestingly there seems to be a huge improvement in anything that decodes utf8 data. Is that because we're using Buffer.fromString to decode utf8 in node now? There are some minor regressions elsewhere, but nothing to be too concerned about. Might be worth trying to tackle in some follow-up PRs though.
@TheNeuralBit if you could let me know when you're +1 on merging this, I can go ahead. I am swamped this week so don't want to hold this up, and I will do my best to leave some comments when I can for post-merge follow-up work.
Yep, node's Buffer is implemented in C++, so I try to use that if available. Theoretically browsers should be able to do something similar to node with the TextEncoder, but I haven't validated it yet.

For the chunked-random-access tests, I was expecting to take a bit of a hit, since the binary-search in … But before we make that change, I was thinking we could probably see a boost by making our search a bit more intelligent too. If we pick the starting chunk as the search index divided by the average chunk length (discarding the shortest and longest), then instead of starting at 0 every time and almost guaranteeing …

There's also probably a bit of a hit due to the vector method rebinding starting here, as well as the getters for each buffer being pass-throughs back to the Vector's … I tried to bind them as closely to the Vector instance as possible (at the bottom of …).

DictionaryVectors are going to be a bit trickier. When we see a DictionaryBatch with the … But since the … When we're reading or writing, we keep track of the singleton …
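The chunk-search heuristic described above could look roughly like this. A minimal TypeScript illustration, not the PR's code: `findChunk` and `chunkLengths` are hypothetical names, and a plain average stands in for the "discard the shortest and longest" refinement.

```ts
// Guess a starting chunk from the average chunk length, then walk to the
// chunk that actually contains `index`, instead of always scanning from 0.
function findChunk(chunkLengths: number[], index: number): number {
  const total = chunkLengths.reduce((a, b) => a + b, 0);
  const avg = total / chunkLengths.length || 1; // guard against empty input
  // Initial guess: the chunk that would hold `index` if all chunks were average-sized
  let chunk = Math.min(chunkLengths.length - 1, Math.floor(index / avg));
  // Offset of the guessed chunk's first element
  let offset = chunkLengths.slice(0, chunk).reduce((a, b) => a + b, 0);
  // Walk left or right until `index` falls inside [offset, offset + length)
  while (chunk > 0 && index < offset) {
    offset -= chunkLengths[--chunk];
  }
  while (chunk < chunkLengths.length - 1 && index >= offset + chunkLengths[chunk]) {
    offset += chunkLengths[chunk++];
  }
  return chunk; // the element sits at (index - offset) within this chunk
}
```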
Oh I forgot, another thing probably affecting the results: the …
@TheNeuralBit I did a pass last night on performance -- less binding, fewer getters, stuff like that. Here are the results I'm seeing now: …
Wow! Thanks for tackling the perf stuff - those numbers look great! I re-ran the benchmarks on my laptop and found a similar improvement 🎉 I don't think I'll have time for a thorough code review in the near future, but I've run your branch through various tests to make sure the features I'm interested in still work, so I'm 👍 on this. LGTM @wesm
And after the latest pass, I got the numbers for the two parse tests back up: …
Merging this. Can you resolve all the relevant JIRAs? Thank you!
thanks @wesm!
It's the big one; The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. Check out this sweet gif of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like uber/deck.gl 🎉
The JIRAs
In addition to everything I detail below, this PR closes the following JIRAs: …
The stats
The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0: …
The fixes
- `Vector#indexOf(value)` works for all DataTypes
- `Vector#set(i, value)` now works for all DataTypes (a short sketch of both methods follows this list)
- `isDelta` now correctly updates the dictionaries for all Vectors that point to that dictionary, even if they were created before the delta batch arrived
- `arrow2csv` fixes:
  - ignores `stdin` if it's a TTY
  - reads from `stdin`
  - prints the `help` text when we don't understand the input
  - works with `head` and `less`
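For context, here is roughly how those two Vector methods look in use. This is a minimal sketch: the `apache-arrow` package name and the `FloatVector.from` helper are assumptions about the post-refactor API, not code from this PR.

```ts
import { FloatVector } from 'apache-arrow';

// Build a Float32 vector from a typed array (FloatVector.from assumed)
const vec = FloatVector.from(new Float32Array([1, 2, 3]));

console.log(vec.indexOf(2)); // -> 1; indexOf(value) works for all DataTypes
vec.set(0, 42);              // set(i, value) now works for all DataTypes too
console.log(vec.get(0));     // -> 42
```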
The upgrades
- `RecordBatchReader.from()` will peek at the underlying bytes, and return the correct implementation based on whether the data is an Arrow File, Stream, or JSON (a usage sketch follows this list)
- `RecordBatchFileReader` now supports random-access seek, enabling more efficient web-worker/multi-process workflows
- `RecordBatchStreamReader` can now read multiple tables from the same underlying socket
- `MessageReader` now guarantees/enforces message body byte alignment (this one even surfaced bugs in node core and the DOM streams polyfill)
- `RecordBatchWriter.writeAll()` method to easily write a Table or stream of RecordBatches
- Readers and writers accept `Iterable<Buffer>` or `AsyncIterable<Buffer>`; node's `ReadableStream` or `fs.FileHandle`; the DOM's `ReadableStream` or `ReadableByteStream`; or the `Response` returned from the `fetch()` API. (Wrapping the FileReader is still todo)
- `throughNode()`/`throughDOM()` …
- `toReadableNodeStream()`/`toReadableDOMStream()` …
- `pipe()`/`pipeTo()`/`pipeThrough()` …
- `DataType` generics now flow recursively
- Refactored `Data` class
- `Visitor` class with support for optional, more narrow `visitT` implementations
- `Chunked` base class for the applicative (concat) operation
  - `chunkedInst.chunks` field is the list of inner chunks
- `Column` class extends `Chunked`, combines `Field` with the chunks (provides access to the field `name` from the Schema)
- `RecordBatch#concat(...batchesOrTables)` now returns a Table
- `Table` extends `Chunked`, so it inherits:
  - `Table#slice(from, to)`
  - `Table#concat(...batchesOrTables)`
  - `Table#getChildAt(i)` exists, alias of `getColumnAt(i)`
- `Table#getColumn[At]()` returns a Column
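To make the new entry point concrete, here is a rough usage sketch. The `apache-arrow` package name and the exact `from()`/iteration behavior are assumptions based on the description above; `/data.arrow` is a placeholder URL.

```ts
import { RecordBatchReader } from 'apache-arrow';

(async () => {
  // from() peeks at the underlying bytes and returns the matching reader
  // implementation (File, Stream, or JSON); here the source is the Response
  // from the fetch() API.
  const reader = await RecordBatchReader.from(await fetch('/data.arrow'));

  let rows = 0;
  for await (const batch of reader) {
    rows += batch.length; // count rows across all RecordBatches
  }
  console.log(`read ${rows} rows`);
})();
```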
The breaking changes
- `Table#batches` is now `Table#chunks`, which it inherits from `Chunked` (maybe controversial, open to aliasing)
- `Table#batchesUnion` is now just... the Table instance itself (also maybe controversial, open to aliasing)
- `DataType#TType` is now `DataType#typeId` -- it should have always been this, was a typo. Easy to alias if necessary. (A short example follows this list.)
- `Visitors` …
The tests
- … `bin/integration.js`, and they finish much quicker
- `RecordBatchJSONWriter` has been implemented so we can easily debug and validate written output
- Tests use `memfs` to mock the file system, which contributes to test performance improvements

The build
- … (with `Symbol.asyncIterator` enabled)

Misc
- Moved `arrow2csv` to `js/bin/arrow2csv`, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (`cat foo.arrow | js/bin/arrow2csv.js`)
)Todos
- … (`arrow2csv`)