ARROW-2828: [JS] Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies #3290
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3290      +/-   ##
==========================================
- Coverage   88.57%   88.46%    -0.11%
==========================================
  Files         540      631       +91
  Lines       73106    79286     +6180
  Branches        0     1069     +1069
==========================================
+ Hits        64753    70140     +5387
- Misses       8250     9031      +781
- Partials      103      115       +12
Continue to review full report at Codecov.
@trxcllnt This seems to include some unwanted commits. Can you rebase on latest master?
@xhochy I've tried doing …
@trxcllnt Also had a look at this and I don't know what's going wrong here. Really puzzled.
I can look at fixing this tomorrow. How much do you care about maintaining the commit history?
The problem is that this patch started before the Parquet monorepo merge. Something about the merged repo history confuses rebase when the base commit is before the merge.
@wesm we can drop the commit history, I have it all in my fork if I need something. It went through a bunch of iterations, so a lot of the code in those commits has been factored out at this point.
Done. Please …
Let me know when this is merge-ready and I'll take a glance through as I'm able. I figure a lot of feedback / improvements will be handled in follow-up PRs.
Thanks @wesm, I pulled and it all looks good. We can do feedback in follow-up PRs or discussions on the mailing list. Let me know if I can help point out where things are, or if you have any questions about what's going on. Here's a checklist to build and test locally: …
While there are always things to improve, I feel good this is ready to merge. I've been updating the Graphistry codebase to use the new APIs with no issues, and I've started work (tweet) on libraries to integrate with other tools in the node ecosystem.
Thanks for this Paul! Some notable changes (master -> refactor): … Interestingly there seems to be a huge improvement in anything that decodes utf8 data. Is that because we're using Buffer.fromString to decode utf8 in node now? There are some minor regressions elsewhere, but nothing to be too concerned about. Might be worth trying to tackle in some follow-up PRs though.
@TheNeuralBit if you could let me know when you're +1 on merging this, I can go ahead. I am swamped this week so don't want to hold this up, and I will do my best to leave some comments when I can for post-merge follow-up work.
Yep, node's Buffer is implemented in C++, so I try to use that if available. Theoretically browsers should be able to do something similar to node with the TextEncoder, but I haven't validated it yet.

For the chunked-random-access tests, I was expecting to take a bit of a hit, since the binary-search in … But before we make that change, I was thinking we could probably see a boost by making our search a bit more intelligent too. If we pick the starting chunk as the search index divided by the average chunk length (discarding the shortest and longest), then instead of starting at 0 every time and almost guaranteeing …

There's also probably a bit of a hit due to the vector method rebinding starting here, as well as the getters for each buffer being pass-throughs back to the Vector's … I tried to bind them as closely to the Vector instance as possible (at the bottom of …).

DictionaryVectors are going to be a bit trickier. When we see a DictionaryBatch with the … But since the … When we're reading or writing, we keep track of the singleton …
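The chunk-search heuristic described above could look roughly like this. A minimal TypeScript illustration, not the PR's code: `findChunk` and `chunkLengths` are hypothetical names, and a plain average stands in for the "discard the shortest and longest" refinement.

```ts
// Guess a starting chunk from the average chunk length, then walk to the
// chunk that actually contains `index`, instead of always scanning from 0.
function findChunk(chunkLengths: number[], index: number): number {
  const total = chunkLengths.reduce((a, b) => a + b, 0);
  const avg = total / chunkLengths.length || 1; // guard against empty input
  // Initial guess: the chunk that would hold `index` if all chunks were average-sized
  let chunk = Math.min(chunkLengths.length - 1, Math.floor(index / avg));
  // Offset of the guessed chunk's first element
  let offset = chunkLengths.slice(0, chunk).reduce((a, b) => a + b, 0);
  // Walk left or right until `index` falls inside [offset, offset + length)
  while (chunk > 0 && index < offset) {
    offset -= chunkLengths[--chunk];
  }
  while (chunk < chunkLengths.length - 1 && index >= offset + chunkLengths[chunk]) {
    offset += chunkLengths[chunk++];
  }
  return chunk; // the element sits at (index - offset) within this chunk
}
```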
Oh I forgot, another thing probably affecting the results: the …
@TheNeuralBit I did a pass last night on performance -- less binding, fewer getters, stuff like that. Here are the results I'm seeing now: …
Wow! Thanks for tackling the perf stuff - those numbers look great! I re-ran the benchmarks on my laptop and found a similar improvement 🎉 I don't think I'll have time for a thorough code review in the near future, but I've run your branch through various tests to make sure the features I'm interested in still work, so I'm 👍 on this. LGTM @wesm
And after the latest pass, I got the numbers for the two parse tests back up: …
Merging this. Can you resolve all the relevant JIRAs? Thank you!
thanks @wesm!
It's the big one; The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. Check out this sweet gif of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like uber/deck.gl 🎉
The JIRAs
In addition to everything I detail below, this PR closes the following JIRAs: …
The stats
The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0: …
The fixes
- `Vector#indexOf(value)` works for all DataTypes
- `Vector#set(i, value)` now works for all DataTypes (a short sketch of both methods follows this list)
- `isDelta` now correctly updates the dictionaries for all Vectors that point to that dictionary, even if they were created before the delta batch arrived
- `arrow2csv` fixes:
  - ignores `stdin` if it's a TTY
  - reads from `stdin`
  - prints the `help` text when we don't understand the input
  - works with `head` and `less`
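For context, here is roughly how those two Vector methods look in use. This is a minimal sketch: the `apache-arrow` package name and the `FloatVector.from` helper are assumptions about the post-refactor API, not code from this PR.

```ts
import { FloatVector } from 'apache-arrow';

// Build a Float32 vector from a typed array (FloatVector.from assumed)
const vec = FloatVector.from(new Float32Array([1, 2, 3]));

console.log(vec.indexOf(2)); // -> 1; indexOf(value) works for all DataTypes
vec.set(0, 42);              // set(i, value) now works for all DataTypes too
console.log(vec.get(0));     // -> 42
```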
The upgrades
- `RecordBatchReader.from()` will peek at the underlying bytes, and return the correct implementation based on whether the data is an Arrow File, Stream, or JSON (a usage sketch follows this list)
- `RecordBatchFileReader` now supports random-access seek, enabling more efficient web-worker/multi-process workflows
- `RecordBatchStreamReader` can now read multiple tables from the same underlying socket
- `MessageReader` now guarantees/enforces message body byte alignment (this one even surfaced bugs in node core and the DOM streams polyfill)
- `RecordBatchWriter.writeAll()` method to easily write a Table or stream of RecordBatches
- Readers and writers accept `Iterable<Buffer>` or `AsyncIterable<Buffer>`; node's `ReadableStream` or `fs.FileHandle`; the DOM's `ReadableStream` or `ReadableByteStream`; or the `Response` returned from the `fetch()` API. (Wrapping the FileReader is still todo)
- `throughNode()`/`throughDOM()` …
- `toReadableNodeStream()`/`toReadableDOMStream()` …
- `pipe()`/`pipeTo()`/`pipeThrough()` …
- `DataType` generics now flow recursively
- Refactored `Data` class
- `Visitor` class with support for optional, more narrow `visitT` implementations
- `Chunked` base class for the applicative (concat) operation
  - `chunkedInst.chunks` field is the list of inner chunks
- `Column` class extends `Chunked`, combines `Field` with the chunks (provides access to the field `name` from the Schema)
- `RecordBatch#concat(...batchesOrTables)` now returns a Table
- `Table` extends `Chunked`, so it inherits:
  - `Table#slice(from, to)`
  - `Table#concat(...batchesOrTables)`
  - `Table#getChildAt(i)` exists, alias of `getColumnAt(i)`
- `Table#getColumn[At]()` returns a Column
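To make the new entry point concrete, here is a rough usage sketch. The `apache-arrow` package name and the exact `from()`/iteration behavior are assumptions based on the description above; `/data.arrow` is a placeholder URL.

```ts
import { RecordBatchReader } from 'apache-arrow';

(async () => {
  // from() peeks at the underlying bytes and returns the matching reader
  // implementation (File, Stream, or JSON); here the source is the Response
  // from the fetch() API.
  const reader = await RecordBatchReader.from(await fetch('/data.arrow'));

  let rows = 0;
  for await (const batch of reader) {
    rows += batch.length; // count rows across all RecordBatches
  }
  console.log(`read ${rows} rows`);
})();
```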
The breaking changes
- `Table#batches` is now `Table#chunks`, which it inherits from `Chunked` (maybe controversial, open to aliasing)
- `Table#batchesUnion` is now just... the Table instance itself (also maybe controversial, open to aliasing)
- `DataType#TType` is now `DataType#typeId` -- it should have always been this, was a typo. Easy to alias if necessary. (A short example follows this list.)
- `Visitors` …
The tests
- … `bin/integration.js`, and they finish much quicker
- `RecordBatchJSONWriter` has been implemented so we can easily debug and validate written output
- Tests use `memfs` to mock the file system, which contributes to test performance improvements

The build
- … (with `Symbol.asyncIterator` enabled)

Misc
- Moved `arrow2csv` to `js/bin/arrow2csv`, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (`cat foo.arrow | js/bin/arrow2csv.js`)
)Todos
- … (`arrow2csv`)