GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939
THIS PR IS NOT READY YET:
`.get` vs `[index]`, and maybe even the nuance of how `.get` can use a binary search algorithm when the `Vector` consists of multiple chunks.
`.filter` on the `Vector` object to ensure 100% compatibility with JavaScript arrays.

Rationale for this change
Certain codebases that previously used row-oriented data access may wish to migrate to Arrow to save serialization and deserialization costs and to gain access to fast column-oriented operations. As it stands, Arrow is close to a drop-in replacement for row-oriented data such as a JavaScript Array of objects. This is valuable for incrementally migrating legacy codebases to Arrow, as it is frequently infeasible to rewrite an application to use column-oriented data access patterns. For most data, JavaScript-object-compatible, row-oriented access is already provided via the `StructRowProxy`. However, if the structs themselves include a `Vector`, existing code will break, because it assumes the `Vector` object behaves like a JavaScript array, which it does not due to the lack of index access. An example of such a data structure is as follows:

In this case, with the Arrow JS library as it is, the API consumer is unable to get an individual element of the `y` array via `table[i].y[j]`. Instead, the API consumer must use `table.get(i).y.get(j)`. When migrating a legacy codebase to Arrow, this requires a large refactor of the entire codebase, which is infeasible in a short time. This negates the advantage of using Arrow as a drop-in replacement and prevents incremental migration of code to Arrow.

What changes are included in this PR?
To address this problem, this patch adds a Proxy at the root of the prototype chain for the `Table`, `RecordBatch`, and `Vector` objects and allows index access on these objects for backward compatibility. In other words, objects like `Vector` now support `vector[i]` in addition to `vector.get(i)`.

However, code should not use `vector[i]` where performance matters: it is more than an order of magnitude slower than `vector.get(i)`, as ES6 Proxy objects are quite slow. Index access should only be used to provide compatibility for legacy codebases. For code that needs high performance, `.get` remains a much better solution. This is also why the Proxy object is added to the root of the prototype chain, as opposed to the usual pattern where a Proxy object is returned from a constructor: with the Proxy at the root, lookups of existing properties and methods such as `.get` resolve before reaching it, so only index access pays the Proxy cost.

Documentation has been added to compare the performance of the various access methods.
Are there any user-facing changes?
`Table`, `RecordBatch`, and `Vector` elements can now be accessed via index operators, albeit much more slowly than via `.get`. All changes should be backward compatible.

Are these changes tested?
Yes. The performance of the base objects does not seem to be affected. To establish this, we first look at how much variability in `ops/s` there is between two runs of `yarn perf`. The left plot below shows the percent difference between the two runs. This gives a baseline level of variability between runs (an indication rather than a statistically rigorous measure). Most benchmarks are within 20% of each other, although a few tests at the bottom show higher variability. The right plot shows the difference in `ops/s` before and after the patch. No benchmark shows significantly higher or lower performance, and the plot looks superficially similar to the variability comparison. As such, we can say that the performance of the objects is not impacted by this patch.

The following shows the performance of `vector.get(i)` vs `vector[i]`. The raw results show index access at about 50 ops/s, while `.get` is around 750 ops/s. This is a roughly 15x difference, a little over one order of magnitude. In short, index access shouldn't be used except for backward compatibility with code that expects native JavaScript arrays.

`Table`, `RecordBatch`, and `Vector` #34936
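The `vector.get(i)` vs `vector[i]` comparison reported under "Are these changes tested?" can be approximated with a small micro-benchmark. The sketch below is not the `yarn perf` suite: `BenchVector`, `sumViaGet`, `sumViaIndex`, and `opsPerSecond` are hypothetical helpers written for this illustration, and the numbers it prints depend on the machine and JavaScript engine.

```typescript
// Hypothetical stand-in with a Proxy at the root of its prototype chain;
// NOT the Arrow JS Vector and NOT the benchmark used in this PR.
class BenchVector {
  [index: number]: number | undefined;

  private readonly data: Float64Array;

  constructor(n: number) {
    this.data = new Float64Array(n);
    for (let i = 0; i < n; i++) this.data[i] = i;
  }

  get length(): number {
    return this.data.length;
  }

  get(i: number): number | undefined {
    return this.data[i];
  }
}

Object.setPrototypeOf(BenchVector.prototype, new Proxy({}, {
  get(target, prop, receiver) {
    if (typeof prop === "string" && /^(0|[1-9]\d*)$/.test(prop)) {
      return (receiver as BenchVector).get(Number(prop));
    }
    return Reflect.get(target, prop, receiver);
  },
}));

// Sums all elements through the direct method call.
function sumViaGet(v: BenchVector): number {
  let total = 0;
  for (let i = 0; i < v.length; i++) total += v.get(i) ?? 0;
  return total;
}

// Sums all elements through index access, which goes through the Proxy trap.
function sumViaIndex(v: BenchVector): number {
  let total = 0;
  for (let i = 0; i < v.length; i++) total += v[i] ?? 0;
  return total;
}

// Repeats fn for at least minMs milliseconds and reports runs per second.
function opsPerSecond(fn: () => void, minMs = 200): number {
  let runs = 0;
  let elapsed = 0;
  const start = Date.now();
  do {
    fn();
    runs++;
    elapsed = Date.now() - start;
  } while (elapsed < minMs);
  return (runs * 1000) / elapsed;
}

const benchVector = new BenchVector(10_000);
console.log("get   ops/s:", opsPerSecond(() => sumViaGet(benchVector)).toFixed(1));
console.log("index ops/s:", opsPerSecond(() => sumViaIndex(benchVector)).toFixed(1));
```

Both loops compute the same sum, so any difference in `ops/s` comes from the access path rather than from the work performed.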