ARROW-1693: [JS] Expand JavaScript implementation, build system, fix integration tests #1294
Conversation
|
Want to go ahead and ignore the vector layout metadata? Per ARROW-1785. I will wait a day or so for further feedback to circulate, then proceed with a removal of this metadata. We'll need to update the Flatbuffers files in JS again as part of this, I will give you write access on my fork so you can push directly to the PR branch as needed |
|
@wesm yep, working on this tonight. do you mind if I add to this PR vs starting a new branch? |
|
no problem, feel free to keep adding here. I will take a little time to review also |
Force-pushed bf48cda to 00e54b5
|
@wesm I believe this branch is ready. We should revalidate datetime and dictionary once we can get Pandas to generate a CSV from the test data. The integration tests are finally passing, so I feel good about this one: The last commit re-enables node_js job in travis so we can verify the above. |
Force-pushed 64e318c to e7ae4c7
|
awesome, thanks @trxcllnt! I'm going to work through the patch to leave any comments that jump out but this is really exciting |
js/gulp/test-task.js
Can this version number here be gotten from the environment / pom file per chance?
I included this in my response below
js/gulp/test-task.js
Simple way to make this file path more easily configurable / less hard-coded?
I included this in my response below
js/npm-release.sh
Aside: what we're going to want to do for ASF release purposes: one script to produce a tarball of the JS project that is sufficient to publish to NPM after. Then a script in the tarball that can publish the project to NPM. So the ASF signed artifact that we upload to SVN will have everything that's needed to publish the project to NPM
@wesm That makes sense. The way I have things set up, we compile and publish multiple modules to npm:
- one large-ish module that you can get via `npm install apache-arrow`
- the rest as smaller/specialized modules under the `@apache-arrow` npm organization, which can be installed via the formula `npm install @apache-arrow/<target-name>`. For example, `npm install @apache-arrow/es5-cjs` installs the slimmed-down ES5/CommonJS target.

The `npm run build` command compiles all the output targets to the (gitignored) targets directory. The `lerna publish --yes --skip-git --cd-version $bump --force-publish=*` command publishes all the targets to npm. So from the sound of it, all we need to do is tar up the targets directory with a shell script that installs and runs `lerna publish`, and we're good to go? If so, I can do that tonight.
Seems reasonable. The other side of this is creating the signed tarball for voting, and making sure the tarball is sufficient for the post-release upload to NPM. We'll need to copy some files from the root directory (like the license and notice files). I can help with this
The build also copies extra files from the js folder into each of the packages, so we just need to change this line to `['../LICENSE.txt', '../NOTICE.txt', 'README.md']`. Do we also need to add the info in `js/LICENSE` to the top-level `notice.txt`?
Yes, what's in js/LICENSE should go in the top level LICENSE.txt at the bottom. Then we can copy that one license into the JS tarball
@wesm done
js/src/reader/arrow.ts
What do you recommend as a resource for getting up to speed on TypeScript? Where is the line between TypeScript and ES6?
@wesm anything type-related (type annotations, interfaces, generics, and type alias declarations like `type Foo = { foo: string }`) is TypeScript; the rest is ES. If you're curious about an individual feature, you can run the build (`npm run build`) and compare the transpiled output (in the targets directory) with the TS source. We transpile to multiple JS versions and module formats, but it's probably easiest to compare against `targets/es2015/esm` or `targets/esnext/esm`.
TS code-gens/polyfills missing features depending on the target environment. For example, ES5 doesn't have generators, so TS compiles generator functions into an iterator state machine, but the ES2015 target leaves them as-is. And ES2015 doesn't have async iterator functions, so TS code-gens the async iterator state machine with Promises, but the ESNext target leaves them as-is.
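To make the split concrete, here's a hypothetical snippet (the `Pair` type and `pairs` function are made up for illustration, not code from this PR): the type alias and annotations are erased by `tsc`, while the generator is either emitted as-is (ES2015+) or compiled down to a state machine (ES5).

```typescript
// Everything type-level below disappears from the compiled output.
type Pair = { lo: number; hi: number };              // TS only: type alias

function* pairs(n: number): IterableIterator<Pair> { // TS only: annotations
  for (let i = 0; i < n; i++) {
    yield { lo: i, hi: i * 2 };                      // ES2015: generator
  }
}

// Consuming the iterator manually works the same on every target;
// only how the generator body is compiled differs.
const his: number[] = [];
const it = pairs(3);
let r = it.next();
while (!r.done) {
  his.push(r.value.hi);
  r = it.next();
}
console.log(his); // → [0, 2, 4]
```

Diffing this against the emitted `targets/es5` and `targets/es2015` output shows exactly which features TS had to polyfill.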
js/test/arrows/json/datetime.json
I'm not sure about checking in all these .json and .arrow files, is there some way we can automate their generation as part of the integration testing? Then they don't have to be modified when we expand the integration test suite
@wesm yes, that would be ideal.
I generated the JSON from the Python integration tests like this:
from integration_test import generate_nested_case
from integration_test import generate_decimal_case
from integration_test import generate_datetime_case
from integration_test import generate_primitive_case
from integration_test import generate_dictionary_case
generate_nested_case().write("../js/test/arrows/json/nested.json")
generate_decimal_case().write("../js/test/arrows/json/decimal.json")
generate_datetime_case().write("../js/test/arrows/json/datetime.json")
generate_dictionary_case().write("../js/test/arrows/json/dictionary.json")
generate_primitive_case([7, 10]).write("../js/test/arrows/json/primitive.json")
generate_primitive_case([0, 0, 0]).write("../js/test/arrows/json/primitive-empty.json")

Then I had to manually edit the "primitive.json" file to remove the "binary_nullable" and "binary_nonnullable" columns, because the C++ command fails if they're present:
$ ../cpp/build/release/json-integration-test \
--integration --mode=JSON_TO_ARROW \
--json=./test/arrows/json/primitive.json \
--arrow=./test/arrows/cpp/file/primitive.arrow
Found schema: bool_nullable: bool
bool_nonnullable: bool not null
int8_nullable: int8
int8_nonnullable: int8 not null
int16_nullable: int16
int16_nonnullable: int16 not null
int32_nullable: int32
int32_nonnullable: int32 not null
int64_nullable: int64
int64_nonnullable: int64 not null
uint8_nullable: uint8
uint8_nonnullable: uint8 not null
uint16_nullable: uint16
uint16_nonnullable: uint16 not null
uint32_nullable: uint32
uint32_nonnullable: uint32 not null
uint64_nullable: uint64
uint64_nonnullable: uint64 not null
float32_nullable: float
float32_nonnullable: float not null
float64_nullable: double
float64_nonnullable: double not null
binary_nullable: binary
binary_nonnullable: binary not null
utf8_nullable: string
utf8_nonnullable: string not null
Error message: Invalid: Encountered non-hex digit

The unit tests rely heavily on snapshot testing to validate the actual values in the vectors. I manually validated the data in the snapshots against the buffers using pyarrow and pandas, but that approach won't scale. Typically the snapshot files get checked into version control, but now that we have 11k snapshots, the snapshot files are around 21mb. I removed them from the repo b/c we don't want huge files. Now the CI server generates the snapshots once up front, then validates the compilation targets against those.
This will catch any cases where compiling the JS to different targets leads to failures (e.g. if the minifiers mangle names they weren't supposed to), but since we're not checking in the snapshot files, the CI server won't be able to tell us if a new PR causes a snapshot test to break. We can know that if we run the tests locally, but we can't rely on running the tests locally for each PR before merging.
I'm a bit torn here. On the one hand, I don't want to check in 21mb worth of tests to source control. On the other hand, I don't want to hand-write the 11k assertions that the snapshot tests represent (and would also presumably be many-MBs worth of tests anyway).
I believe git compresses files across the network? And if space-on-disk is an issue, I could add a post-clone script to automatically compress the snapshot files after checkout (about 3mb gzipped). Jest doesn't work with compressed snapshot files out of the box, but I could add some steps to the test runner to decompress the snapshots before running.
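For readers unfamiliar with the workflow, the snapshot idea can be sketched in a few lines (a toy stand-in, not Jest's actual implementation; the `'row 0'` key and column name are illustrative): serialize a value the first time a test runs, store it, and diff every later run against the stored copy.

```typescript
// Toy snapshot store: in Jest this lives in checked-in .snap files.
const snapshots = new Map<string, string>();

function matchesSnapshot(name: string, value: unknown): boolean {
  const serialized = JSON.stringify(value, null, 2);
  if (!snapshots.has(name)) {
    snapshots.set(name, serialized); // first run: record the snapshot
    return true;
  }
  return snapshots.get(name) === serialized; // later runs: compare
}

const first = matchesSnapshot('row 0', { int32_nullable: 7 });  // records
const same = matchesSnapshot('row 0', { int32_nullable: 7 });   // matches
const drift = matchesSnapshot('row 0', { int32_nullable: 8 });  // regression
console.log(first, same, drift); // → true true false
```

The "validate once" step is a human eyeballing the recorded serializations; after that, any diff fails the test.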
To your point about using the C++/Java writers to convert the JSON to Arrow buffers on the fly, we should 100% do that. This PR is marginally better since we can at least regenerate the arrow files easily enough, but ideally we don't have them at all and we can pipe them to the node process on the fly, or at a minimum, write to files then clean up after. We'll want a mode for local dev that skips this step, as incurring the JVM overhead to convert JSON to Arrow files is painful for debugging.
I left the code in there (commented out) to draw attention to this idea. I was on a plane when I checked this in and didn't have bandwidth to debug the CI scripts, otherwise I would have. We can pass the paths to the C++ and Java executables as CLI and/or env variables no problem.
I wasn't sure if the executables were already available in CI, or whether I would need to build them before running the JS tests. Building the C++ and Java projects in order to run the JS tests doesn't seem ideal, but I'm fine doing it if we have to.
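One possible shape for that on-the-fly conversion, sketched here with an assumed binary path and a temp file standing in until true in-memory piping is possible (nothing below is the actual test harness; `jsonToArrow` is a hypothetical helper):

```typescript
import { spawnSync } from 'child_process';
import { mkdtempSync, readFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Run the C++ json-integration-test binary to turn a JSON test file into
// an Arrow file, then read the bytes back for the JS reader. The binary
// path would be supplied by the caller (e.g. a CLI flag or env variable).
function jsonToArrow(jsonToArrowBin: string, jsonPath: string): Buffer {
  const arrowPath = join(mkdtempSync(join(tmpdir(), 'arrow-test-')), 'out.arrow');
  const result = spawnSync(jsonToArrowBin, [
    '--integration',
    '--mode=JSON_TO_ARROW',
    `--json=${jsonPath}`,
    `--arrow=${arrowPath}`,
  ]);
  if (result.status !== 0) {
    throw new Error(`json-to-arrow failed: ${result.stderr}`);
  }
  return readFileSync(arrowPath); // caller cleans up the temp dir afterwards
}
```

A local-dev mode could cache the generated files so debugging doesn't pay the C++/JVM startup cost on every run.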
js/test/integration-tests.ts
I'm a bit confused by these integration tests, maybe I'm missing something. Shouldn't JavaScript be reading the JSON and comparing what it thinks are the contents of the JSON with the binary coming from Java and C++? Here it looks like it is reading the C++ and Java binary files and comparing them with each other rather than its own version of the truth.
@wesm I think for this PR the goal was more to just replace the binary files in the repo with generated files, and adding tests to make sure our reader works the same for both C++ and Java generated files is just an added bonus. It might be a bit of a misnomer to call it an integration test.
I think the next step will be to add a JSON reader/writer and an Arrow IPC writer so we can integrate with the actual integration tests.
Got it. I think here we should comment exactly what's being tested (since these tests will pass if (and only if?) the Java <-> C++ integration tests are passing). If we can run these particular JS tests only in the Travis CI entry where we are already running the integration_test.py then we don't need to be checking in binary/JSON files to git
@wesm yes they should. @TheNeuralBit is right, the goal here was to get the JS version reading C++ and Java arrows identically, and this test is how I validate that it does. Right now it doesn't validate that it reads correctly (that's what the other tests do), just that there's no difference between the two. I think the plan is to update this test when we do have the JSON reader/writer in JS
TheNeuralBit left a comment
It's awesome to see support for so many more types, and consistency when reading both C++ and Java-generated files 🎉
js/src/vector/arrow.ts
Why move the calls to nullableMixin and fieldMixin from here and out to each individual call? Are there some subtle differences in some vectors that I'm missing?
@TheNeuralBit yeah so the ES6 class spec states that the `name` property of a class constructor is immutable (and they're also not allowed to be computed properties, like `let x = 'myClass'; class [x] extends Foo {}`). And anonymous class names default to `"Object"`, reading as `Object { data: Int32Array }` instead of `Int32Vector` when debugging. While this is ugly and hard to scale if we want to add more mixin behaviors, I figure it's a win for anyone using the library in the real world to see descriptive class names.
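A small sketch of the naming behavior (the mixin and class names here are hypothetical, not the actual arrow.ts code): a class expression returned from a mixin factory has no useful `name`, while a named class declaration extending the factory's result keeps its own.

```typescript
type Constructor<T = object> = new (...args: any[]) => T;

// A mixin factory returns an anonymous class expression.
const nullableMixin = <T extends Constructor>(Base: T) =>
  class extends Base {
    // nullability behavior would live here
  };

class Vector {}

// Applying the mixin inside a named class declaration preserves the name:
class Int32Vector extends nullableMixin(Vector) {}

// Capturing the factory's return value directly does not:
const AnonVector = nullableMixin(Vector);

console.log(Int32Vector.name);                  // "Int32Vector"
console.log(AnonVector.name === 'Int32Vector'); // false: anonymous/generated
```

This is why the mixin calls are repeated at each declaration site instead of being factored into one shared helper.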
Ah makes sense, I figured there must be a good reason. This seems like a great application for some kind of preprocessor with macros or a code generator... but I don't know of any that would integrate well with JS build/dev tools
@TheNeuralBit yeah we could use babel-codegen, babel-preval, or babel-macros if we want. I was hoping to avoid babel if possible, but since we're webpacking the es2015+ UMD bundles anyway, it wouldn't be too much of a headache.
@TheNeuralBit but if we do want to do more compilation steps beyond what the TS compiler does, it'd be neat to also run prepack on the flatbuffers generated code
js/gulpfile.js
Should createTestData and cleanTestData be uncommented so we can remove the arrow files from the repo? I'm thinking these are probably commented now so that other contributors will be able to run the tests without building the Java and C++ impls - if that's the case, maybe we should separate out integration tests, which require the other libraries, and unit tests, which can be run stand-alone?
@TheNeuralBit yes, definitely. I put these in to remind us to generate test data on the fly (and how to do it) for the tests, and remove the arrow files from the test directory. Making sure the CI environment had the C++ and Java libs built before the JS tests run was a bit more than I could bite off yesterday on my flight home :)
I guess I'm not quite understanding what snapshot tests accomplish here that normal array comparisons would not. In Java and C++ we have functions that compare the contents of arrays. So when you say hand-writing the snapshot test assertions, what's being tested, and why is that the only way to test that behavior? Is there a concern that a programmatic comparison might not be as strong of an assertion as a UI-based test (what the values from the arrays would actually appear as in the DOM)? Having the possibility of a single PR bloating the git history by whatever the snap files gzip down to doesn't seem like a good idea. Even having large diffs as the result of automatically generated files on commit isn't ideal
|
@wesm the Jest docs on snapshot testing highlight its utility testing React components, but it's really just a form of test code generation. The tests run against every combination of compilation target and test data. Snapshots capture a bit of runtime type info that would otherwise have to be asserted explicitly -- for example, that calling `uint64Vector.get(i)` returns a `Uint32Array` of two elements. They're also helpful for catching regressions (or comparing against pandas) in `Table.toString()`.
It also gives reviewers a chance to see what the tests produce.
|
It sounds like this is useful for catching code regressions from PR to PR, but it clashes with my sensibilities about testing data libraries, which is that we should have explicit unit tests asserting the correct behavior. In the example you gave ("Snapshots capture a bit of runtime type info that would otherwise have to be asserted explicitly..."), it seems like we could just write that assertion by hand. I'm no expert, so there may be things I'm missing -- are some of the test assertions dependent on the flavor of the deployment target?
|
Snapshots are just a different, data-centric way of writing assertions. Give it a lot of test data, validate the test results once, then compare diffs after that. If you can eyeball test results and know whether it works, then the computer codegens all the dreadful bits about comparing types, values, etc. (even when it's late and you might otherwise forget to test an edge case).
No, the assertions should be identical regardless of the compilation target -- they're generated once at the beginning, then all the targets are compared against the same snapshots. I may have mentioned this before, but they've also helped catch minification bugs. Like before, when we did return Long instances, Closure Compiler minified the class name down to something like "zw", so the snapshot tests failed for just the ES5/UMD target. But on the whole I can't argue with your position. All I can say is I'm probably pretty lazy by normal standards, so I try to make my computer do as much of my homework as possible.
|
Maybe another way to phrase it is, disk and network are cheap, my/our time is not. ;-) edit: shit, I misread the snapshot count; we have 113,940 snapshots, not 11,394 |
This "validate the test results once" part is where I'm getting lost. How do you know whether anything is correct if you don't write down what you expect to be true? I can help with rallying the troops to write more tests. I am a bit bandwidth-constrained at the moment with all the 0.8.0 stuff in progress, but I am hopeful that some others can get involved, and this will also help with beating on the API and finding rough edges. cc @leifwalsh @scottdraves
Ah right, in this case the JSON files are the initial source of truth. I compared the snapshots against the Arrow files read via pandas/pyarrow, and it looked correct. After this (assuming stable test data), the snapshots are the source of truth. If we decide to change the test data, then we have to re-validate that the snapshots are what we expect them to be. But I want to stress, I'm not against doing it differently. I'm also bandwidth-constrained, and snapshots get high coverage with minimal effort. It sounds like the JSON reader should provide all the same benefits as snapshot testing. From that perspective, I see snapshots as a stop-gap until the JS JSON reader is done (unless there's a way we can validate columns with the C++ or Java JSON readers from the JS tests?) With that in mind, I agree it's best not to commit the snapshots to the git history if we're just going to remove them once the JSON reader is ready. In the interim, I don't mind validating any new JS PRs against my local snapshots, as the volume of JS PRs isn't that high yet.
|
Yes, definitely, we are in agreement. We should push for a JSON reader ASAP -- following the C++ reader as a guide, I do not think it is that big of a project, to be honest, when you consider all the hardship of dealing with parsing JSON in C++, which is a complete non-issue in JavaScript. |
|
@wesm that sounds good to me. In the meantime can we get this PR merged + finish ASF release scripts, and push a new version to npm? I'm at the point where not having the latest on npm is going to be a problem for projects at work soon, @TheNeuralBit may be feeling this too. |
|
That's good for me. As soon as the release scripts are good to go I can conduct the release vote on the mailing list. We can close the vote in less than the usual 72 hours so long as we get 3 PMC votes. So we'll need a quick "here's how to verify the release candidate" blurb to direct people to when we start the release vote
|
Does anything still need to be done on this branch? If we are still not close to being able to cut a JS release by early next week I will rearrange my priorities to help out |
|
@wesm nope, only thing left to do is the ASF release scripts I think |
|
@wesm I understand you may be busy, so do you mind if I go ahead and merge this? |
|
Sorry I have been dragging my feet because I’m not really on board with checking in data files that can be generated as part of CI. Per Slack conversation it seems there are some roadblocks so I’m available as needed today and tomorrow to get this sorted out |
|
Sorry that I missed the thing. I will figure out what's going on here:
|
|
This doesn't fail for me locally: |
|
I found the problem -- one of the primitive integration test files was being clobbered and not run, which was suppressing a failure that should have been raised a long time ago. In the meantime, there was also a regression from the Java refactor, and we are no longer able to fully read unsigned integer types. I will hack the integration tests for now and open a JIRA about fixing it; here's an example of trying to read a
…gned max for bit width
|
OK, integration tests should pass now, fingers crossed. I will merge once the build is green, thanks @trxcllnt and @TheNeuralBit for the patience, it is appreciated |
Force-pushed 9393b42 to c86dc02
|
@wesm awesome, thanks! After my latest commit, the integration tests all pass now for me locally. Are you fine with this PR as-is, or should I close it and do one from a new branch w/o the test data commits? |
|
The PR tool should squash out those commits, so I don't think it's a problem. I'll let you know if I run into any issues after the build runs |
Force-pushed c86dc02 to c72134a
|
There was a non-deterministic Plasma failure in the C++/Python entry, but the integration test entry also failed: |
|
@wesm yeah, looks like Closure Compiler threw an exception building the ES5 UMD target: https://travis-ci.org/apache/arrow/jobs/304499884#L4649. I'm not certain, but it could be related to the integration tests running with JDK7 instead of 8. I'll switch the job to use the JS version of the Closure Compiler which, while slower, won't be affected by Java externalities. |
|
@wesm actually this might be tough -- the JS version of closure-compiler is a bit outdated and broken, and the Java version hasn't supported Java 7 since October. I don't want to skip running the tests on the ES5 UMD bundle, as that's the lowest-common-denominator for anyone wanting to experiment with Arrow in the browser, and the integration tests validate that public methods and properties don't get minified away. Is it possible to update the integration job to openjdk8 (like the java version)? If not, I can create a sibling |
|
Totally fine to move the integration tests to jdk8 |
…losure-compiler-scripts
Force-pushed 0a84b23 to 5660eb3
|
@wesm alright! now that the integration tests are passing, I added backwards-compatibility for Arrow files written before 0.8, re-enabled the datetime tests, and removed the generated arrow files from the performance tests. Should be good to go pending this last CI build. |
|
Sweet thanks! So after we get the release scripts set up, we can release this, but one problem to be aware of is that the library in its current state cannot read 0.7.1 binary data and possibly not 0.8.0 (in its final form) binary data. Hopefully we can get ARROW-1785 sorted out soon |
|
Ah cool I missed that. I think we are good then, so I suggest we cut a JS release ASAP to make sure we've got the process down and then we can release again after 0.8.0 final goes out. I'm available this week to help out with this |
This PR adds a workaround for reading the metadata layout for C++ dictionary-encoded vectors.
I added tests that validate against the C++/Java integration suite. In order to make the new tests pass, I had to update the generated flatbuffers format and add a few types the JS version didn't have yet (Bool, Date32, and Timestamp). It also uses the new `isDelta` flag on DictionaryBatches to determine whether the DictionaryBatch vector should replace or append to the existing dictionary.

I also added a script for generating test arrow files from the C++ and Java implementations, so we don't break the tests updating the format in the future. I saved the generated Arrow files in with the tests because I didn't see a way to pipe the JSON test data through the C++/Java json-to-arrow commands without writing to a file. If I missed something and we can do it all in-memory, I'd be happy to make that change!
This PR is marked WIP because I added an integration test that validates the JS reader reads C++ and Java files the same way, but unfortunately it doesn't. Debugging, I noticed a number of other differences in the buffer layout metadata between the C++ and Java versions. If we go ahead with @jacques-n's comment in ARROW-1693 and remove/ignore the metadata, this test should pass too.
cc @TheNeuralBit