
Conversation

@trxcllnt
Contributor

@trxcllnt trxcllnt commented Nov 9, 2017

This PR adds a workaround for reading the metadata layout for C++ dictionary-encoded vectors.

I added tests that validate against the C++/Java integration suite. In order to make the new tests pass, I had to update the generated flatbuffers format and add a few types the JS version didn't have yet (Bool, Date32, and Timestamp). It also uses the new isDelta flag on DictionaryBatches to determine whether the DictionaryBatch vector should replace or append to the existing dictionary.

I also added a script for generating test arrow files from the C++ and Java implementations, so we don't break the tests when updating the format in the future. I saved the generated Arrow files alongside the tests because I didn't see a way to pipe the JSON test data through the C++/Java json-to-arrow commands without writing to a file. If I missed something and we can do it all in-memory, I'd be happy to make that change!

This PR is marked WIP because I added an integration test that validates that the JS reader reads C++ and Java files the same way, but unfortunately it doesn't yet. While debugging, I noticed a number of other differences in the buffer layout metadata between the C++ and Java versions. If we go ahead with @jacques-n's comment in ARROW-1693 and remove/ignore the metadata, this test should pass too.

cc @TheNeuralBit

@trxcllnt trxcllnt changed the title WIP ARROW-1693: Fix reading C++ dictionary-encoded vectors WIP ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors Nov 9, 2017
@trxcllnt trxcllnt changed the title WIP ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors (WIP) Nov 9, 2017
@wesm
Member

wesm commented Nov 9, 2017

Want to go ahead and ignore the vector layout metadata, per ARROW-1785? I will wait a day or so for further feedback to circulate, then proceed with removing this metadata. We'll need to update the Flatbuffers files in JS again as part of this; I will give you write access on my fork so you can push directly to the PR branch as needed.

@trxcllnt
Contributor Author

trxcllnt commented Nov 9, 2017

@wesm yep, working on this tonight. do you mind if I add to this PR vs starting a new branch?

@wesm
Member

wesm commented Nov 9, 2017

no problem, feel free to keep adding here. I will take a little time to review also

@trxcllnt
Contributor Author

trxcllnt commented Nov 12, 2017

@wesm I believe this branch is ready. We should revalidate datetime and dictionary once we can get Pandas to generate a CSV from the test data. The integration tests are finally passing, so I feel good about this one:

Test Suites: 4 passed, 4 total
Tests:       404 passed, 404 total
Snapshots:   113940 passed, 113940 total

The last commit re-enables the node_js job in Travis so we can verify the above.

@trxcllnt trxcllnt force-pushed the generate-js-test-files branch from 64e318c to e7ae4c7 Compare November 12, 2017 17:25
@wesm wesm changed the title ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors (WIP) ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors Nov 12, 2017
@wesm
Member

wesm commented Nov 12, 2017

awesome, thanks @trxcllnt! I'm going to work through the patch to leave any comments that jump out but this is really exciting

Member

Can this version number here be pulled from the environment / pom file, by any chance?

Contributor Author

I included this in my response below

Member

Simple way to make this file path more easily configurable / less hard-coded?

Contributor Author

I included this in my response below

Member

Aside: for ASF release purposes, what we're going to want is one script that produces a tarball of the JS project sufficient to publish to NPM afterward, and then a script inside the tarball that can publish the project to NPM. That way, the signed ASF artifact we upload to SVN will have everything that's needed to publish the project to NPM.

Contributor Author

@wesm That makes sense. The way I have things set up, we compile and publish multiple modules to npm:

  • one large-ish module that you can get via npm install apache-arrow
  • the rest as smaller/specialized modules under the @apache-arrow npm organization, which can be installed via the formula npm install @apache-arrow/<target-name>. For example, npm install @apache-arrow/es5-cjs installs the slimmed down ES5/CommonJS target

The `npm run build` command compiles all the output targets to the (gitignored) targets directory. The `lerna publish --yes --skip-git --cd-version $bump --force-publish=*` command publishes all the targets to npm. So from the sound of it, all we need to do is tar up the targets directory along with a shell script that installs and runs `lerna publish`, and we're good to go? If so, I can do that tonight.

Member

Seems reasonable. The other side of this is creating the signed tarball for voting, and making sure the tarball is sufficient for the post-release upload to NPM. We'll need to copy some files from the root directory (like the license and notice files). I can help with this

Contributor Author

@trxcllnt trxcllnt Nov 14, 2017

The build also copies extra files from the js folder into each of the packages, so we just need to change this line to ['../LICENSE.txt', '../NOTICE.txt', 'README.md']. Do we also need to add the info in js/LICENSE to the top-level notice.txt?

Member

Yes, what's in js/LICENSE should go in the top level LICENSE.txt at the bottom. Then we can copy that one license into the JS tarball

Contributor Author

@wesm done

Member

What do you recommend as a resource for getting up to speed on TypeScript? Where is the line between TypeScript and ES6?

Contributor Author

@trxcllnt trxcllnt Nov 13, 2017

@wesm anything type-related (type annotations, interfaces, generics, and declarations like `type Foo = { foo: string }`) is TypeScript; the rest is ES. If you're curious about an individual feature, you can run the build (`npm run build`) and compare the transpiled output (in the targets directory) with the TS source. We transpile to multiple JS versions and module formats, but it's probably easiest to compare against `targets/es2015/esm` or `targets/esnext/esm`.

TS code-gens/polyfills missing features depending on the target environment. For example, ES5 doesn't have generators, so TS compiles generator functions down into an iterator state machine, while the ES2015 target leaves them as-is. And ES2015 doesn't have async iterator functions, so TS code-gens the async iterator state machine with Promises, while the ESNext target leaves them as-is.
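To make the boundary concrete, here's a contrived sketch (not from the Arrow codebase): the interface and type alias are TypeScript and are erased at compile time, while the generator function is plain ES that the ES5 target rewrites into a state machine and the ES2015/ESNext targets emit unchanged.

// TypeScript-only constructs: erased during compilation, present in no target.
interface Row { id: number; name: string; }
type MaybeRow = Row | null;

// Plain ES2015: the ES5 target compiles this generator into an iterator state
// machine, while the ES2015 and ESNext targets leave it as-is.
function* rows(count: number): IterableIterator<Row> {
    for (let i = 0; i < count; ++i) {
        yield { id: i, name: `row ${i}` };
    }
}

let first: MaybeRow = null;
for (const row of rows(3)) {
    if (first === null) { first = row; }
    console.log(row.id, row.name);
}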

Member

I'm not sure about checking in all these .json and .arrow files, is there some way we can automate their generation as part of the integration testing? Then they don't have to be modified when we expand the integration test suite

Contributor Author

@wesm yes, that would be ideal.

I generated the JSON from the Python integration tests like this:
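# Assumption: run from a sibling of the js/ directory (e.g. arrow/integration) so the relative paths below resolve.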
from integration_test import generate_nested_case
from integration_test import generate_decimal_case
from integration_test import generate_datetime_case
from integration_test import generate_primitive_case
from integration_test import generate_dictionary_case

generate_nested_case().write("../js/test/arrows/json/nested.json")
generate_decimal_case().write("../js/test/arrows/json/decimal.json")
generate_datetime_case().write("../js/test/arrows/json/datetime.json")
generate_dictionary_case().write("../js/test/arrows/json/dictionary.json")
generate_primitive_case([7, 10]).write("../js/test/arrows/json/primitive.json")
generate_primitive_case([0, 0, 0]).write("../js/test/arrows/json/primitive-empty.json")
Then I had to manually edit the "primitive.json" file to remove the "binary_nullable" and "binary_nonnullable" columns, because the C++ command fails if they're present:
$ ../cpp/build/release/json-integration-test \
  --integration --mode=JSON_TO_ARROW \
  --json=./test/arrows/json/primitive.json \
  --arrow=./test/arrows/cpp/file/primitive.arrow
Found schema: bool_nullable: bool
bool_nonnullable: bool not null
int8_nullable: int8
int8_nonnullable: int8 not null
int16_nullable: int16
int16_nonnullable: int16 not null
int32_nullable: int32
int32_nonnullable: int32 not null
int64_nullable: int64
int64_nonnullable: int64 not null
uint8_nullable: uint8
uint8_nonnullable: uint8 not null
uint16_nullable: uint16
uint16_nonnullable: uint16 not null
uint32_nullable: uint32
uint32_nonnullable: uint32 not null
uint64_nullable: uint64
uint64_nonnullable: uint64 not null
float32_nullable: float
float32_nonnullable: float not null
float64_nullable: double
float64_nonnullable: double not null
binary_nullable: binary
binary_nonnullable: binary not null
utf8_nullable: string
utf8_nonnullable: string not null
Error message: Invalid: Encountered non-hex digit

The unit tests rely heavily on snapshot testing to validate the actual values in the vectors. I manually validated the data in the snapshots against the buffers using pyarrow and pandas, but that approach won't scale. Typically the snapshot files get checked into version control, but now that we have 11k snapshots, the snapshot files are around 21mb. I removed them from the repo because we don't want huge files. Now the CI server generates the snapshots once up front, then validates the compilation targets against those.

This will catch any cases where compiling the JS to different targets leads to failures (e.g. if the minifiers mangle names they weren't supposed to), but since we're not checking in the snapshot files, the CI server won't be able to tell us if a new PR causes a snapshot test to break. We can know that if we run the tests locally, but we can't rely on running the tests locally for each PR before merging.

I'm a bit torn here. On the one hand, I don't want to check in 21mb worth of tests to source control. On the other hand, I don't want to hand-write the 11k assertions that the snapshot tests represent (and would also presumably be many-MBs worth of tests anyway).

I believe git compresses files across the network? And if space-on-disk is an issue, I could add a post-clone script to automatically compress the snapshot files after checkout (about 3mb gzipped). Jest doesn't work with compressed snapshot files out of the box, but I could add some steps to the test runner to decompress the snapshots before running.
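For concreteness, a rough sketch of that pre-test decompression step (hypothetical layout: snapshots checked out as *.snap.gz files and inflated with Node's built-in zlib before Jest runs):

// Hypothetical pre-test step: inflate gzipped Jest snapshots checked out as *.snap.gz.
import * as fs from 'fs';
import * as path from 'path';
import * as zlib from 'zlib';

function decompressSnapshots(dir: string): void {
    for (const entry of fs.readdirSync(dir)) {
        const entryPath = path.join(dir, entry);
        if (fs.statSync(entryPath).isDirectory()) {
            decompressSnapshots(entryPath); // recurse into __snapshots__ folders
        } else if (entry.endsWith('.snap.gz')) {
            const inflated = zlib.gunzipSync(fs.readFileSync(entryPath));
            fs.writeFileSync(entryPath.replace(/\.gz$/, ''), inflated);
        }
    }
}

decompressSnapshots(path.resolve(__dirname, 'test'));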

To your point about using the C++/Java writers to convert the JSON to Arrow buffers on the fly, we should 100% do that. This PR is marginally better since we can at least regenerate the arrow files easily enough, but ideally we don't have them at all and we can pipe them to the node process on the fly, or at a minimum, write to files then clean up after. We'll want a mode for local dev that skips this step, as incurring the JVM overhead to convert JSON to Arrow files is painful for debugging.

I left the code in there (commented out) to draw attention to this idea. I was on a plane when I checked this in and didn't have the bandwidth to debug the CI scripts, otherwise I would have. We can pass the paths to the C++ and Java executables as CLI arguments and/or environment variables, no problem.

I wasn't sure if the executables were already available in CI, or whether I would need to build them before running the JS tests. Building the C++ and Java projects in order to run the JS tests doesn't seem ideal, but I'm fine doing it if we have to.

Member

@wesm wesm Nov 13, 2017

I'm a bit confused by these integration tests, maybe I'm missing something. Shouldn't JavaScript be reading the JSON and comparing what it thinks are the contents of the JSON with the binary coming from Java and C++? Here it looks like it is reading the C++ and Java binary files and comparing them with each other rather than its own version of the truth.

Member

@wesm I think for this PR the goal was more to just replace the binary files in the repo with generated files, and adding tests to make sure our reader works the same for both C++ and Java generated files is just an added bonus. It might be a bit of a misnomer to call it an integration test.

I think the next step will be to add a JSON reader/writer and an Arrow IPC writer so we can integrate with the actual integration tests.

Member

Got it. I think here we should comment exactly what's being tested (since these tests will pass if (and only if?) the Java <-> C++ integration tests are passing). If we can run these particular JS tests only in the Travis CI entry where we are already running the integration_test.py then we don't need to be checking in binary/JSON files to git

Contributor Author

@wesm yes they should. @TheNeuralBit is right, the goal here was to get the JS version reading C++ and Java arrows identically, and this test is how I validate that it does. Right now it doesn't validate that it reads correctly (that's what the other tests do), just that there's no difference between the two. I think the plan is to update this test when we do have the JSON reader/writer in JS

Member

@TheNeuralBit TheNeuralBit left a comment

It's awesome to see support for so many more types, and consistency when reading both C++ and Java-generated files 🎉

Member

Why move the calls to nullableMixin and fieldMixin from here and out to each individual call? Are there some subtle differences in some vectors that I'm missing?

Contributor Author

@TheNeuralBit yeah, so the ES6 class spec states that the `name` property of a class constructor is immutable (and class names also aren't allowed to be computed properties, like `let x = 'myClass'; class [x] extends Foo {}`). And anonymous class names default to "Object", reading as `Object { data: Int32Array }` instead of `Int32Vector` when debugging. While this is ugly and hard to scale if we want to add more mixin behaviors, I figure it's a win for anyone using the library in the real world to see descriptive class names.
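A contrived sketch of the issue (simplified, not the actual Arrow vector classes): a class returned anonymously from a mixin factory has no useful constructor name, while naming the subclass at each call site pins a descriptive one.

// Simplified sketch of why the mixin calls moved out to each subclass definition.
class Vector { constructor(public data: Int32Array) {} }

// The class expression returned here is anonymous.
const nullableMixin = <T extends new (...args: any[]) => Vector>(Base: T) =>
    class extends Base { nullable = true; };

const Anon = nullableMixin(Vector);
console.log(Anon.name);        // "" -- instances debug-print like plain objects

// Naming the subclass at the call site keeps debugger output readable.
class Int32Vector extends nullableMixin(Vector) {}
console.log(Int32Vector.name); // "Int32Vector"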

Member

Ah makes sense, I figured there must be a good reason. This seems like a great application for some kind of preprocessor with macros or a code generator... but I don't know of any that would integrate well with JS build/dev tools

Contributor Author

@TheNeuralBit yeah we could use babel-codegen, babel-preval, or babel-macros if we want. I was hoping to avoid babel if possible, but since we're webpacking the es2015+ UMD bundles anyway, it wouldn't be too much of a headache.

Contributor Author

@trxcllnt trxcllnt Nov 14, 2017

@TheNeuralBit but if we do want to do more compilation steps beyond what the TS compiler does, it'd be neat to also run prepack on the flatbuffers generated code

js/gulpfile.js Outdated
Member

Should createTestData and cleanTestData be uncommented so we can remove the arrow files from the repo? I'm thinking these are probably commented now so that other contributors will be able to run the tests without building the Java and C++ impls - if that's the case, maybe we should separate out integration tests, which require the other libraries, and unit tests, which can be run stand-alone?

Contributor Author

@TheNeuralBit yes, definitely. I put these in to remind us to generate test data on the fly (and how to do it) for the tests, and remove the arrow files from the test directory. Making sure the CI environment had the C++ and Java libs built before the JS tests run was a bit more than I could bite off yesterday on my flight home :)

@wesm
Member

wesm commented Nov 14, 2017

I'm a bit torn here. On the one hand, I don't want to check in 21mb worth of tests to source control. On the other hand, I don't want to hand-write the 11k assertions that the snapshot tests represent (and would also presumably be many-MBs worth of tests anyway).

I believe git compresses files across the network? And if space-on-disk is an issue, I could add a post-clone script to automatically compress the snapshot files after checkout (about 3mb gzipped). Jest doesn't work with compressed snapshot files out of the box, but I could add some steps to the test runner to decompress the snapshots before running.

I guess I'm not quite understanding what snapshot tests accomplish here that normal array comparisons would not. In Java and C++ we have functions that compare the contents of arrays. So when you say hand-writing the snapshot test assertions, what's being tested and why is that the only way to test that behavior? Is there a concern that a programmatic comparison like

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json-integration-test.cc#L180

might not be as strong of an assertion as a UI-based test (what the values from the arrays would actually appear as in the DOM)?

Having the possibility of a single PR bloating the git history by whatever the snap files gzip down to doesn't seem like a good idea. Even having large diffs as the result of automatically generated files on commit isn't ideal

@trxcllnt
Contributor Author

trxcllnt commented Nov 14, 2017

@wesm the Jest docs on snapshot testing highlight its utility for testing React components, but it's really just a form of test code generation. The tests evaluate all combinations of source lib x arrow format (in reality: [c++, java] x [file, stream]) for each of the generated files (nested, simple, decimal, datetime, primitive, primitive-empty, dictionary, and struct_example), so there are quite a few assertions.

Snapshots capture a bit of runtime type info that would otherwise have to be asserted explicitly, for example that calling `uint64Vector.get(i)` returns a `Uint32Array` of two elements:
exports[`readBuffers cpp stream primitive reads each batch as an Array of Vectors 167`] = `
Uint32Array [
  12840890,
  0,
]
`;
They're also helpful for catching regressions (or comparing against pandas) in `Table.toString()`:

exports[`Table cpp file nested toString({ index: true }) prints a pretty Table with an Index column 1`] = `
"Index,                            list_nullable,        struct_nullable
    0,                                     null,       [null,\\"tmo7qBM\\"]
    1,                             [1685103474],      [-583988484,null]
    2,                             [1981297353], [-749108100,\\"yGRfkmw\\"]
    3,     [-2032422645,-2111456179,-895490422],       [820115077,null]
    4,                                     null,                   null
    5,             [null,-434891054,-864560986],                   null
    6,                                     null,  [986507083,\\"U6xvhr7\\"]
    7,                                     null,                   null
    8,                                     null,            [null,null]
    9,                                     null,                   null
   10,                             [-498865952],                   null
   11,                                     null,       [null,\\"ctyWPJf\\"]
   12,                                     null,            [null,null]
   13, [-1076160763,-792439045,-656549144,null],                   null
   14,                                     null,      [1234093448,null]
   15,                   [null,null,1882910932],                   null
   16,                                     null,  [934007407,\\"9QUyEm5\\"]"
`;

It also gives reviewers a chance to see what the tests produce, so if `get` on a uint64 vector starts returning a `Long` object instead of a `Uint32Array`, we can flag that in a code review. That said, it sounds like the JSON reader should be able to do most of this validation.
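For anyone unfamiliar with the mechanism, here's a generic sketch of a snapshot assertion (not the actual Arrow test code): the first run records the serialized value in a __snapshots__/*.snap file, and later runs fail if the received value drifts from it.

// Generic Jest snapshot sketch -- equivalent to hand-writing
// expect(...).toEqual(new Uint32Array([12840890, 0])) for every asserted value.
function uint64AsPair(lo: number, hi: number): Uint32Array {
    return new Uint32Array([lo, hi]);
}

test('uint64 values come back as two-element Uint32Arrays', () => {
    expect(uint64AsPair(12840890, 0)).toMatchSnapshot();
});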

@wesm
Member

wesm commented Nov 14, 2017

It sounds like this is useful for catching code regressions from PR to PR, but it clashes with my sensibilities about testing data libraries, which is that we should have explicit unit tests asserting the correct behavior. In the example you gave ("Snapshots capture a bit of runtime type info that would otherwise have to be asserted explicitly, for example that calling uint64Vector.get(i) returns a Uint32Array of two elements:"), my prior would be that this behavior should be asserted in a unit test.

I'm no expert, so there may be things I'm missing -- are some of the test assertions dependent on the flavor of the deployment target?

@trxcllnt
Contributor Author

Snapshots are just a different, data-centric way of writing assertions. Give it a lot of test data, validate the test results once, then compare diffs after that. If you can eyeball test results and know whether it works, then the computer codegens all the dreadful bits about comparing types, values, etc. (even when it's late and you might otherwise forget to test an edge case).

I'm no expert, so there may be things I'm missing -- are some of the test assertions dependent on the flavor of the deployment target?

No, the assertions should be identical regardless of the compilation target -- they're generated once at the beginning, then all the targets are compared against the same snapshots.

I may have mentioned this before, but they've also helped catch minification bugs. For example, back when we did return Long instances, Closure Compiler minified the class name down to something like "zw", so the snapshot tests failed for just the ES5/UMD target.

But on the whole I can't argue with your position. All I can say is I'm probably pretty lazy by normal standards, so I try to make my computer do as much of my homework as possible.

@trxcllnt
Contributor Author

trxcllnt commented Nov 14, 2017

Maybe another way to phrase it is, disk and network are cheap, my/our time is not. ;-)

edit: I misread the snapshot count; we have 113,940 snapshots, not 11,394

@wesm
Member

wesm commented Nov 14, 2017

Snapshots are just a different, data-centric way of writing assertions. Give it a lot of test data, validate the test results once, then compare diffs after that. If you can eyeball test results and know whether it works, then the computer codegens all the dreadful bits about comparing types, values, etc. (even when it's late and you might otherwise forget to test an edge case).

This "validate the tests results once" part is where I'm getting lost. How do you know whether anything is correct if you don't write down what you expect to be true? I can help with rallying the troops to write more tests. I am a bit bandwidth constrained at the moment with all the 0.8.0 stuff in progress, but I am hopeful that some others can get involved and this will also help with beating on the API and finding rough edges. cc @leifwalsh @scottdraves

@trxcllnt
Contributor Author

trxcllnt commented Nov 14, 2017

@wesm This "validate the tests results once" part is where I'm getting lost. How do you know whether anything is correct if you don't write down what you expect to be true?

Ah right, in this case the JSON files are the initial source of truth. I compared the snapshots against the Arrow files read via pandas/pyarrow, and it looked correct. After this (assuming stable test data), the snapshots are the source of truth. If we decide to change the test data, then we have to re-validate that the snapshots are what we expect them to be.

But I want to stress, I'm not against doing it differently. I'm also bandwidth constrained, and snapshots get high coverage with minimal effort. It sounds like the JSON reader should provide all the same benefits as snapshot testing. From that perspective, I see snapshots as a stop-gap until the JS JSON reader is done (unless there's a way we can validate columns with the C++ or Java JSON readers from the JS tests?)

With that in mind, I agree it's best not to commit the snapshots to the git history if we're just going to remove them once the JSON reader is ready. In the interim, I don't mind validating any new JS PRs against my local snapshots, as the volume of JS PRs isn't that high yet.

@wesm
Member

wesm commented Nov 15, 2017

Yes, definitely, we are in agreement. We should push for a JSON reader ASAP -- following the C++ reader as a guide, I do not think it is that big of a project, to be honest, when you consider all the hardship of dealing with parsing JSON in C++, which is a complete non-issue in JavaScript.

@trxcllnt
Contributor Author

@wesm that sounds good to me. In the meantime, can we get this PR merged, finish the ASF release scripts, and push a new version to npm? I'm at the point where not having the latest on npm is going to be a problem for projects at work soon; @TheNeuralBit may be feeling this too.

@wesm
Member

wesm commented Nov 15, 2017

That's good for me. As soon as the release scripts are good to go I can conduct the release vote on the mailing list. We can close the vote in less than the usual 72 hours so long as we get 3 PMC votes. So we'll need a quick "here's how to verify the release candidate" blurb to direct people to when we start the release vote.

@wesm
Member

wesm commented Nov 16, 2017

Does anything still need to be done on this branch? If we are still not close to being able to cut a JS release by early next week I will rearrange my priorities to help out

@trxcllnt
Contributor Author

@wesm nope, the only thing left to do is the ASF release scripts, I think.

@trxcllnt
Contributor Author

@wesm I understand you may be busy, so do you mind if I go ahead and merge this?

@wesm
Member

wesm commented Nov 17, 2017

Sorry, I have been dragging my feet because I'm not really on board with checking in data files that can be generated as part of CI. Per the Slack conversation it seems there are some roadblocks, so I'm available as needed today and tomorrow to get this sorted out.

@wesm
Member

wesm commented Nov 17, 2017

Sorry that I missed the thing. I will figure out what's going on here:

Then I had to manually edit the "primitive.json" file to remove the "binary_nullable" and "binary_nonnullable" columns, because the C++ command fails if they're present:

$ ../cpp/build/release/json-integration-test \
  --integration --mode=JSON_TO_ARROW \
  --json=./test/arrows/json/primitive.json \
  --arrow=./test/arrows/cpp/file/primitive.arrow
Found schema: bool_nullable: bool
bool_nonnullable: bool not null
int8_nullable: int8
int8_nonnullable: int8 not null
int16_nullable: int16
int16_nonnullable: int16 not null
int32_nullable: int32
int32_nonnullable: int32 not null
int64_nullable: int64
int64_nonnullable: int64 not null
uint8_nullable: uint8
uint8_nonnullable: uint8 not null
uint16_nullable: uint16
uint16_nonnullable: uint16 not null
uint32_nullable: uint32
uint32_nonnullable: uint32 not null
uint64_nullable: uint64
uint64_nonnullable: uint64 not null
float32_nullable: float
float32_nonnullable: float not null
float64_nullable: double
float64_nonnullable: double not null
binary_nullable: binary
binary_nonnullable: binary not null
utf8_nullable: string
utf8_nonnullable: string not null
Error message: Invalid: Encountered non-hex digit

@wesm
Member

wesm commented Nov 17, 2017

This doesn't fail for me locally:

$ ../cpp/build/debug/json-integration-test --integration --json=/tmp/tmp0jga4tt5/generated_primitive.json --arrow=foo.arrow --mode=JSON_TO_ARROW
Found schema: bool_nullable: bool
bool_nonnullable: bool not null
int8_nullable: int8
int8_nonnullable: int8 not null
int16_nullable: int16
int16_nonnullable: int16 not null
int32_nullable: int32
int32_nonnullable: int32 not null
int64_nullable: int64
int64_nonnullable: int64 not null
uint8_nullable: uint8
uint8_nonnullable: uint8 not null
uint16_nullable: uint16
uint16_nonnullable: uint16 not null
uint32_nullable: uint32
uint32_nonnullable: uint32 not null
uint64_nullable: uint64
uint64_nonnullable: uint64 not null
float32_nullable: float
float32_nonnullable: float not null
float64_nullable: double
float64_nonnullable: double not null
binary_nullable: binary
binary_nonnullable: binary not null
utf8_nullable: string
utf8_nonnullable: string not null

@wesm
Member

wesm commented Nov 19, 2017

I found the problem -- one of the primitive integration test files was being clobbered and not run, which was suppressing a failure that should have been raised a long time ago.

In the meantime, there was also a regression from the Java refactor, and we are no longer able to fully read unsigned integer types. I will hack the integration tests for now and open a JIRA about fixing it.

Here's an example of trying to read a uint16 vector:

16:49:51.051 [main] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.ratio: 8
Error accessing files
Numeric value (65350) out of range of Java short
 at [Source: /tmp/tmpwgopllpl/generated_primitive.json; line: 1111, column: 18]
16:49:51.065 [main] ERROR org.apache.arrow.tools.Integration - Error accessing files
com.fasterxml.jackson.core.JsonParseException: Numeric value (65350) out of range of Java short

@wesm
Member

wesm commented Nov 19, 2017

OK, integration tests should pass now, fingers crossed. I will merge once the build is green. Thanks @trxcllnt and @TheNeuralBit for the patience; it is appreciated.

@trxcllnt trxcllnt force-pushed the generate-js-test-files branch from 9393b42 to c86dc02 Compare November 19, 2017 22:54
@trxcllnt
Contributor Author

@wesm awesome, thanks! After my latest commit, the integration tests all pass for me locally. Are you fine with this PR as-is, or should I close it and open one from a new branch without the test data commits?

@wesm
Member

wesm commented Nov 19, 2017

The PR tool should squash out those commits, so I don't think it's a problem. I'll let you know if I run into any issues after the build runs

@trxcllnt trxcllnt force-pushed the generate-js-test-files branch from c86dc02 to c72134a Compare November 20, 2017 00:03
@wesm
Member

wesm commented Nov 20, 2017

There was a non-deterministic Plasma failure in the C++/Python entry, but the integration test entry also failed:

[00:23:46] Starting 'test:es5:umd'...
 FAIL  test/table-tests.ts
  ● Test suite failed to run
    Cannot find module '/home/travis/build/apache/arrow/js/targets/es5/umd/Arrow' from 'Arrow.ts'
      
      at Resolver.resolveModule (node_modules/jest-resolve/build/index.js:191:17)
      at Object.<anonymous> (test/Arrow.ts:50:17)
 FAIL  test/reader-tests.ts
  ● Test suite failed to run
    Cannot find module '/home/travis/build/apache/arrow/js/targets/es5/umd/Arrow' from 'Arrow.ts'
      
      at Resolver.resolveModule (node_modules/jest-resolve/build/index.js:191:17)
      at Object.<anonymous> (test/Arrow.ts:50:17)
 FAIL  test/integration-tests.ts
  ● Test suite failed to run
    Cannot find module '/home/travis/build/apache/arrow/js/targets/es5/umd/Arrow' from 'Arrow.ts'
      
      at Resolver.resolveModule (node_modules/jest-resolve/build/index.js:191:17)
      at Object.<anonymous> (test/Arrow.ts:50:17)
 FAIL  test/vector-tests.ts
  ● Test suite failed to run
    Cannot find module '/home/travis/build/apache/arrow/js/targets/es5/umd/Arrow' from 'Arrow.ts'
      
      at Resolver.resolveModule (node_modules/jest-resolve/build/index.js:191:17)
      at Object.<anonymous> (test/Arrow.ts:50:17)
Test Suites: 4 failed, 4 total
Tests:       0 total
Snapshots:   0 total
Time:        2.409s, estimated 24s
Ran all test suites.
[00:23:49] 'test:es5:umd' errored after 2.64 s
[00:23:49] Error: exited with error code: 1
    at ChildProcess.onexit (/home/travis/build/apache/arrow/js/node_modules/end-of-stream/index.js:39:36)
    at ChildProcess.emit (events.js:159:13)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:209:12)
[00:23:49] 'test' errored after 3.42 min
npm ERR! Test failed.  See above for more details.

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Nov 20, 2017

@wesm yeah, looks like Closure Compiler threw an exception building the ES5 UMD target: https://travis-ci.org/apache/arrow/jobs/304499884#L4649. I'm not certain, but it could be related to the integration tests running with JDK7 instead of 8. I'll switch the job to use the JS version of the Closure Compiler, which, while slower, won't be affected by Java externalities.

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Nov 20, 2017

@wesm actually this might be tough -- the JS version of closure-compiler is a bit outdated and broken, and the Java version hasn't supported Java 7 since October.

I don't want to skip running the tests on the ES5 UMD bundle, as that's the lowest-common-denominator for anyone wanting to experiment with Arrow in the browser, and the integration tests validate that public methods and properties don't get minified away.

Is it possible to update the integration job to openjdk8 (like the java version)? If not, I can create a sibling integration-java8 job that includes the JS tests.

@wesm
Member

wesm commented Nov 20, 2017

Totally fine to move the integration tests to jdk8

@trxcllnt trxcllnt force-pushed the generate-js-test-files branch from 0a84b23 to 5660eb3 Compare November 20, 2017 06:26
@trxcllnt
Contributor Author

trxcllnt commented Nov 20, 2017

@wesm alright! Now that the integration tests are passing, I added backwards compatibility for Arrow files written before 0.8, re-enabled the datetime tests, and removed the generated arrow files from the performance tests. Should be good to go pending this last CI build.

@wesm wesm changed the title ARROW-1693: [JS] Fix reading C++ dictionary-encoded vectors ARROW-1693: [JS] Expand JavaScript implementation, build system, fix integration tests Nov 20, 2017
@wesm
Member

wesm commented Nov 20, 2017

Sweet, thanks! So after we get the release scripts set up, we can release this, but one problem to be aware of is that the library in its current state cannot read 0.7.1 binary data, and possibly not 0.8.0 binary data (in its final form). Hopefully we can get ARROW-1785 sorted out soon.

@wesm wesm closed this in b3a3a74 Nov 20, 2017
@trxcllnt
Contributor Author

@wesm I added 4811129 to support reading <= 0.7.1 buffers. I tested on the previous arrow files in the tests, plus a few I generated in pyarrow locally. Is there anything else we need to do on that?

@wesm
Member

wesm commented Nov 20, 2017

Ah, cool, I missed that. I think we are good then, so I suggest we cut a JS release ASAP to make sure we've got the process down, and then we can release again after 0.8.0 final goes out. I'm available this week to help out with this.
