Conversation

@BryanCutler (Member)
Arrow Java Writer now requires an IpcOption for some APIs; this patch fixes the compilation so the Spark integration tests run.

@BryanCutler (Member Author) commented Sep 22, 2019

@kszucs I thought this would be better in the Dockerfile after checking out Spark, but for some reason the patch didn't seem to apply. I tried the following, and it seemed to run, but the file wasn't patched:

COPY integration/spark/ARROW-6429.patch /tmp/
RUN patch -d /spark -p1 -i /tmp/ARROW-6429.patch

Any ideas?
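
(One possible diagnostic, sketched here with the assumption that git is available in the image and /spark is a git checkout; `git apply --check` dry-runs the patch and fails the build loudly if it doesn't apply:)

COPY integration/spark/ARROW-6429.patch /tmp/
# Dry-run first so a non-applying patch fails the build instead of passing silently
RUN cd /spark && git apply --check -v /tmp/ARROW-6429.patch && \
    git apply /tmp/ARROW-6429.patch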

@BryanCutler (Member Author) commented Sep 22, 2019

Never mind, it seems to work now. I'll change it.

@BryanCutler (Member Author)

cc @kszucs @wesm

BryanCutler force-pushed the spark-integration-patch-ARROW-6429 branch from 7c6d150 to 48b2eac on September 22, 2019 05:52
@kszucs (Member) commented Sep 22, 2019

@ursabot crossbow submit docker-spark-integration

@ursabot commented Sep 22, 2019

AMD64 Conda Crossbow Submit (#64318) builder has succeeded.

Revision: dd2483f

Submitted crossbow builds: ursa-labs/crossbow @ ursabot-215

Task: docker-spark-integration | Status: CircleCI

@kszucs (Member) commented Sep 23, 2019

@BryanCutler seems like the build failed with a timeout: "Too long with no output (exceeded 10m0s)"

Any ideas how we could speed up the Spark integration test a bit?

@BryanCutler (Member Author)

> Any ideas how we could speed up the Spark integration test a bit?

@kszucs, I noticed the Java builds for Arrow and Spark are unusually slow. I'm not sure why, but I'll take a look at the settings. Also, the pyspark tests run twice, once against python and once against python3.6, which I think are the same in this image, so we can explicitly test just one; that should save a couple of minutes.

@BryanCutler (Member Author)

Once #5471 is merged, we should be able to get a pass here (as long as there's no timeout).


 # installing java and maven
-ARG MAVEN_VERSION=3.5.4
+ARG MAVEN_VERSION=3.6.2
@BryanCutler (Member Author)

This is the minimum Maven version used by Spark, so setting it here will prevent Spark from downloading its own copy during the build phase.
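
(For context, a rough sketch of how such an ARG typically feeds the install step; the URL layout and install paths here are illustrative, not the actual Dockerfile:)

ARG MAVEN_VERSION=3.6.2
# Pre-install the pinned Maven release so Spark's build doesn't fetch its own copy
RUN wget -q https://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz && \
    tar -xzf apache-maven-${MAVEN_VERSION}-bin.tar.gz -C /opt && \
    ln -s /opt/apache-maven-${MAVEN_VERSION}/bin/mvn /usr/local/bin/mvn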


(echo "Testing PySpark:"; IFS=$'\n'; echo "${SPARK_PYTHON_TESTS[*]}")
python/run-tests --testnames "$(IFS=,; echo "${SPARK_PYTHON_TESTS[*]}")"
python/run-tests --testnames "$(IFS=,; echo "${SPARK_PYTHON_TESTS[*]}")" --python-executables python
@BryanCutler (Member Author)

Spark will look for installed Python versions and test against each of them separately, so setting this makes sure the tests run once, on the default python.
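
(To illustrate the effect of the flag; the test module name below is hypothetical:)

# Without --python-executables, run-tests discovers every interpreter on the
# image (e.g. python and python3.6) and repeats the whole suite for each one.
python/run-tests --python-executables python --testnames pyspark.sql.tests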

@BryanCutler (Member Author)

@kszucs, I made a couple of small adjustments, but I think the reason for the timeout is that Spark can take a long time during assembly, which downloads and assembles all required dependencies. I don't think there is any way to avoid this, since we need to test pyspark, but perhaps there is some way to cache better.

Since this can take a long time and we limit the build output to just warnings, we end up hitting the "too long without output" timeout. Is it possible to increase this by setting no_output_timeout to something like 30 minutes?
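
(For reference, a hedged sketch of what that could look like in a CircleCI 2.x config; the step name and command are illustrative, only the no_output_timeout key is the point:)

- run:
    name: Run Spark integration tests
    command: docker-compose run spark-integration
    no_output_timeout: 30m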

@kszucs (Member) commented Sep 24, 2019

@BryanCutler seems like we can increase that timeout: https://support.circleci.com/hc/en-us/articles/360007188574-Build-has-hit-timeout-limit

I'm going to try pushing and pulling images before and after the build and run, to save on the docker-compose build time.
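
(Roughly along these lines; the image name is hypothetical, and this assumes the compose service declares the same tag under image: and cache_from: so the pulled layers get reused:)

# Pull a previously pushed image so its layers can serve as a build cache
docker pull example/arrow:spark-integration || true
docker-compose build spark-integration
# Push the freshly built image so the next CI run starts warm
docker-compose push spark-integration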

@BryanCutler (Member Author)

Thanks @kszucs, so will you be able to adjust that timeout, or can I do it somewhere from this PR? I'm not sure where the config file is. I'll try to trigger another test, since it should pass now if it doesn't time out.

@BryanCutler (Member Author)

@ursabot crossbow submit docker-spark-integration

@ursabot commented Sep 24, 2019

AMD64 Conda Crossbow Submit (#65162) builder has succeeded.

Revision: 918ab91

Submitted crossbow builds: ursa-labs/crossbow @ ursabot-229

Task: docker-spark-integration | Status: CircleCI

@kszucs (Member) commented Sep 25, 2019

@BryanCutler #5485 should speed things up a bit, and sets no_output_timeout to an hour.

@BryanCutler (Member Author)

@kszucs so should I wait until #5485 is merged to see if we can get a pass here, or go ahead and merge this first?

@kszucs (Member) commented Sep 26, 2019

If you've tried it locally then we can go ahead and merge this.

@kszucs (Member) left a comment

The patch LGTM.

BryanCutler deleted the spark-integration-patch-ARROW-6429 branch on September 26, 2019 19:52
@BryanCutler (Member Author)

Merged to master, thanks for reviewing @kszucs!
