
Release xgboost 1.2 with GPU support #134

Merged: 4 commits merged on Sep 15, 2020


edwardjkim
Contributor

Description of changes:

This CR upgrades XGBoost to 1.2 and enables GPU support.

  • When the image is built with xgboost 1.1, many integration tests fail with an XGBoost feature-mismatch error, e.g.,
    xgboost.core.XGBoostError: [16:47:33] /workspace/src/learner.cc:1062: Check failed: learner_model_param_.num_feature == p_fmat->Info().num_col_ (9 vs. 8) : Number of columns does not match number of features in booster.
    
    This is due to a bug in 1.1 (GitHub issues: "xgboost 1.1.1 pred failed, while 0.90 pred success" dmlc/xgboost#5841 and "Regression demo is broken" dmlc/xgboost#5709). It has been fixed in 1.2 ("Fix prediction heuristic" dmlc/xgboost#5955). From the 1.2 release notes:
    Restore capability to run prediction when the test input has fewer features than the training data (#5955). This capability is necessary to support predicting with LIBSVM inputs. The previous release (1.1) had broken this capability, so we restore it in this version with better tests.
    
    Since it doesn't sound like upstream XGBoost will backport this fix to 1.1, we release 1.2 in this CR.
  • New in XGBoost 1.1 & 1.2
  • MLIO needs to be upgraded. The latest version of MLIO is v0.6. However, the conda packages for v0.5 and v0.6 add ~3 GB uncompressed (~1 GB compressed) to the Docker image, mainly because of the long list of dependencies for the image reader (e.g., ffmpeg, OpenCV) newly added in v0.5, and this increases training time by ~1 minute. Therefore, the Dockerfile is optimized and rewritten to install MLIO from source. The final image size is 1326.14 MB (compressed) with XGBoost 1.2, the MLIO upgrade, and GPU support, compared to 1225.65 MB (compressed) for 1.0-1-cpu-py3.
  • GPU support
    • We could install the full CUDA Toolkit, but doing so would increase the image size by around 700 MB (compressed). The proposed base image nvidia/cuda:${CUDA_VERSION}-base-ubuntu${UBUNTU_VERSION} is a small image that contains only a minimal set of CUDA runtime files.
    • To enable GPU training, customers will have to set the parameter tree_method: gpu_hist and use a GPU instance type (e.g., p2.xlarge, p3.2xlarge).
  • With GPU support in the same image as the CPU image, it is no longer necessary to append the architecture to the image tag. Since we dropped Python 2 support, the -cpu-py3 in the framework version is also redundant, so this CR proposes to drop the -<architecture>-<python version> suffix. (However, we will keep the old tag format in the deployment pipelines for backwards compatibility. That is, we will tag the same image with two tags: 1.2-1 and 1.2-1-cpu-py3.)
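To illustrate why LIBSVM input runs into the 1.1 mismatch error: LIBSVM is a sparse text format in which zero-valued features are simply omitted, so a loader that is not given an explicit column count can only infer it as the largest feature index seen plus one. The sketch below is a hypothetical, stdlib-only helper (`infer_num_col`) with made-up data, assuming zero-based feature indices; it is not XGBoost's own parser. If no test row happens to contain the last feature, the inferred width comes out one smaller than the training width, which matches the "9 vs. 8" shape of the error quoted in the description.

```python
def infer_num_col(libsvm_lines):
    """Infer the column count of LIBSVM-format rows (hypothetical helper).

    Each row looks like "<label> <index>:<value> ...". Zero-valued features
    are omitted, so the width can only be inferred as max index + 1
    (assuming zero-based indices).
    """
    max_index = -1
    for line in libsvm_lines:
        # Skip the label (first token) and read the index:value pairs.
        for pair in line.split()[1:]:
            index, _value = pair.split(":", 1)
            max_index = max(max_index, int(index))
    return max_index + 1


# Made-up data: feature 8 is non-zero in training but zero in every test row.
train = [
    "1 0:0.5 3:1.2 8:0.7",
    "0 1:2.0 5:0.1",
]
test = [
    "1 0:0.4 3:1.1",
    "0 2:0.3 7:0.9",
]

print(infer_num_col(train), infer_num_col(test))  # 9 8
```

A booster trained on the first set expects 9 columns while the test matrix reports 8; per the release note quoted above, 1.2 restores prediction in this "fewer features than training" case.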

Testing: tox, integration tests

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

# Python won't try to write .pyc or .pyo files on the import of source modules
ENV PYTHONDONTWRITEBYTECODE=1
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONUNBUFFERED=1
Contributor

Will this have any impact on performance? Maybe a minor one?

Contributor Author

Interesting point. Those environment variables were kept from previous versions, e.g., https://github.com/aws/sagemaker-xgboost-container/blob/master/docker/1.0-1/base/Dockerfile.cpu#L32, and I didn't think to remove them.
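On the performance question, a hedged note: PYTHONDONTWRITEBYTECODE only stops the interpreter from writing .pyc caches (existing caches are still read), so modules without a cache are recompiled from source at each interpreter start, which is a small startup cost rather than a per-prediction cost. PYTHONUNBUFFERED typically costs some write throughput only for processes that print heavily. The sketch below (the helper name `dont_write_bytecode_in_child` is hypothetical) checks the first variable's effect by starting a fresh interpreter with and without it and reading `sys.dont_write_bytecode`.

```python
import os
import subprocess
import sys


def dont_write_bytecode_in_child(env_value):
    """Return sys.dont_write_bytecode as seen by a fresh interpreter.

    env_value=None leaves PYTHONDONTWRITEBYTECODE unset; a non-empty string
    sets it, as `ENV PYTHONDONTWRITEBYTECODE=1` does in the Dockerfile.
    """
    env = dict(os.environ)
    env.pop("PYTHONDONTWRITEBYTECODE", None)  # start from a known state
    if env_value is not None:
        env["PYTHONDONTWRITEBYTECODE"] = env_value
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.dont_write_bytecode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


print(dont_write_bytecode_in_child(None), dont_write_bytecode_in_child("1"))  # False True
```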

balajitummala (Contributor) left a comment

Left a minor comment
