Conversation
Code Review
This pull request introduces GPU support for containers, improves the Ray installation process within containers, and adds unit tests for the image_uri plugin. The changes are generally good, but I have some feedback primarily focused on security and code maintainability.
My main concerns are:
- The use of `--privileged` for GPU containers poses a significant security risk.
- The GPU setup logic in `image_uri.py` has a lot of duplicated code and relies on hardcoded paths, which could be brittle.
- There's an instance of `subprocess.run` with `shell=True`, which is not a best practice (see the sketch below).
- Some debugging `print` statements have been left in the code.
I've provided specific comments and suggestions to address these points.
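To illustrate the `shell=True` point, here is a minimal sketch of invoking a command with an argument list instead; the command and the `wheel_path` variable are hypothetical, not the PR's actual call site.

```python
import subprocess

# Hypothetical example; the actual command in the PR may differ.
wheel_path = "/tmp/ray-wheel.whl"

# Instead of: subprocess.run(f"pip install {wheel_path}", shell=True)
# pass an argument list so no shell is spawned and no quoting is needed.
subprocess.run(["pip", "install", wheel_path], check=True)
```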
```python
# Use sets to store unique mount destinations
volume_mounts: Set[str] = set()
device_mounts: Set[str] = set()
mount_commands = ["--privileged"]
```
Using `--privileged` grants the container almost all the capabilities of the host machine, which is a significant security risk. Since you are already explicitly mounting the necessary devices and libraries, could the `--privileged` flag be avoided? It's recommended to use more granular permissions, like specific `--cap-add` flags, instead of giving full privileges.
| mount_commands = ["--privileged"] | |
| mount_commands = [] |
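As a rough illustration of the granular-permissions suggestion, here is a sketch of building the flag list with specific `--cap-add`, `--device`, and `--volume` options instead of `--privileged`; the capability name and helper signature are assumptions, not the PR's actual code.

```python
from typing import List, Set

def build_gpu_flags(volume_mounts: Set[str], device_mounts: Set[str]) -> List[str]:
    """Sketch: grant only what the GPU workload needs (names are illustrative)."""
    flags: List[str] = []
    # Assumption: adjust to the minimal capability set actually required.
    flags.append("--cap-add=SYS_ADMIN")
    for device in sorted(device_mounts):  # e.g. /dev/nvidia0, /dev/nvidiactl
        flags.append(f"--device={device}")
    for volume in sorted(volume_mounts):  # e.g. /usr/lib64/libcuda.so:/usr/lib64/libcuda.so:ro
        flags.append(f"--volume={volume}")
    return flags
```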
```python
# Use glob patterns to discover and mount NVIDIA libraries dynamically
logger.info("Mounting gpu devices and drivers")
```
This large block of code for GPU support contains many hardcoded paths and glob patterns for NVIDIA drivers and libraries. This approach can be brittle, as it depends on a specific driver installation layout which might vary across different Linux distributions or driver versions. Have you considered leveraging existing tools like `nvidia-container-toolkit` or `nvidia-container-cli` to get the required mount paths and devices? This would make the implementation more robust and less dependent on hardcoded paths.
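For reference, a minimal sketch of what querying the toolkit could look like, assuming `nvidia-container-cli` (from libnvidia-container) is installed on the host; the parsing and return shape are illustrative only.

```python
import os
import subprocess
from typing import Set, Tuple

def discover_nvidia_mounts() -> Tuple[Set[str], Set[str]]:
    """Ask nvidia-container-cli for driver device nodes and libraries (sketch)."""
    devices: Set[str] = set()
    volumes: Set[str] = set()
    result = subprocess.run(
        ["nvidia-container-cli", "list"],  # argument list, no shell=True
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        path = line.strip()
        if not path or not os.path.exists(path):
            continue
        if path.startswith("/dev/"):
            devices.add(path)
        else:
            volumes.add(f"{path}:{path}:ro")  # read-only bind mount spec
    return devices, volumes
```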
Force-pushed from cdf5b5b to 92a6fa4.
```python
container_command.append(runtime_env_constants.CONTAINER_ENV_PLACEHOLDER)

if not os.getenv("NVIDIA_VISIBLE_DEVICES"):
```
It would be better to define a constant for `"NVIDIA_VISIBLE_DEVICES"`.
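A minimal sketch of that suggestion; the constant name and its module are hypothetical.

```python
import os

# e.g. in runtime_env_constants (hypothetical module/name)
NVIDIA_VISIBLE_DEVICES_ENV_VAR = "NVIDIA_VISIBLE_DEVICES"

if not os.getenv(NVIDIA_VISIBLE_DEVICES_ENV_VAR):
    ...  # GPU devices not restricted; proceed with GPU setup
```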
```python
# mount ray package site path
host_site_packages_path = get_ray_site_packages_path()
# Mount only ray package path.
# Do NOT overwrite podmans' site packages because it may include necessary packages.
```
Do not overwrite podman's site packages because it may include necessary packages.
```python
context.env_vars = try_update_runtime_env_vars(
    context.env_vars, redirected_pyenv_folder
)
```
Is this modification unnecessary?
| container_command.append("--cap-add=AUDIT_WRITE") | ||
| else: | ||
| # GPU mode | ||
| # Use glob patterns to discover and mount NVIDIA libraries dynamically |
Use glob patterns to discover and mount NVIDIA libraries dynamically.
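For context, a rough sketch of glob-based discovery; the patterns are illustrative and vary across distributions and driver versions, which is exactly the brittleness noted above.

```python
import glob
from typing import Set

# Illustrative patterns only; real driver layouts differ per distro/version.
NVIDIA_LIB_PATTERNS = [
    "/usr/lib/x86_64-linux-gnu/libnvidia-*.so*",
    "/usr/lib64/libnvidia-*.so*",
    "/usr/lib/x86_64-linux-gnu/libcuda*.so*",
]

def collect_nvidia_library_mounts() -> Set[str]:
    mounts: Set[str] = set()
    for pattern in NVIDIA_LIB_PATTERNS:
        for path in glob.glob(pattern):
            mounts.add(f"{path}:{path}:ro")  # read-only bind mount
    return mounts
```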
```python
if os.path.exists(path):
    volume_mounts.add(f"{path}:{path}:ro")

# Define all NVIDIA library and file patterns
```
This actually supports Ant's vGPU; it is an internal feature.
Force-pushed from 4bc15a3 to ce29f05.
Force-pushed from ce29f05 to 74e7241.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no more activity in the 14 days since it was marked stale. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution!
Description
- Add GPU support for containers.
- Improve the Ray installation process within containers.
- Add unit tests for the `image_uri` plugin.