Auto-configure AMDGPU_ASIC_ID_TABLE_PATH on ROCm startup#4060
Conversation
Summary of ChangesHello @danielhanchen, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Unsloth library's compatibility and user experience on ROCm-enabled systems. It introduces an automated mechanism to correctly configure the AMDGPU_ASIC_ID_TABLE_PATH environment variable, which is crucial for accurate AMD GPU name resolution. This eliminates a common pain point where users previously encountered "file not found" errors and had to resort to manual, privileged operations to fix their GPU identification. The solution is designed to be robust, respecting existing user settings and executing early in the application lifecycle to ensure proper functionality. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces an auto-configuration mechanism for AMDGPU_ASIC_ID_TABLE_PATH on ROCm environments, which is a valuable improvement for GPU name resolution without requiring root privileges. The changes are well-structured, implementing a new function configure_amdgpu_asic_id_table_path in unsloth/import_fixes.py and integrating it early in unsloth/__init__.py. The approach correctly prioritizes existing user configurations and includes robust detection for ROCm environments and candidate paths. The use of del to clean up the imported function in __init__.py is a good practice for namespace management.
| try: | ||
| if configured_path.is_file(): | ||
| return str(configured_path) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
The try...except Exception: pass block is overly broad. While it's good to handle potential issues with configured_path.is_file(), catching a generic Exception can mask other programming errors or unexpected issues. It's generally better to catch more specific exceptions like OSError or PermissionError that Path.is_file() might raise. If other exceptions are truly intended to be ignored, they should be logged for debugging purposes.
| try: | |
| if configured_path.is_file(): | |
| return str(configured_path) | |
| except Exception: | |
| pass | |
| try: | |
| if configured_path.is_file(): | |
| return str(configured_path) | |
| except OSError as e: | |
| logger.debug(f"Unsloth: Error checking configured AMDGPU_ASIC_ID_TABLE_PATH: {e}") |
| try: | ||
| if candidate.is_file(): | ||
| os.environ[_AMDGPU_ASIC_ID_TABLE_PATH_ENV] = str(candidate) | ||
| if UNSLOTH_ENABLE_LOGGING: | ||
| logger.info( | ||
| f"Unsloth: Set {_AMDGPU_ASIC_ID_TABLE_PATH_ENV}={candidate}" | ||
| ) | ||
| return str(candidate) | ||
| except Exception: | ||
| continue |
There was a problem hiding this comment.
Similar to the previous comment, the try...except Exception: continue block is very broad. Catching a generic Exception can hide important issues that might occur during the candidate.is_file() check or when setting the environment variable. It's recommended to catch more specific exceptions (e.g., OSError for file system operations) or at least log the exception details if the intent is to ignore them and continue.
| try: | |
| if candidate.is_file(): | |
| os.environ[_AMDGPU_ASIC_ID_TABLE_PATH_ENV] = str(candidate) | |
| if UNSLOTH_ENABLE_LOGGING: | |
| logger.info( | |
| f"Unsloth: Set {_AMDGPU_ASIC_ID_TABLE_PATH_ENV}={candidate}" | |
| ) | |
| return str(candidate) | |
| except Exception: | |
| continue | |
| try: | |
| if candidate.is_file(): | |
| os.environ[_AMDGPU_ASIC_ID_TABLE_PATH_ENV] = str(candidate) | |
| if UNSLOTH_ENABLE_LOGGING: | |
| logger.info( | |
| f"Unsloth: Set {_AMDGPU_ASIC_ID_TABLE_PATH_ENV}={candidate}" | |
| ) | |
| return str(candidate) | |
| except OSError as e: | |
| logger.debug(f"Unsloth: Error checking candidate AMDGPU_ASIC_ID_TABLE_PATH '{candidate}': {e}") | |
| continue |
for more information, see https://pre-commit.ci
) * Auto-configure AMDGPU_ASIC_ID_TABLE_PATH on ROCm startup * Remove ROCm fd2 amdgpu.ids noise filter wrappers * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use PyPI bitsandbytes for amd extra to avoid malformed wheel URL * Add amd-preview extra for bitsandbytes continuous wheel channel * Keep amd extra on bitsandbytes>=0.49.1 and remove amd-preview --------- Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Summary
This PR does three related AMD startup/packaging fixes:
AMDGPU_ASIC_ID_TABLE_PATHat Unsloth startup on ROCm environments so GPU name resolution can work without requiringsudosymlinks._filter_rocm_amdgpu_ids_fd2_noisewrappers and related fd2 filtering machinery entirely.unsloth[amd]install compatibility withuvby using a normal index-based bitsandbytes requirement.Problem
On some AMD/ROCm setups:
torch.cuda.get_device_name(0)can print:/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directoryamdgpu.idsis discoverable.Separately,
uv pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"can fail with:Wheel version does not match filenamefor thebitsandbytes-1.33.7.previewURL (wheel metadata reports0.49.2.dev0).Changes
unsloth/import_fixes.pyconfigure_amdgpu_asic_id_table_path():AMDGPU_ASIC_ID_TABLE_PATHif user already set it._is_rocm_torch_build()detection).amdgpu.idspaths without importing torch:importlib.util.find_spec("torch")/usr/share/libdrm/amdgpu.ids, etc.)AMDGPU_ASIC_ID_TABLE_PATHto the first existing file._filter_rocm_amdgpu_ids_fd2_noise()and associated fd2 stderr redirection helpers.unsloth/__init__.pyconfigure_amdgpu_asic_id_table_path()early, beforedisable_broken_causal_conv1d()and before torch import paths._filter_rocm_amdgpu_ids_fd2_noise()import and all startupwithwrappers.pyproject.tomlamdextra to use a normal index-based requirement:bitsandbytes>=0.49.1(platform-marked)amd-previewextra.Why this approach
amdextra robust with both pip and uv.Validation
python -m py_compile unsloth/import_fixes.py unsloth/__init__.pyCONFIG_RETURN /usr/share/libdrm/amdgpu.idsENV_VALUE /usr/share/libdrm/amdgpu.idsuvinstall checks:amd.uv pip install --refresh "unsloth[amd] @ git+https://github.com/unslothai/unsloth@fix/auto-amdgpu-asic-id-table-path"