add stem dependencies in main python sandbox#1099
Conversation
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
📝 WalkthroughWalkthroughThis pull request introduces STEM library support to the sandbox environment by adding a new requirements file and updating the Dockerfile to install these dependencies during the build process. The changes include copying the requirements file, setting environment variables, installing the uv package manager, and installing all STEM-related dependencies. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
requirements/stem.txt (2)
1-201: Consider pinning package versions for reproducibility and build stability.All packages are specified without version constraints. This can lead to:
- Non-reproducible Docker builds (different versions may be installed on different build dates)
- Incompatible transitive dependencies
- Difficult troubleshooting if a package release introduces breaking changes
For production images, version pinning is a best practice.
Example approach:
- arxiv + arxiv==2.1.0 - beautifulsoup4 + beautifulsoup4==4.12.2Alternatively, use a lock file approach (e.g.,
pip-compile,poetry.lock, oruv.lock) to generate pinned versions from a higher-level specification.
1-201: Document the rationale for the STEM dependency list.The file contains 200+ packages spanning many domains (astronomy, chemistry, biology, mathematics, etc.). Consider adding a comment or README explaining:
- What STEM functionality is expected to be available in the sandbox
- Whether this full list is required for all use cases, or if it should be modularized
- How to maintain and update this list over time
This will help future developers understand the scope and avoid accumulating unused dependencies.
dockerfiles/Dockerfile.sandbox (1)
72-80: Minor: Consider consistency in package manager usage.The Dockerfile uses
uvfor STEM requirements (lines 78–80) butpipdirectly forgdown(line 85). While both work, usinguv pipconsistently would be slightly cleaner:RUN mkdir -p /data && uv pip install gdown && \This is optional and does not affect functionality.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
dockerfiles/Dockerfile.sandbox(1 hunks)requirements/stem.txt(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (2)
dockerfiles/Dockerfile.sandbox (1)
68-80: STEM installation flow is well-structured, but will fail due to invalid packages in requirements.The Dockerfile changes correctly:
- Separate STEM requirements from base Python dependencies
- Set environment variables for reproducible and efficient pip operations
- Use
uvfor fast, reliable package installation- Order operations correctly (install uv → upgrade pip → install requirements)
However, the build will fail because
requirements/stem.txtcontains invalid packages (e.g.,HTMLParserat line 62), so this Dockerfile change is dependent on fixing the requirements file first.Ensure that all invalid packages are removed from
requirements/stem.txtbefore merging, otherwise the Docker build will fail with a pip error like:ERROR: Could not find a version that satisfies the requirement HTMLParserrequirements/stem.txt (1)
31-31: All specified packages exist on PyPI and are legitimate.The packages cited as potentially invalid are confirmed to be real PyPI packages:
data(v0.4) — handles text/bytes/file-like inputs uniformlydiff(v2023.12.6) — defines a difference protocol for objectsocl(v0.12) — compiles Python code to C99/OpenCL/JavaScriptquote(v3.0) — Goodreads Quote API wrapperutils(v1.0.2) — utility functions libraryWhile these names are common, each corresponds to a specific, functional package on PyPI. No changes are needed.
| grep | ||
| gutenbergpy | ||
| html5lib | ||
| HTMLParser |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the requirements/stem.txt file to see the context
fd -type f -name "stem.txt" | head -5Repository: NVIDIA-NeMo/Skills
Length of output: 232
🌐 Web query:
HTMLParser PyPI package
💡 Result:
The PyPI package "HTMLParser" (package name: HTMLParser) is a small backport of Python 2.7’s HTMLParser for older Python 2.x (last release 0.0.2 in 2013). It is not a modern or maintained parser for current Python 3 — for Python 3 use the stdlib html.parser.HTMLParser or, for robust HTML handling, libraries like BeautifulSoup (bs4) or lxml. Sources: PyPI package page and Python stdlib docs. [1] [2]
Sources
- HTMLParser (PyPI): backport of HTMLParser from Python 2.7. [1]
- html.parser — Python standard library (HTMLParser class). [2]
🏁 Script executed:
# Check if requirements directory exists and show its contents
find . -type f -name "*.txt" -path "*/requirements/*" 2>/dev/null | sortRepository: NVIDIA-NeMo/Skills
Length of output: 229
🏁 Script executed:
# Read the requirements/stem.txt file with line numbers
cat -n requirements/stem.txt | head -70Repository: NVIDIA-NeMo/Skills
Length of output: 1175
🏁 Script executed:
# Also check for Python version requirements in setup files
fd -type f \( -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) | head -5Repository: NVIDIA-NeMo/Skills
Length of output: 232
🏁 Script executed:
# Check for Python version specifications
grep -r "python_requires" . --include="*.py" --include="*.toml" --include="*.cfg" 2>/dev/null | head -10Repository: NVIDIA-NeMo/Skills
Length of output: 44
🏁 Script executed:
# Check for .python-version or similar files
cat .python-version 2>/dev/null || cat runtime.txt 2>/dev/null || echo "No version file found"Repository: NVIDIA-NeMo/Skills
Length of output: 83
🏁 Script executed:
# Look for Dockerfile or similar to understand build context
find . -type f -name "Dockerfile*" | head -3Repository: NVIDIA-NeMo/Skills
Length of output: 165
🏁 Script executed:
# Check the Dockerfile to understand the build context
cat dockerfiles/Dockerfile.nemo-skills | head -100Repository: NVIDIA-NeMo/Skills
Length of output: 2978
🏁 Script executed:
# Try to determine if HTMLParser package can be installed (simulate pip check)
# We can use pip to query package info
python3 << 'EOF'
import urllib.request
import json
# Check PyPI for HTMLParser package details
try:
url = "https://pypi.org/pypi/HTMLParser/json"
with urllib.request.urlopen(url, timeout=5) as response:
data = json.loads(response.read().decode())
print(f"Package: {data['info']['name']}")
print(f"Version: {data['info']['version']}")
print(f"Summary: {data['info']['summary']}")
print(f"Home Page: {data['info']['home_page']}")
print(f"Requires Python: {data['info']['requires_python']}")
print(f"Last updated: {data['info']['last_updated']}")
except Exception as e:
print(f"Error: {e}")
EOFRepository: NVIDIA-NeMo/Skills
Length of output: 148
Remove HTMLParser—it's an unmaintained Python 2 backport unsuitable for Python 3.10.
HTMLParser (line 62) is available on PyPI but as version 0.0.2 (last released 2013), a backport of Python 2.7's HTMLParser. The Dockerfile uses Python 3.10, making this package redundant and inappropriate. For HTML parsing, use the built-in html.parser module from the standard library, or rely on the already-present alternatives: beautifulsoup4 (line 8) or html5lib (line 61).
- HTMLParser📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| HTMLParser |
🤖 Prompt for AI Agents
requirements/stem.txt line 62: remove the third-party "HTMLParser" entry because
it's an unmaintained Python 2 backport and redundant for Python 3.10; delete
that line from the requirements file and, if any project code imports HTMLParser
from a package, replace it to use the standard-library html.parser or the
existing beautifulsoup4/html5lib dependencies instead.
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
I dropped DDGS (package for duckduckgo) but I kept requests (a general library for requesting info). @ekmb let me know if you want to drop this.
I ran it locally. Please check if it works on CI/CD.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.