HDDS-13165. [Docs] Python client developer guide. #566
jojochuang wants to merge 5 commits into master from
Conversation
Hello @jojochuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces a new documentation page, interface/Python.md, providing a guide for Python developers on how to access Apache Ozone using various libraries such as PyArrow, Boto3, and the requests library for HTTPFS REST API. It also includes recipe tutorials for each access method, detailing the steps to access Ozone using PyArrow, Boto3, and HTTPFS REST API within a Docker environment. The recipes provide step-by-step instructions, code examples, and troubleshooting tips to help users quickly get started with Ozone using Python.
Highlights
- Python Client Interface Documentation: Adds a comprehensive guide for Python developers to interact with Apache Ozone using PyArrow, Boto3, and HTTPFS REST API.
- PyArrow Tutorial: Provides a step-by-step tutorial on accessing Ozone using PyArrow within a Docker environment, including setup, configuration, and code examples.
- Boto3 Tutorial: Offers a tutorial on accessing Ozone via Boto3 and the S3 Gateway, detailing the necessary steps and code snippets for a Docker-based setup.
- HTTPFS REST API Tutorial: Presents a guide on accessing Ozone using the HTTPFS REST API with Python's requests library, including configuration and code examples for writing and reading files.
Changelog
Click here to see the changelog
- `hadoop-hdds/docs/content/interface/Python.md`
  - Added a new documentation page providing an overview of accessing Apache Ozone from Python using PyArrow, Boto3, and HTTPFS REST API.
  - Includes setup instructions, code examples, and troubleshooting tips for each access method.
  - Provides links to recipe tutorials for quick start guides using Ozone's Docker image.
- `hadoop-hdds/docs/content/recipe/Boto3Tutorial.md`
  - Added a recipe tutorial for accessing Ozone using Boto3 and the S3 Gateway within a Docker environment.
  - Details the steps to start Ozone in Docker, connect to the SCM container, install Boto3, and access Ozone via Boto3.
  - Includes troubleshooting tips for common issues such as access denied errors and connection refusals.
- `hadoop-hdds/docs/content/recipe/PyArrowTutorial.md`
  - Added a recipe tutorial for accessing Ozone using PyArrow within a Docker environment.
  - Provides instructions on starting Ozone in Docker, connecting to the SCM container, installing PyArrow, and configuring Hadoop native libraries.
  - Includes troubleshooting tips for libhdfs errors and connection issues.
- `hadoop-hdds/docs/content/recipe/PythonRequestsOzoneHttpFS.md`
  - Added a recipe tutorial for accessing Ozone using the HTTPFS REST API with Python's requests library.
  - Details the steps to start Ozone in Docker, create a volume and bucket, install the requests library, and access Ozone HTTPFS via Python.
  - Includes troubleshooting tips for unauthorized access, bad requests, and connection refusals.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Python's embrace,
Ozone's data finds its place,
Through APIs flow.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request introduces comprehensive documentation for Python developers looking to integrate with Apache Ozone, covering PyArrow, Boto3, and HTTPFS REST API access methods. The guides are well-structured and the inclusion of Docker-based quickstart tutorials is a great way to help users get started quickly. The content, initially drafted by AI and then manually verified, is a valuable addition.
I've identified a few areas where clarity or correctness could be improved, particularly around port numbers, environment setup for tutorials, and library prerequisites. Addressing these will enhance the accuracy and usability of these new guides.
Overall, excellent work on putting together this much-needed documentation!
Summary of Findings
- **Future Date in Python.md Frontmatter**: The `date` field in `hadoop-hdds/docs/content/interface/Python.md` is set to `"2025-06-02"`. This might be a placeholder or an error.
- **Minor Typos**: In the `interface/Python.md` summary, "fssec" should be "fsspec"; in `interface/Python.md`, "thirdparty" should be "third-party".
- **Leftover Spark Documentation Link**: In `interface/Python.md`, the reference to Apache Spark Documentation might be a leftover, as PySpark content was intentionally excluded from this PR.
- **Docker Container Name Placeholder**: The tutorials use a specific container name prefix (`weichiu-scm-1`). While noted, consider advising users to find their SCM container name (e.g., via `docker ps`) or using a more generic placeholder like `<your-scm-container-name-or-id>`.
- **Port Discrepancy for PyArrow**: The PyArrow examples use port 9864 while `fs.defaultFS` in `core-site.xml` points to 9862. This is confusing and potentially incorrect.
- **Port Inconsistency for HTTPFS**: The main Python guide uses port 9874 for the HTTPFS `requests` example, while the tutorial and `fsspec` example use 14000. This should be consistent (likely 14000).
- **Clarity on Library Prerequisites**: `pip install fsspec` might not be enough for WebHDFS access; `fsspec[webhdfs]` or similar might be needed.
- **Misleading WebHDFS Client Reference**: The link for "WebHDFS Python Client" points to a different library than what `fsspec` uses.
- **Tutorial Script Execution Environment**: Clarity is needed on whether Python scripts in tutorials (especially Boto3) are run on the host or in-container, as this affects service endpoint URLs.
- **PyArrow Tutorial Configuration**: The PyArrow tutorial needs to clarify how `core-site.xml` is made available to PyArrow, likely by setting `HADOOP_CONF_DIR`.
Merge Readiness
The new documentation is a significant improvement and very welcome. However, due to some high and medium severity issues related to correctness of port numbers, configuration steps, and clarity of prerequisites, I recommend addressing these points before merging. This will ensure the documentation is accurate and easy for users to follow. I am not authorized to approve pull requests, so please ensure further review and approval from maintainers after addressing the feedback.
```python
# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9864)
```
The core-site.xml example (lines 83-89) defines fs.defaultFS as ofs://om:9862. However, the PyArrow HadoopFileSystem is initialized here with port 9864:
fs = pafs.HadoopFileSystem("default", 9864)
If host="default", PyArrow uses the host from fs.defaultFS (which is om). The port parameter (9864 here) then overrides the port specified in fs.defaultFS (9862).
Is port 9864 intentionally different from the 9862 specified in fs.defaultFS for OFS access? If 9862 is the correct OFS port that om listens on for this, using 9864 could lead to connection issues or confusion.
If 9862 is the intended port, consider changing this to:
fs = pafs.HadoopFileSystem("default")
(which would use the port from fs.defaultFS)
Or, if there's a specific reason for 9864 (e.g., a different service endpoint or a specific mapping in the Docker setup not detailed here), could that be clarified in the documentation?
```diff
- fs = pafs.HadoopFileSystem("default", 9864)
+ fs = pafs.HadoopFileSystem("default")  # Uses host and port from fs.defaultFS in core-site.xml
```
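To make the recommended form concrete, here is a minimal, hedged sketch of a PyArrow round trip against Ozone. The volume, bucket, and file names are hypothetical, and it assumes `core-site.xml` (with `fs.defaultFS` pointing at the Ozone Manager), `HADOOP_CONF_DIR`, `CLASSPATH`, and `ARROW_LIBHDFS_DIR` are already configured as the tutorial describes:

```python
# Sketch only: requires a reachable Ozone cluster and a working libhdfs setup.
# Volume/bucket/key names below are hypothetical.
RUN_AGAINST_CLUSTER = False  # flip to True once the cluster is reachable


def ofs_path(volume: str, bucket: str, key: str) -> str:
    """Build an absolute OFS path of the form /<volume>/<bucket>/<key>."""
    return "/" + "/".join((volume, bucket, key))


if RUN_AGAINST_CLUSTER:
    import pyarrow.fs as pafs

    # With host="default", libhdfs takes both host and port from
    # fs.defaultFS in core-site.xml, so no explicit port is passed.
    fs = pafs.HadoopFileSystem("default")

    path = ofs_path("vol1", "bucket1", "hello.txt")
    with fs.open_output_stream(path) as out:
        out.write(b"hello from pyarrow")
    with fs.open_input_stream(path) as src:
        print(src.read())
```

The guard keeps the path-building logic importable even where `pyarrow` and `libhdfs` are unavailable.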
```python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://httpfs:9874"
```
The host for the HTTPFS endpoint is set to http://httpfs:9874 here. However, the linked tutorial (recipe/PythonRequestsOzoneHttpFS.md) and the fsspec example (line 215) use port 14000. The tutorial also details configuring the httpfs Docker service to expose port 14000.
Is port 9874 a typo here? It seems inconsistent with the rest of the documentation and the tutorial setup. If 14000 is the correct port, this should be updated.
```diff
- host = "http://httpfs:9874"
+ host = "http://httpfs:14000"
```
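With the corrected port, a hedged sketch of the write/read flow with `requests` looks like the following. It assumes the HttpFS gateway from the tutorial at `http://httpfs:14000` and a `user.name` of `ozone`; the file path is hypothetical:

```python
# Sketch only: a write/read round trip against Ozone HttpFS via WebHDFS.
RUN_AGAINST_CLUSTER = False  # flip to True once HttpFS is reachable


def webhdfs_url(host: str, path: str) -> str:
    """WebHDFS v1 URL for a file path (path must start with '/')."""
    return f"{host}/webhdfs/v1{path}"


if RUN_AGAINST_CLUSTER:
    import requests

    host = "http://httpfs:14000"
    path = "/vol1/bucket1/hello.txt"  # hypothetical volume/bucket/key
    url = webhdfs_url(host, path)

    # HttpFS accepts upload data in one step when data=true is passed
    # together with an application/octet-stream content type.
    requests.put(
        url,
        params={"op": "CREATE", "user.name": "ozone",
                "data": "true", "overwrite": "true"},
        headers={"Content-Type": "application/octet-stream"},
        data=b"hello from requests",
    ).raise_for_status()

    resp = requests.get(url, params={"op": "OPEN", "user.name": "ozone"})
    resp.raise_for_status()
    print(resp.text)
```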
```python
# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9864)
```
Similar to the comment in interface/Python.md, there's a port discrepancy here. The core-site.xml (lines 90-96) defines fs.defaultFS as ofs://om:9862. Initializing HadoopFileSystem with "default" and port 9864 means it will try to connect to the om host (from fs.defaultFS) but on port 9864, overriding the port 9862 from the configuration.
If the Ozone OFS service listens on port 9862 as per fs.defaultFS, this explicit port 9864 might be incorrect or requires explanation (e.g., if it's a different endpoint or a specific mapping in the Docker setup).
For consistency and clarity, if 9862 is the correct OFS port, consider using:
fs = pafs.HadoopFileSystem("default")
This would rely entirely on the fs.defaultFS property from core-site.xml.
```diff
- fs = pafs.HadoopFileSystem("default", 9864)
+ fs = pafs.HadoopFileSystem("default")  # Uses host and port from fs.defaultFS in core-site.xml
```
> - Ozone S3 Gateway endpoint and bucket names
> - Access credentials (AWS-like key and secret)
> - For HttpFS access:
>   - Requests (`pip install requests`) or fsspec (`pip install fsspec`)
For accessing WebHDFS via fsspec, simply pip install fsspec might not be sufficient as webhdfs is often an optional dependency. Could you clarify if a more specific installation like pip install fsspec[webhdfs] or installing a separate webhdfs3 (which fsspec often uses) package is required for the fsspec.implementations.webhdfs example to work out of the box? This would help users avoid potential ModuleNotFoundError or similar issues.
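One lightweight way to surface this in the docs would be a runtime check. The sketch below assumes (as the comment above does) that fsspec's WebHDFS backend is built on the `requests` package and that an extra such as `fsspec[webhdfs]` exists to pull it in; verify the extra name against the installed fsspec version:

```python
# Sketch: check whether fsspec's WebHDFS implementation is likely usable.
# If plain `pip install fsspec` is not enough, something like
# `pip install "fsspec[webhdfs]"` (extra name assumed) adds the
# 'requests' dependency the WebHDFS backend needs.
import importlib.util


def has_webhdfs_support() -> bool:
    """True when both fsspec and its requests dependency are importable."""
    return (importlib.util.find_spec("fsspec") is not None
            and importlib.util.find_spec("requests") is not None)
```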
> - [Apache Ozone Documentation](https://ozone.apache.org/docs/)
> - [PyArrow Documentation](https://arrow.apache.org/docs/python/)
> - [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
> - [WebHDFS Python Client](https://pypi.org/project/webhdfs/)
The link [WebHDFS Python Client](https://pypi.org/project/webhdfs/) points to the webhdfs PyPI package (by mtth). However, the code example for WebHDFS access (lines 212-224) uses from fsspec.implementations.webhdfs import WebHDFS.
These are different libraries/abstractions. To avoid confusion, would it be better to link to the fsspec documentation, or specifically to documentation about its WebHDFS implementation if available? The current link might lead users to a library not used in the example.
```python
# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
```
The Python script uses endpoint_url='http://s3g:9878'. This URL works if the script is run from within the Docker network (e.g., inside another container that's part of the same docker-compose setup).
However, the tutorial implies installing boto3 in "Your Python Environment" (Step 3), which could be the host machine. If the script is run from the host machine, s3g would not resolve. Assuming the s3g service's port 9878 is mapped to the host (e.g., ports: - "9878:9878" in docker-compose.yaml, which is typical for the Ozone Docker setup), the endpoint URL should be http://localhost:9878.
Could you clarify where the user is expected to run this Python script? If on the host, the endpoint URL and the troubleshooting tip on line 113 (which mentions localhost) should align with this. If inside a container, that should be specified.
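A way to make this explicit in the tutorial would be to parameterize the endpoint on where the script runs. This is a hedged sketch: the credentials and bucket name are placeholders (in a secured cluster, keys would come from `ozone s3 getsecret`), and it assumes the usual `9878:9878` port mapping when running from the host:

```python
# Sketch only: pick the S3 Gateway endpoint based on where the script runs.
RUN_AGAINST_CLUSTER = False  # flip to True once the gateway is reachable


def s3_endpoint(inside_docker_network: bool, port: int = 9878) -> str:
    """'s3g' resolves only on the compose network; from the host, use the
    mapped localhost port instead."""
    host = "s3g" if inside_docker_network else "localhost"
    return f"http://{host}:{port}"


if RUN_AGAINST_CLUSTER:
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=s3_endpoint(inside_docker_network=False),
        aws_access_key_id="any",      # placeholder; see `ozone s3 getsecret`
        aws_secret_access_key="any",  # placeholder
    )
    s3.put_object(Bucket="bucket", Key="hello.txt", Body=b"hello from boto3")
    print(s3.get_object(Bucket="bucket", Key="hello.txt")["Body"].read())
```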
> Add the following to `/etc/hadoop/core-site.xml`:
>
> ```xml
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>ofs://om:9862</value>
>     <description>Ozone Manager endpoint</description>
>   </property>
> </configuration>
> ```
Step 5 instructs adding configuration to /etc/hadoop/core-site.xml. For PyArrow's HadoopFileSystem to pick up this configuration (especially when host="default" is used), the HADOOP_CONF_DIR environment variable typically needs to be set to the directory containing core-site.xml.
The main guide (interface/Python.md) mentions setting OZONE_CONF_DIR for PyArrow setup. Is OZONE_CONF_DIR also respected by PyArrow for finding core-site.xml, or should this tutorial explicitly mention setting HADOOP_CONF_DIR before running the Python script? Adding this detail would make the setup more robust for users.
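If the tutorial settles on `HADOOP_CONF_DIR`, the setup could be shown in a few lines before `pyarrow.fs` is imported. The paths below are illustrative assumptions and must match the actual installation:

```python
# Sketch: environment that PyArrow's libhdfs typically needs before it can
# read core-site.xml. Both paths below are hypothetical examples.
import os
import shutil
import subprocess

os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop"               # dir holding core-site.xml
os.environ["ARROW_LIBHDFS_DIR"] = "/opt/hadoop/lib/native"  # dir holding libhdfs.so

# Build CLASSPATH from the Hadoop install when the CLI is available.
if shutil.which("hadoop"):
    os.environ["CLASSPATH"] = subprocess.check_output(
        ["hadoop", "classpath", "--glob"], text=True
    ).strip()

# Import pyarrow.fs only after the variables above are in place.
```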
Change-Id: I1b30bc46aff2b5b521f44fe6d9897496cadd0522
/gemini review
Code Review
This pull request adds excellent and much-needed documentation for Python developers looking to interact with Apache Ozone. The overall structure, use of code examples, and inclusion of Docker-based recipes are commendable.
I've found a few issues, mainly related to link paths, package names in installation instructions, and some minor inconsistencies or areas for clarification in the recipes. Addressing these will significantly improve the accuracy and usability of these guides.
Summary of Findings
- **Documentation Link Accuracy**: Found issues with inconsistent relative paths for links to recipe files in `Python.md`, which will likely result in broken links (e.g., lines 92 and 201 in `Python.md`).
- **Package Installation Instructions**: The troubleshooting section in `Python.md` (line 232) lists an incorrect package (`webhdfs`) for WebHDFS access via `fsspec`. It should be `fsspec`.
- **Clarity and Consistency**: Minor issues regarding clarity in Docker Compose modification instructions and consistency in user names for examples were noted. An outdated troubleshooting tip was also found.
Merge Readiness
The documentation added in this PR is a significant step forward. However, due to the broken links and the incorrect package name in installation instructions, I recommend addressing these issues before merging. The other suggestions should also be considered to improve clarity and user experience. I am unable to approve pull requests, but once these changes are made, the documentation will be in much better shape for users.
> Try it yourself! Check out [PyArrow Tutorial](recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.
The relative path to the PyArrow tutorial seems incorrect. Given that Python.md is in content/interface/ and the recipe PyArrowTutorial.md is in content/recipe/, the link should likely use ../ to go up one level from interface before accessing recipe/.
Could you verify this path? A similar link to the Boto3 tutorial (line 130) uses ../recipe/Boto3Tutorial.md, which seems to be the correct pattern.
```diff
- Try it yourself! Check out [PyArrow Tutorial](recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.
+ [PyArrow Tutorial](../recipe/PyArrowTutorial.md)
```
> Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.
Similar to the PyArrow tutorial link, this path to the HTTPFS tutorial seems incorrect. If Python.md is in content/interface/ and recipes are in content/recipe/, this link should likely be ../recipe/PythonRequestsOzoneHttpFS.md to correctly navigate to the recipe file.
Could you please check and update this path for consistency and correctness?
```diff
- Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.
+ [Access Ozone using HTTPFS REST API Tutorial](../recipe/PythonRequestsOzoneHttpFS.md)
```
> - **Authentication Errors**: Verify credentials and Kerberos tokens (if used).
> - **Connection Issues**: Check endpoint URLs, ports, and firewall rules.
> - **FileSystem Errors**: Ensure correct Ozone configuration and appropriate permissions.
> - **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests webhdfs`).
The package webhdfs is mentioned here for installation. However, the document introduces fsspec for WebHDFS access (lines 48, 206-210), and fsspec is the common library that provides WebHDFS filesystem implementation. The webhdfs PyPI package seems to be a different, less common one.
Should this be fsspec instead of webhdfs to align with the example provided using fsspec.implementations.webhdfs?
```diff
- - **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests webhdfs`).
+ - **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests fsspec`).
```
No such command.

Change-Id: I4715a44d1e75c7e9c0f2bfb347b1ac190e5f8fc5

/gemini review
Code Review
This pull request introduces excellent and much-needed Python developer guides for accessing Apache Ozone. The documentation is well-organized, covering PyArrow, Boto3, and HttpFS (requests/fsspec) with practical Docker-based tutorials. The initial generation via ChatGPT combined with manual verification has resulted in a solid set of guides.
My feedback primarily focuses on minor clarifications and consistency improvements to enhance the developer experience further. The core content and examples are generally accurate and helpful.
Summary of Findings
- **Documentation Date**: The `date` in the `hadoop-hdds/docs/content/interface/Python.md` frontmatter is set to a future date (2025-06-02). This should be clarified or updated.
- **PyArrow `HadoopFileSystem` Port Parameter**: In both `Python.md` and `PyArrowTutorial.md`, the `HadoopFileSystem("default", port)` call includes a port number. When the host is "default", this port parameter is ignored by `libhdfs`, which relies on `fs.defaultFS` from `core-site.xml`. Suggested removing the port or adding a clarifying comment.
- **Boto3 Tutorial Clarity**: In `Boto3Tutorial.md`, a note about S3 bucket naming could be more precise. A troubleshooting tip refers to `weichiu-s3g-1` (from the generation prompt) instead of `s3g` (used in the tutorial's code).
- **Minor Typo (not commented due to severity settings)**: In `hadoop-hdds/docs/content/recipe/Boto3Tutorial.md` (line 52), "Create a a bucket" should be "Create a bucket".
Merge Readiness
The pull request significantly improves the Python client documentation for Ozone and is a valuable contribution. The content is largely accurate and well-presented.
I have requested changes for a few medium-severity issues primarily related to documentation clarity and consistency. Addressing these points will further enhance the quality and usability of these guides for developers.
As a reviewer, I am not authorized to approve pull requests. Please ensure these changes are reviewed and approved by a maintainer before merging.
```python
# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9862)
```
When host is set to "default" in pafs.HadoopFileSystem, the port parameter is ignored. The HDFS C client (libhdfs) will instead use the host and port defined in fs.defaultFS from core-site.xml.
To improve clarity and avoid potential confusion about the 9862 port parameter's role here, consider either:
1. Removing the port parameter, as it's redundant when `host="default"`.
2. Adding a comment to explain that the port parameter is ignored in this specific case.

Option 1 (removing the port) is generally cleaner.
```diff
- fs = pafs.HadoopFileSystem("default", 9862)
+ fs = pafs.HadoopFileSystem("default")
```
> ## Notes
>
> - The S3 Gateway listens on port `9878` by default.
> - The bucket name used in Boto3 should match the one created inside your `s3v` volume.
The current note says: "The bucket name used in Boto3 should match the one created inside your s3v volume."
This could be slightly confusing as s3v is the Ozone volume, and the S3 bucket name is just the Ozone bucket's name (e.g., bucket).
Could we rephrase for better clarity? For example: "The Bucket parameter in Boto3 calls (e.g., 'bucket' in the script) should be the name of the Ozone bucket created under the s3v volume (i.e., the bucket part of s3v/bucket). The s3v volume itself is implicitly handled by the S3 Gateway."
```diff
- - The bucket name used in Boto3 should match the one created inside your `s3v` volume.
+ - The `Bucket` parameter in Boto3 calls (e.g., `'bucket'` in the script) should be the name of the Ozone bucket created under the `s3v` volume (i.e., the `bucket` part of `s3v/bucket`). The `s3v` volume itself is implicitly handled by the S3 Gateway.
```
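The mapping the rephrased note describes can be stated precisely in a few lines of code. This is an illustrative helper, not part of the tutorial:

```python
# Sketch: how an Ozone bucket path maps to the Boto3 Bucket parameter.
# The S3 Gateway exposes buckets under the s3v volume, so only the bucket
# part of "s3v/<bucket>" appears in S3 calls.
def s3_bucket_name(ozone_bucket_path: str) -> str:
    volume, _, bucket = ozone_bucket_path.partition("/")
    if volume != "s3v":
        raise ValueError("S3 Gateway serves buckets from the s3v volume")
    return bucket
```

For example, an Ozone bucket created as `s3v/bucket` is addressed from Boto3 simply as `Bucket="bucket"`.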
> - **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names).
> - **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
> - **Timeout or DNS Issues**: Try replacing `weichiu-s3g-1` with `localhost` or the actual IP address depending on how your Docker network is set up.
The troubleshooting tip mentions weichiu-s3g-1, which was in the original prompt for ChatGPT. However, the example code in this tutorial (line 79) uses endpoint_url='http://s3g:9878'.
To maintain consistency with the tutorial's code, should this tip refer to s3g instead of weichiu-s3g-1?
```diff
- - **Timeout or DNS Issues**: Try replacing `weichiu-s3g-1` with `localhost` or the actual IP address depending on how your Docker network is set up.
+ Try replacing `s3g` with `localhost` or the actual IP address of the S3 Gateway container, depending on your Docker network setup and where you are running the script.
```
```python
# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9862)
```
Similar to the main Python.md guide, when host is "default" in pafs.HadoopFileSystem, the port parameter (9862 here) is ignored. The connection details are taken from fs.defaultFS in core-site.xml.
For clarity, it's recommended to either remove the port parameter or add a comment explaining its irrelevance in this context. Removing the port is cleaner.
```diff
- fs = pafs.HadoopFileSystem("default", 9862)
+ fs = pafs.HadoopFileSystem("default")
```
No such command.

Change-Id: I29bc9476a100190f3c7a927f5b68e6fbb36c5c79
/gemini

No such command.

Hi @jojochuang, it looks like the command
/gemini review

No such command.
Code Review
This pull request significantly enhances the documentation for Python developers looking to integrate with Apache Ozone. The inclusion of guides for PyArrow, Boto3, and HTTPFS (via requests and fsspec) covers the primary access methods well. The step-by-step tutorials using Docker are a fantastic way for users to quickly test and understand these integrations.
The overall structure and content are good, stemming from a well-defined prompt and manual verification. I've identified a few areas for improvement, primarily concerning clarity, consistency, and a critical configuration detail in one of the tutorials.
Summary of Findings
- **Documentation Date**: The `date` in the `Python.md` frontmatter is set to the future (2025). This should likely be a current or past date.
- **Configuration Directory Variable**: Clarification requested for `OZONE_CONF_DIR` vs. `HADOOP_CONF_DIR` in `Python.md` for broader applicability beyond the specific Docker setup.
- **User Consistency in HttpFS Examples**: In `Python.md`, the `fsspec` example uses `user='hdfs'` while the `requests` example uses `user='ozone'` for HttpFS. Consistency or an explanation for the difference is needed.
- **Troubleshooting Tip Context**: The `Boto3Tutorial.md` troubleshooting tip for `s3g` vs. `localhost` needs context clarification, as the tutorial assumes execution within a Docker container.
- **Critical Docker Compose Configuration**: The `PythonRequestsOzoneHttpFS.md` tutorial has a `docker-compose.yaml` snippet for `httpfs` that is likely missing the parent `services:` key, which would make the configuration invalid.
- **Shebang in Python Scripts (not commented)**: Python scripts in examples use `#!/usr/bin/python`. Consider `#!/usr/bin/env python3` or `#!/usr/bin/env python` for better portability. (Low severity, not commented per settings.)
- **Repeated Installation Instructions (not commented)**: Some `pip install` instructions are repeated from prerequisites within specific method sections in `Python.md`. This is minor. (Low severity, not commented per settings.)
Merge Readiness
The documentation added in this PR is valuable. However, there is one critical issue regarding the docker-compose.yaml configuration in the HTTPFS tutorial that must be addressed before merging. The other medium-severity items should also be considered for clarification and improvement to ensure the guides are as accurate and easy to follow as possible.
I am unable to approve pull requests directly. Please address the identified issues, particularly the critical one, and then seek further review and approval from other maintainers.
```yaml
httpfs:
  <<: *image
  ports:
    - 14000:14000
  environment:
    CORE-SITE.XML_fs.defaultFS: "ofs://om"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
    OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
  <<: *common-config
  command: [ "ozone","httpfs" ]
```
The YAML snippet provided for modifying docker-compose.yaml to add the httpfs service appears to be missing the parent services: key.
In a docker-compose.yaml file, service definitions like httpfs: should be nested under a top-level services: key. As written, with httpfs: indented at the same level as a potential version: key, this would likely result in an invalid docker-compose.yaml file.
Could you verify the correct structure? It should likely be:

```yaml
services:
  # ... other services (om, scm, datanode, s3g) ...
  httpfs:
    <<: *image
    ports:
      - "14000:14000"
    environment:
      CORE-SITE.XML_fs.defaultFS: "ofs://om"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
      OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
    <<: *common-config
    command: [ "ozone","httpfs" ]
```

Ensuring this structure is correct is crucial for the tutorial to work.
```diff
-httpfs:
-  <<: *image
-  ports:
-    - 14000:14000
-  environment:
-    CORE-SITE.XML_fs.defaultFS: "ofs://om"
-    CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
-    CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
-    OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
-  <<: *common-config
-  command: [ "ozone","httpfs" ]
+services:
+  httpfs:
+    <<: *image
+    ports:
+      - "14000:14000"
+    environment:
+      CORE-SITE.XML_fs.defaultFS: "ofs://om"
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+      OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
+    <<: *common-config
+    command: [ "ozone","httpfs" ]
```
> Ensure Ozone configuration files (core-site.xml and ozone-site.xml) are available and `OZONE_CONF_DIR` is set.
> Also ensure `ARROW_LIBHDFS_DIR` and `CLASSPATH` are set properly.
The document mentions ensuring OZONE_CONF_DIR is set for PyArrow configuration. The PyArrow tutorial clarifies that OZONE_CONF_DIR is set within the Ozone Docker environment.
For users setting up PyArrow access outside of that specific Docker context (e.g., on a system where Hadoop is installed separately), would HADOOP_CONF_DIR be the more conventional environment variable to point to core-site.xml and ozone-site.xml?
Could it be beneficial to add a note clarifying this, or mentioning that HADOOP_CONF_DIR might also be used depending on the environment?
```python
from fsspec.implementations.webhdfs import WebHDFS

fs = WebHDFS(host='httpfs', port=14000, user='hdfs')
```
In the fsspec (WebHDFS) code example, the user is specified as user='hdfs'. However, in the requests example for HttpFS (line 154), user = "ozone" is used, and the tutorial PythonRequestsOzoneHttpFS.md also uses user = "ozone".
Is this difference in user (hdfs vs. ozone) intentional and correct for fsspec versus requests when interacting with HttpFS in the described setup? If so, a brief note explaining why might be helpful. If not, should they be consistent (e.g., both use ozone or another user configured for proxy access)?
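If the examples are made consistent, the `fsspec` variant might look like the sketch below. It assumes the HttpFS gateway from the tutorial (`httpfs:14000`) and, matching the `requests` example, a user of `ozone`; swap in whichever user HttpFS is actually configured to accept, and note the volume/bucket names are hypothetical:

```python
# Sketch only: listing a bucket through fsspec's WebHDFS implementation.
RUN_AGAINST_CLUSTER = False  # flip to True once HttpFS is reachable


def httpfs_base_url(host: str, port: int) -> str:
    """Base URL a WebHDFS client talks to."""
    return f"http://{host}:{port}/webhdfs/v1"


if RUN_AGAINST_CLUSTER:
    from fsspec.implementations.webhdfs import WebHDFS

    fs = WebHDFS(host="httpfs", port=14000, user="ozone")
    print(fs.ls("/vol1/bucket1"))  # hypothetical volume/bucket
```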
> - **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names).
> - **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
> - **Timeout or DNS Issues**: Try replacing `s3g` with `localhost` or the actual IP address depending on how your Docker network is set up.
The troubleshooting tip suggests replacing s3g with localhost or an IP address. This is a good general tip. However, the tutorial explicitly states (line 50) that "The rest of the tutorial will run on this container" (the SCM container).
If the Python script is indeed run from within the SCM container (or any container on the same Docker network), s3g (as the S3 Gateway's service name) should resolve correctly. The localhost suggestion would typically apply if the script were run from the host machine with ports mapped, or if the S3G was on the same container (which it isn't here).
Could you clarify the context for this tip? Perhaps rephrase to something like: "If you adapt this script to run from your host machine (outside Docker), you might need to replace s3g:9878 with localhost:9878 (assuming default port mapping)."?
Change-Id: I4ce9349ee9f4e53dc4f53009c784eef04bf857f3
What changes were proposed in this pull request?
HDDS-13165. [Docs] Python client developer guide.
Please describe your PR in detail:
For interface/Python.md, the draft was generated with ChatGPT 4o using the prompt:
The PyArrow recipe draft was generated using ChatGPT 4o prompt:
The Boto3 recipe draft was generated using ChatGPT 4o prompt:
The httpfs recipe draft was generated using ChatGPT 4o, prompt:
Note: initially the draft had PySpark content. Due to the length of the content, I decided to leave it out. Will work on it in a follow-up task.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13165?filter=-1
How was this patch tested?
After Gemini/ChatGPT generated the user doc draft, I manually followed the code samples and verified the steps.