HDDS-13165. [Docs] Python client developer guide.#566

Open
jojochuang wants to merge 5 commits into master from HDDS-13165

Conversation

@jojochuang
Owner

What changes were proposed in this pull request?

HDDS-13165. [Docs] Python client developer guide.

Please describe your PR in detail:

  • Added interface/Python.md: overall Python client access introduction.
  • Recipe: Access Ozone using PyArrow (Docker Quickstart)
  • Recipe: Access Ozone using Boto3 (Docker Quickstart)
  • Recipe: Access Ozone using HTTPFS REST API (Docker + Python Requests)

For interface/Python.md, the draft was generated using ChatGPT 4o with the prompt:

Create a user document in Markdown format for Python developers who want to access Apache Ozone. This document will be part of the Ozone Client Interfaces page: https://ozone.apache.org/docs/edge/interface.html.

📌 *Audience*: Python developers familiar with Python integration and Ozone. Skip the introduction.

📌 *Structure*:

Setup and Prerequisites:
Required libraries (PyArrow, Boto3, WebHDFS)
Required configurations (e.g., HADOOP_CONF_DIR, Ozone URIs, credentials, authentication)
Access Method 1: PyArrow with libhdfs
Setup steps (including any system paths or environment variables)
Python code sample (validate for correctness)
Access Method 2: Boto3 with Ozone S3 Gateway
Setup steps (including Ozone S3 endpoint format, bucket naming conventions, credentials)
Python code sample (validate for correctness)
Access Method 3: WebHDFS/HttpFS or REST API
Setup steps (including endpoint URL, authentication)
Python code sample (using requests or webhdfs)
Access from PySpark
Configuration settings in Spark (fs.ozone. settings)
Python code sample for reading/writing data to Ozone
Troubleshooting Tips
Common issues (e.g., authentication failures, connection errors)
Suggested debugging techniques
References and Further Resources
Links to official Ozone documentation, PyArrow, Boto3, WebHDFS, PySpark
📌 *Markdown Format*:

Use proper headers (##, ###) for each section.
Include Python syntax highlighting in code blocks (```python).
Use clear formatting and spacing for readability.
Include warnings or notes where appropriate (e.g., > *Note:*).
If applicable, include a simple diagram showing connection flows.
📌 *Quality Checks*:

Validate all code samples for correctness.
Ensure the document is clear and concise.
Focus only on actionable instructions and setup information.
Generate the complete Markdown document in response. Include a Hugo header. Include Apache License header

The PyArrow recipe draft was generated using ChatGPT 4o with the prompt:

I personally verified the following steps using Ozone's Docker image. Please rewrite in a user tutorial format.

PyArrow to access Ozone

# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml

docker compose up -d --scale datanode=3


connect to the SCM container:

docker exec -it weichiu-scm-1 bash

ozone sh volume create volume
ozone sh bucket create volume/bucket

pip install pyarrow


curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
or
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
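The two downloads above differ only in CPU architecture. As an editorial aside, a small shell sketch (assuming x86_64 as the fallback) to pick the matching tarball name:

```shell
# Choose the Hadoop 3.4.0 tarball matching the host architecture.
# aarch64/arm64 hosts need the -aarch64 build for a usable libhdfs.
arch=$(uname -m)
if [ "$arch" = "aarch64" ] || [ "$arch" = "arm64" ]; then
  tarball="hadoop-3.4.0-aarch64.tar.gz"
else
  tarball="hadoop-3.4.0.tar.gz"
fi
echo "$tarball"
```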

export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)

Add to /etc/hadoop/core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>ofs://om:9862</value>
                <description>Where HDFS NameNode can be found on the network</description>
        </property>
</configuration>
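For reference, when the `HadoopFileSystem` host is "default", libhdfs derives the endpoint from the `fs.defaultFS` value above. A stdlib-only sketch (hypothetical helper, not part of the docs) showing what that value parses to:

```python
from urllib.parse import urlparse

def om_endpoint(default_fs: str):
    """Split an ofs:// fs.defaultFS value into (host, port)."""
    parsed = urlparse(default_fs)
    return parsed.hostname, parsed.port

# The core-site.xml above points clients at the Ozone Manager on om:9862.
print(om_endpoint("ofs://om:9862"))  # ('om', 9862)
```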


Code:

#!/usr/bin/python
import pyarrow.fs as pafs

# Create a Hadoop FileSystem object.
# "default" tells libhdfs to use fs.defaultFS from core-site.xml,
# so no explicit port argument is needed here.
fs = pafs.HadoopFileSystem("default")

fs.create_dir("volume/bucket/aaa")

path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')

The Boto3 recipe draft was generated using ChatGPT 4o with the prompt:

Following the similar PyArrow using Ozone Docker image tutorial, create a similar one for boto3 using the following instructions:

ozone sh bucket create s3v/bucket



Code:

#!/usr/bin/python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://weichiu-s3g-1:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print(response['Buckets'])

# Upload a file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')

# Download a file
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
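Boto3 addresses Ozone buckets by their flat S3 name (the `bucket` part of `s3v/bucket`), and that name must follow S3 naming rules. A rough stdlib-only validity check (illustrative only; Ozone's S3 Gateway applies its own validation):

```python
import re

def looks_like_valid_s3_bucket_name(name: str) -> bool:
    """Rough S3 bucket-name check: 3-63 characters, lowercase letters,
    digits, dots and hyphens, starting and ending alphanumeric."""
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name))

print(looks_like_valid_s3_bucket_name("bucket"))    # True
print(looks_like_valid_s3_bucket_name("MyBucket"))  # False: uppercase
```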

The HttpFS recipe draft was generated using ChatGPT 4o with the prompt:

Use the following instructions to create a tutorial of accessing Ozone using HttpFS REST API via requests library

Ozone httpfs using Python requests



# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
Add to docker-compose.yaml:

   CORE-SITE.XML_fs.defaultFS: "ofs://om"
   CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
   CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"

docker compose up -d --scale datanode=3


connect to the SCM container:



ozone sh volume create vol1 
ozone sh bucket create vol1/bucket1

pip install requests

#!/usr/bin/python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://weichiu-httpfs-1:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)

if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"

resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    exit(1)
print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
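The script above builds its WebHDFS URLs inline; the same construction can be factored into a small helper (illustrative sketch, using the tutorial's endpoint and simple-auth user):

```python
from urllib.parse import urlencode

def webhdfs_url(host, volume, bucket, key, op, user="ozone", **extra):
    """Build an HttpFS WebHDFS v1 URL for an Ozone key."""
    params = {"op": op, "user.name": user, **extra}
    return f"{host}/webhdfs/v1/{volume}/{bucket}/{key}?{urlencode(params)}"

# e.g. requests.get(webhdfs_url(...)) for the read step above
print(webhdfs_url("http://weichiu-httpfs-1:14000", "vol1", "bucket1",
                  "hello.txt", "OPEN"))
```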

Note: the draft initially included PySpark content. Due to its length, I left it out and will cover it in a follow-up task.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13165?filter=-1

How was this patch tested?

After Gemini/ChatGPT generated the user doc draft, I manually followed the code samples and verified the steps.


@gemini-code-assist bot left a comment


Hello @jojochuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new documentation page, interface/Python.md, providing a guide for Python developers on how to access Apache Ozone using various libraries such as PyArrow, Boto3, and the requests library for HTTPFS REST API. It also includes recipe tutorials for each access method, detailing the steps to access Ozone using PyArrow, Boto3, and HTTPFS REST API within a Docker environment. The recipes provide step-by-step instructions, code examples, and troubleshooting tips to help users quickly get started with Ozone using Python.

Highlights

  • Python Client Interface Documentation: Adds a comprehensive guide for Python developers to interact with Apache Ozone using PyArrow, Boto3, and HTTPFS REST API.
  • PyArrow Tutorial: Provides a step-by-step tutorial on accessing Ozone using PyArrow within a Docker environment, including setup, configuration, and code examples.
  • Boto3 Tutorial: Offers a tutorial on accessing Ozone via Boto3 and the S3 Gateway, detailing the necessary steps and code snippets for a Docker-based setup.
  • HTTPFS REST API Tutorial: Presents a guide on accessing Ozone using the HTTPFS REST API with Python's requests library, including configuration and code examples for writing and reading files.

Changelog

Click here to see the changelog
  • hadoop-hdds/docs/content/interface/Python.md
    • Added a new documentation page providing an overview of accessing Apache Ozone from Python using PyArrow, Boto3, and HTTPFS REST API.
    • Includes setup instructions, code examples, and troubleshooting tips for each access method.
    • Provides links to recipe tutorials for quick start guides using Ozone's Docker image.
  • hadoop-hdds/docs/content/recipe/Boto3Tutorial.md
    • Added a recipe tutorial for accessing Ozone using Boto3 and the S3 Gateway within a Docker environment.
    • Details the steps to start Ozone in Docker, connect to the SCM container, install Boto3, and access Ozone via Boto3.
    • Includes troubleshooting tips for common issues such as access denied errors and connection refusals.
  • hadoop-hdds/docs/content/recipe/PyArrowTutorial.md
    • Added a recipe tutorial for accessing Ozone using PyArrow within a Docker environment.
    • Provides instructions on starting Ozone in Docker, connecting to the SCM container, installing PyArrow, and configuring Hadoop native libraries.
    • Includes troubleshooting tips for libhdfs errors and connection issues.
  • hadoop-hdds/docs/content/recipe/PythonRequestsOzoneHttpFS.md
    • Added a recipe tutorial for accessing Ozone using the HTTPFS REST API with Python's requests library.
    • Details the steps to start Ozone in Docker, create a volume and bucket, install the requests library, and access Ozone HTTPFS via Python.
    • Includes troubleshooting tips for unauthorized access, bad requests, and connection refusals.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Python's embrace,
Ozone's data finds its place,
Through APIs flow.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive documentation for Python developers looking to integrate with Apache Ozone, covering PyArrow, Boto3, and HTTPFS REST API access methods. The guides are well-structured and the inclusion of Docker-based quickstart tutorials is a great way to help users get started quickly. The content, initially drafted by AI and then manually verified, is a valuable addition.

I've identified a few areas where clarity or correctness could be improved, particularly around port numbers, environment setup for tutorials, and library prerequisites. Addressing these will enhance the accuracy and usability of these new guides.

Overall, excellent work on putting together this much-needed documentation!

Summary of Findings

  • Future Date in Python.md Frontmatter: The date field in hadoop-hdds/docs/content/interface/Python.md is set to "2025-06-02". This might be a placeholder or an error.
  • Minor Typos: Found a few minor typos:
  • In interface/Python.md summary: "fssec" should be "fsspec".
  • In interface/Python.md: "thirdparty" should be "third-party".
  • Leftover Spark Documentation Link: In interface/Python.md, the reference to Apache Spark Documentation might be a leftover, as PySpark content was intentionally excluded from this PR.
  • Docker Container Name Placeholder: The tutorials use a specific container name prefix (weichiu-scm-1). While noted, consider advising users to find their SCM container name (e.g., via docker ps) or using a more generic placeholder like <your-scm-container-name-or-id>.
  • Port Discrepancy for PyArrow: The PyArrow examples use port 9864 while fs.defaultFS in core-site.xml points to 9862. This is confusing and potentially incorrect.
  • Port Inconsistency for HTTPFS: The main Python guide uses port 9874 for HTTPFS requests example, while the tutorial and fsspec example use 14000. This should be consistent (likely 14000).
  • Clarity on Library Prerequisites: pip install fsspec might not be enough for WebHDFS access; fsspec[webhdfs] or similar might be needed.
  • Misleading WebHDFS Client Reference: The link for 'WebHDFS Python Client' points to a different library than what fsspec uses.
  • Tutorial Script Execution Environment: Clarity is needed on whether Python scripts in tutorials (especially Boto3) are run on the host or in-container, as this affects service endpoint URLs.
  • PyArrow Tutorial Configuration: The PyArrow tutorial needs to clarify how core-site.xml is made available to PyArrow, likely by setting HADOOP_CONF_DIR.

Merge Readiness

The new documentation is a significant improvement and very welcome. However, due to some high and medium severity issues related to correctness of port numbers, configuration steps, and clarity of prerequisites, I recommend addressing these points before merging. This will ensure the documentation is accurate and easy for users to follow. I am not authorized to approve pull requests, so please ensure further review and approval from maintainers after addressing the feedback.


# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9864)

high

The core-site.xml example (lines 83-89) defines fs.defaultFS as ofs://om:9862. However, the PyArrow HadoopFileSystem is initialized here with port 9864:
fs = pafs.HadoopFileSystem("default", 9864)

If host="default", PyArrow uses the host from fs.defaultFS (which is om). The port parameter (9864 here) then overrides the port specified in fs.defaultFS (9862).

Is port 9864 intentionally different from the 9862 specified in fs.defaultFS for OFS access? If 9862 is the correct OFS port that om listens on for this, using 9864 could lead to connection issues or confusion.

If 9862 is the intended port, consider changing this to:
fs = pafs.HadoopFileSystem("default")
(which would use the port from fs.defaultFS)

Or, if there's a specific reason for 9864 (e.g., a different service endpoint or a specific mapping in the Docker setup not detailed here), could that be clarified in the documentation?

Suggested change
fs = pafs.HadoopFileSystem("default", 9864)
fs = pafs.HadoopFileSystem("default") # Uses host and port from fs.defaultFS in core-site.xml

import requests

# Ozone HTTPFS endpoint and file path
host = "http://httpfs:9874"

high

The host for the HTTPFS endpoint is set to http://httpfs:9874 here. However, the linked tutorial (recipe/PythonRequestsOzoneHttpFS.md) and the fsspec example (line 215) use port 14000. The tutorial also details configuring the httpfs Docker service to expose port 14000.

Is port 9874 a typo here? It seems inconsistent with the rest of the documentation and the tutorial setup. If 14000 is the correct port, this should be updated.

Suggested change
host = "http://httpfs:9874"
host = "http://httpfs:14000"


# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9864)

high

Similar to the comment in interface/Python.md, there's a port discrepancy here. The core-site.xml (lines 90-96) defines fs.defaultFS as ofs://om:9862. Initializing HadoopFileSystem with "default" and port 9864 means it will try to connect to the om host (from fs.defaultFS) but on port 9864, overriding the port 9862 from the configuration.

If the Ozone OFS service listens on port 9862 as per fs.defaultFS, this explicit port 9864 might be incorrect or requires explanation (e.g., if it's a different endpoint or a specific mapping in the Docker setup).

For consistency and clarity, if 9862 is the correct OFS port, consider using:
fs = pafs.HadoopFileSystem("default")
This would rely entirely on the fs.defaultFS property from core-site.xml.

Suggested change
fs = pafs.HadoopFileSystem("default", 9864)
fs = pafs.HadoopFileSystem("default") # Uses host and port from fs.defaultFS in core-site.xml

- Ozone S3 Gateway endpoint and bucket names
- Access credentials (AWS-like key and secret)
- For HttpFS access:
- Requests (`pip install requests`) or fsspec (`pip install fsspec`)

medium

For accessing WebHDFS via fsspec, simply pip install fsspec might not be sufficient as webhdfs is often an optional dependency. Could you clarify if a more specific installation like pip install fsspec[webhdfs] or installing a separate webhdfs3 (which fsspec often uses) package is required for the fsspec.implementations.webhdfs example to work out of the box? This would help users avoid potential ModuleNotFoundError or similar issues.

- [Apache Ozone Documentation](https://ozone.apache.org/docs/)
- [PyArrow Documentation](https://arrow.apache.org/docs/python/)
- [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- [WebHDFS Python Client](https://pypi.org/project/webhdfs/)

medium

The link [WebHDFS Python Client](https://pypi.org/project/webhdfs/) points to the webhdfs PyPI package (by mtth). However, the code example for WebHDFS access (lines 212-224) uses from fsspec.implementations.webhdfs import WebHDFS.

These are different libraries/abstractions. To avoid confusion, would it be better to link to the fsspec documentation, or specifically to documentation about its WebHDFS implementation if available? The current link might lead users to a library not used in the example.

# Configure Boto3 client
s3 = boto3.client(
's3',
endpoint_url='http://s3g:9878',

medium

The Python script uses endpoint_url='http://s3g:9878'. This URL works if the script is run from within the Docker network (e.g., inside another container that's part of the same docker-compose setup).

However, the tutorial implies installing boto3 in "Your Python Environment" (Step 3), which could be the host machine. If the script is run from the host machine, s3g would not resolve. Assuming the s3g service's port 9878 is mapped to the host (e.g., ports: - "9878:9878" in docker-compose.yaml, which is typical for the Ozone Docker setup), the endpoint URL should be http://localhost:9878.

Could you clarify where the user is expected to run this Python script? If on the host, the endpoint URL and the troubleshooting tip on line 113 (which mentions localhost) should align with this. If inside a container, that should be specified.

Comment on lines +87 to +97
Add the following to `/etc/hadoop/core-site.xml`:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
```

medium

Step 5 instructs adding configuration to /etc/hadoop/core-site.xml. For PyArrow's HadoopFileSystem to pick up this configuration (especially when host="default" is used), the HADOOP_CONF_DIR environment variable typically needs to be set to the directory containing core-site.xml.

The main guide (interface/Python.md) mentions setting OZONE_CONF_DIR for PyArrow setup. Is OZONE_CONF_DIR also respected by PyArrow for finding core-site.xml, or should this tutorial explicitly mention setting HADOOP_CONF_DIR before running the Python script? Adding this detail would make the setup more robust for users.
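One way to address this question (a sketch, assuming core-site.xml lives in /etc/hadoop as in the tutorial) would be to export HADOOP_CONF_DIR before running the Python script:

```shell
# Point libhdfs at the directory containing core-site.xml
export HADOOP_CONF_DIR=/etc/hadoop
# Inside the SCM container the conf dir is typically also made visible
# via the classpath, e.g.:
#   export CLASSPATH=$(ozone classpath ozone-tools):$HADOOP_CONF_DIR
echo "$HADOOP_CONF_DIR"
```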

Change-Id: I1b30bc46aff2b5b521f44fe6d9897496cadd0522
@jojochuang
Owner Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request adds excellent and much-needed documentation for Python developers looking to interact with Apache Ozone. The overall structure, use of code examples, and inclusion of Docker-based recipes are commendable.

I've found a few issues, mainly related to link paths, package names in installation instructions, and some minor inconsistencies or areas for clarification in the recipes. Addressing these will significantly improve the accuracy and usability of these guides.

Summary of Findings

  • Documentation Link Accuracy: Found issues with inconsistent relative paths for links to recipe files in Python.md, which will likely result in broken links (e.g., lines 92, 201 in Python.md).
  • Package Installation Instructions: The troubleshooting section in Python.md (line 232) lists an incorrect package (webhdfs) for WebHDFS access via fsspec. It should be fsspec.
  • Clarity and Consistency: Minor issues regarding clarity in Docker Compose modification instructions and consistency in user names for examples were noted. An outdated troubleshooting tip was also found.

Merge Readiness

The documentation added in this PR is a significant step forward. However, due to the broken links and the incorrect package name in installation instructions, I recommend addressing these issues before merging. The other suggestions should also be considered to improve clarity and user experience. I am unable to approve pull requests, but once these changes are made, the documentation will be in much better shape for users.

</configuration>
```

Try it yourself! Check out [PyArrow Tutorial](recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.

high

The relative path to the PyArrow tutorial seems incorrect. Given that Python.md is in content/interface/ and the recipe PyArrowTutorial.md is in content/recipe/, the link should likely use ../ to go up one level from interface before accessing recipe/.

Could you verify this path? A similar link to the Boto3 tutorial (line 130) uses ../recipe/Boto3Tutorial.md, which seems to be the correct pattern.

Suggested change
Try it yourself! Check out [PyArrow Tutorial](recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.
[PyArrow Tutorial](../recipe/PyArrowTutorial.md)

print(resp_read.text)
```

Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.

high

Similar to the PyArrow tutorial link, this path to the HTTPFS tutorial seems incorrect. If Python.md is in content/interface/ and recipes are in content/recipe/, this link should likely be ../recipe/PythonRequestsOzoneHttpFS.md to correctly navigate to the recipe file.

Could you please check and update this path for consistency and correctness?

Suggested change
Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.
[Access Ozone using HTTPFS REST API Tutorial](../recipe/PythonRequestsOzoneHttpFS.md)

- **Authentication Errors**: Verify credentials and Kerberos tokens (if used).
- **Connection Issues**: Check endpoint URLs, ports, and firewall rules.
- **FileSystem Errors**: Ensure correct Ozone configuration and appropriate permissions.
- **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests webhdfs`).

medium

The package webhdfs is mentioned here for installation. However, the document introduces fsspec for WebHDFS access (lines 48, 206-210), and fsspec is the common library that provides WebHDFS filesystem implementation. The webhdfs PyPI package seems to be a different, less common one.

Should this be fsspec instead of webhdfs to align with the example provided using fsspec.implementations.webhdfs?

Suggested change
- **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests webhdfs`).
- **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests fsspec`).

@github-actions

github-actions bot commented Jun 3, 2025

No such command: gemini. Available commands:

  • /close : Close pending pull request temporary
  • /help : Show all the available comment commands
  • /label : add new label to the issue: /label <label>
  • /pending : Add a REQUESTED_CHANGE type review to mark issue non-mergeable: /pending <reason>
  • /ready : Dismiss all the blocking reviews by github-actions bot
  • /retest : provide help on how to trigger new CI build

Change-Id: I4715a44d1e75c7e9c0f2bfb347b1ac190e5f8fc5
@jojochuang
Owner Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces excellent and much-needed Python developer guides for accessing Apache Ozone. The documentation is well-organized, covering PyArrow, Boto3, and HttpFS (requests/fsspec) with practical Docker-based tutorials. The initial generation via ChatGPT combined with manual verification has resulted in a solid set of guides.

My feedback primarily focuses on minor clarifications and consistency improvements to enhance the developer experience further. The core content and examples are generally accurate and helpful.

Summary of Findings

  • Documentation Date: The date in hadoop-hdds/docs/content/interface/Python.md frontmatter is set to a future date (2025-06-02). This should be clarified or updated.
  • PyArrow HadoopFileSystem Port Parameter: In both Python.md and PyArrowTutorial.md, the HadoopFileSystem("default", port) call includes a port number. When the host is "default", this port parameter is ignored by libhdfs, which relies on fs.defaultFS from core-site.xml. Suggested removing the port or adding a clarifying comment.
  • Boto3 Tutorial Clarity: In Boto3Tutorial.md, a note about S3 bucket naming could be more precise. A troubleshooting tip refers to weichiu-s3g-1 (from the generation prompt) instead of s3g (used in the tutorial's code).
  • Minor Typo (Not commented due to severity settings): In hadoop-hdds/docs/content/recipe/Boto3Tutorial.md (line 52), there's a minor typo: "Create a a bucket" should be "Create a bucket".

Merge Readiness

The pull request significantly improves the Python client documentation for Ozone and is a valuable contribution. The content is largely accurate and well-presented.

I have requested changes for a few medium-severity issues primarily related to documentation clarity and consistency. Addressing these points will further enhance the quality and usability of these guides for developers.

As a reviewer, I am not authorized to approve pull requests. Please ensure these changes are reviewed and approved by a maintainer before merging.


# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9862)

medium

When host is set to "default" in pafs.HadoopFileSystem, the port parameter is ignored. The HDFS C client (libhdfs) will instead use the host and port defined in fs.defaultFS from core-site.xml.

To improve clarity and avoid potential confusion about the 9862 port parameter's role here, consider either:

  1. Removing the port parameter, as it's redundant when host="default".
  2. Adding a comment to explain that the port parameter is ignored in this specific case.

Option 1 (removing the port) is generally cleaner.

Suggested change
fs = pafs.HadoopFileSystem("default", 9862)
fs = pafs.HadoopFileSystem("default")

## Notes

- The S3 Gateway listens on port `9878` by default.
- The bucket name used in Boto3 should match the one created inside your `s3v` volume.

medium

The current note says: "The bucket name used in Boto3 should match the one created inside your s3v volume."
This could be slightly confusing as s3v is the Ozone volume, and the S3 bucket name is just the Ozone bucket's name (e.g., bucket).

Could we rephrase for better clarity? For example: "The Bucket parameter in Boto3 calls (e.g., 'bucket' in the script) should be the name of the Ozone bucket created under the s3v volume (i.e., the bucket part of s3v/bucket). The s3v volume itself is implicitly handled by the S3 Gateway."

Suggested change
- The bucket name used in Boto3 should match the one created inside your `s3v` volume.
The `Bucket` parameter in Boto3 calls (e.g., `'bucket'` in the script) should be the name of the Ozone bucket created under the `s3v` volume (i.e., the `bucket` part of `s3v/bucket`). The `s3v` volume itself is implicitly handled by the S3 Gateway.
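The mapping between Ozone paths under the `s3v` volume and Boto3's `Bucket`/`Key` parameters could also be sketched as a small hypothetical conversion (the helper name is illustrative):

```python
def ofs_to_s3(ozone_path):
    """Map an Ozone path "s3v/<bucket>/<key>" to a Boto3 (Bucket, Key) pair.

    The S3 Gateway exposes only buckets under the s3v volume, and the
    volume name itself never appears in S3 API calls.
    """
    parts = ozone_path.strip("/").split("/", 2)
    if parts[0] != "s3v":
        raise ValueError("S3 Gateway only serves buckets in the s3v volume")
    bucket = parts[1]
    key = parts[2] if len(parts) > 2 else ""
    return bucket, key
```

For example, `ofs_to_s3("s3v/bucket/dir/file.txt")` yields `("bucket", "dir/file.txt")`, i.e. the arguments you would pass as `s3.put_object(Bucket="bucket", Key="dir/file.txt", ...)`.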


- **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names).
- **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
- **Timeout or DNS Issues**: Try replacing `weichiu-s3g-1` with `localhost` or the actual IP address depending on how your Docker network is set up.

medium

The troubleshooting tip mentions weichiu-s3g-1, which was in the original prompt for ChatGPT. However, the example code in this tutorial (line 79) uses endpoint_url='http://s3g:9878'.

To maintain consistency with the tutorial's code, should this tip refer to s3g instead of weichiu-s3g-1?

Suggested change
- **Timeout or DNS Issues**: Try replacing `weichiu-s3g-1` with `localhost` or the actual IP address depending on how your Docker network is set up.
Try replacing `s3g` with `localhost` or the actual IP address of the S3 Gateway container, depending on your Docker network setup and where you are running the script.


# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default", 9862)

medium

Similar to the main Python.md guide, when host is "default" in pafs.HadoopFileSystem, the port parameter (9862 here) is ignored. The connection details are taken from fs.defaultFS in core-site.xml.

For clarity, it's recommended to either remove the port parameter or add a comment explaining its irrelevance in this context. Removing the port is cleaner.

Suggested change
fs = pafs.HadoopFileSystem("default", 9862)
fs = pafs.HadoopFileSystem("default")

@github-actions

github-actions bot commented Jun 3, 2025

No such command: gemini. Available commands:

  • /close : Close pending pull request temporary
  • /help : Show all the available comment commands
  • /label : add new label to the issue: /label <label>
  • /pending : Add a REQUESTED_CHANGE type review to mark issue non-mergeable: /pending <reason>
  • /ready : Dismiss all the blocking reviews by github-actions bot
  • /retest : provide help on how to trigger new CI build

Change-Id: I29bc9476a100190f3c7a927f5b68e6fbb36c5c79
@jojochuang
Owner Author

/gemini

@github-actions

github-actions bot commented Jun 4, 2025

No such command: gemini. Available commands:

  • /close : Close pending pull request temporary
  • /help : Show all the available comment commands
  • /label : add new label to the issue: /label <label>
  • /pending : Add a REQUESTED_CHANGE type review to mark issue non-mergeable: /pending <reason>
  • /ready : Dismiss all the blocking reviews by github-actions bot
  • /retest : provide help on how to trigger new CI build

@gemini-code-assist

Hi @jojochuang, it looks like the command /gemini is not a recognized command. If you intended to request a code review, please use the command /gemini review. You can also see the list of available commands by typing /gemini help.

@jojochuang
Owner Author

/gemini review

@github-actions

github-actions bot commented Jun 4, 2025

No such command: gemini. Available commands:

  • /close : Close pending pull request temporary
  • /help : Show all the available comment commands
  • /label : add new label to the issue: /label <label>
  • /pending : Add a REQUESTED_CHANGE type review to mark issue non-mergeable: /pending <reason>
  • /ready : Dismiss all the blocking reviews by github-actions bot
  • /retest : provide help on how to trigger new CI build

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the documentation for Python developers looking to integrate with Apache Ozone. The inclusion of guides for PyArrow, Boto3, and HTTPFS (via requests and fsspec) covers the primary access methods well. The step-by-step tutorials using Docker are a fantastic way for users to quickly test and understand these integrations.

The overall structure and content are good, stemming from a well-defined prompt and manual verification. I've identified a few areas for improvement, primarily concerning clarity, consistency, and a critical configuration detail in one of the tutorials.

Summary of Findings

  • Documentation Date: The date in Python.md frontmatter is set to the future (2025). This should likely be a current or past date.
  • Configuration Directory Variable: Clarification requested for OZONE_CONF_DIR vs. HADOOP_CONF_DIR in Python.md for broader applicability beyond the specific Docker setup.
  • User Consistency in HttpFS Examples: In Python.md, the fsspec example uses user='hdfs' while the requests example uses user='ozone' for HttpFS. Consistency or explanation for the difference is needed.
  • Troubleshooting Tip Context: The Boto3Tutorial.md troubleshooting tip for s3g vs. localhost needs context clarification, as the tutorial assumes execution within a Docker container.
  • Critical Docker Compose Configuration: The PythonRequestsOzoneHttpFS.md tutorial has a docker-compose.yaml snippet for httpfs that is likely missing the parent services: key, which would make the configuration invalid.
  • Shebang in Python Scripts (Not Commented): Python scripts in examples use #!/usr/bin/python. Consider #!/usr/bin/env python3 or #!/usr/bin/env python for better portability. (Low severity, not commented per settings)
  • Repeated Installation Instructions (Not Commented): Some pip install instructions are repeated from prerequisites within specific method sections in Python.md. This is minor. (Low severity, not commented per settings)

Merge Readiness

The documentation added in this PR is valuable. However, there is one critical issue regarding the docker-compose.yaml configuration in the HTTPFS tutorial that must be addressed before merging. The other medium-severity items should also be considered for clarification and improvement to ensure the guides are as accurate and easy to follow as possible.

I am unable to approve pull requests directly. Please address the identified issues, particularly the critical one, and then seek further review and approval from other maintainers.

Comment on lines +44 to +56
```yaml
httpfs:
  <<: *image
  ports:
    - 14000:14000
  environment:
    CORE-SITE.XML_fs.defaultFS: "ofs://om"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
    OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
    <<: *common-config
  command: [ "ozone","httpfs" ]
```

critical

The YAML snippet provided for modifying docker-compose.yaml to add the httpfs service appears to be missing the parent services: key.

In a docker-compose.yaml file, service definitions like httpfs: should be nested under a top-level services: key. As written, with httpfs: indented at the same level as a potential version: key, this would likely result in an invalid docker-compose.yaml file.

Could you verify the correct structure? It should likely be:

services:
  # ... other services (om, scm, datanode, s3g) ...
  httpfs:
    <<: *image
    ports:
      - "14000:14000"
    environment:
      CORE-SITE.XML_fs.defaultFS: "ofs://om"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
      OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
      <<: *common-config
    command: [ "ozone","httpfs" ]

Ensuring this structure is correct is crucial for the tutorial to work.

Suggested change
```yaml
httpfs:
  <<: *image
  ports:
    - 14000:14000
  environment:
    CORE-SITE.XML_fs.defaultFS: "ofs://om"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
    CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
    OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
    <<: *common-config
  command: [ "ozone","httpfs" ]
```
services:
  httpfs:
    <<: *image
    ports:
      - "14000:14000"
    environment:
      CORE-SITE.XML_fs.defaultFS: "ofs://om"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
      OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
      <<: *common-config
    command: [ "ozone","httpfs" ]

Comment on lines +56 to +57
Ensure Ozone configuration files (core-site.xml and ozone-site.xml) are available and `OZONE_CONF_DIR` is set.
Also ensure `ARROW_LIBHDFS_DIR` and `CLASSPATH` are set properly.

medium

The document mentions ensuring OZONE_CONF_DIR is set for PyArrow configuration. The PyArrow tutorial clarifies that OZONE_CONF_DIR is set within the Ozone Docker environment.

For users setting up PyArrow access outside of that specific Docker context (e.g., on a system where Hadoop is installed separately), would HADOOP_CONF_DIR be the more conventional environment variable to point to core-site.xml and ozone-site.xml?

Could it be beneficial to add a note clarifying this, or mentioning that HADOOP_CONF_DIR might also be used depending on the environment?

```python
from fsspec.implementations.webhdfs import WebHDFS

fs = WebHDFS(host='httpfs', port=14000, user='hdfs')
```

medium

In the fsspec (WebHDFS) code example, the user is specified as user='hdfs'. However, in the requests example for HttpFS (line 154), user = "ozone" is used, and the tutorial PythonRequestsOzoneHttpFS.md also uses user = "ozone".

Is this difference in user (hdfs vs. ozone) intentional and correct for fsspec versus requests when interacting with HttpFS in the described setup? If so, a brief note explaining why might be helpful. If not, should they be consistent (e.g., both use ozone or another user configured for proxy access)?
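Whichever user is chosen, both clients pass it the same way: fsspec's `user=` and the `requests` example's `user` variable both surface as the `user.name` query parameter in the WebHDFS REST URL. A hypothetical sketch of that URL construction (the helper name is illustrative):

```python
def webhdfs_url(host, port, path, op, user):
    """Build an HttpFS/WebHDFS REST URL; `user` becomes user.name."""
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}&user.name={user}"
```

For example, `webhdfs_url("httpfs", 14000, "/vol1/bucket1", "LISTSTATUS", "ozone")` produces `http://httpfs:14000/webhdfs/v1/vol1/bucket1?op=LISTSTATUS&user.name=ozone`, so a mismatch between `hdfs` and `ozone` in the two examples would hit HttpFS as two different users.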


- **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names).
- **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
- **Timeout or DNS Issues**: Try replacing `s3g` with `localhost` or the actual IP address depending on how your Docker network is set up.

medium

The troubleshooting tip suggests replacing s3g with localhost or an IP address. This is a good general tip. However, the tutorial explicitly states (line 50) that "The rest of the tutorial will run on this container" (the SCM container).

If the Python script is indeed run from within the SCM container (or any container on the same Docker network), s3g (as the S3 Gateway's service name) should resolve correctly. The localhost suggestion would typically apply if the script were run from the host machine with ports mapped, or if the S3G was on the same container (which it isn't here).

Could you clarify the context for this tip? Perhaps rephrase to something like: "If you adapt this script to run from your host machine (outside Docker), you might need to replace s3g:9878 with localhost:9878 (assuming default port mapping)."?
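The distinction could be made explicit with a small hypothetical helper that picks Boto3's `endpoint_url` based on where the script runs (names are illustrative; `9878` is the default S3 Gateway port):

```python
def s3_endpoint(inside_compose_network, host="localhost", port=9878):
    """Pick the S3 Gateway endpoint for Boto3's endpoint_url.

    Inside the Docker compose network the service name "s3g" resolves
    directly; from the host machine, use localhost (assuming the default
    port mapping) or the gateway container's IP address.
    """
    return f"http://{'s3g' if inside_compose_network else host}:{port}"
```

A script running inside the SCM container would use `s3_endpoint(True)` (`http://s3g:9878`), while the same script on the host machine would use `s3_endpoint(False)`.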

Change-Id: I4ce9349ee9f4e53dc4f53009c784eef04bf857f3
