
Conversation

@jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-13165. [Docs] Python client developer guide.

Please describe your PR in detail:

  • Added interface/Python.md: an overall introduction to Python client access.
  • Recipe: Access Ozone using PyArrow (Docker Quickstart)
  • Recipe: Access Ozone using Boto3 (Docker Quickstart)
  • Recipe: Access Ozone using HTTPFS REST API (Docker + Python Requests)

For interface/Python.md, the draft was generated with ChatGPT 4o using the prompt:

Create a user document in Markdown format for Python developers who want to access Apache Ozone. This document will be part of the Ozone Client Interfaces page: https://ozone.apache.org/docs/edge/interface.html.

📌 *Audience*: Python developers familiar with Python integration and Ozone. Skip the introduction.

📌 *Structure*:

Setup and Prerequisites:
Required libraries (PyArrow, Boto3, WebHDFS)
Required configurations (e.g., HADOOP_CONF_DIR, Ozone URIs, credentials, authentication)
Access Method 1: PyArrow with libhdfs
Setup steps (including any system paths or environment variables)
Python code sample (validate for correctness)
Access Method 2: Boto3 with Ozone S3 Gateway
Setup steps (including Ozone S3 endpoint format, bucket naming conventions, credentials)
Python code sample (validate for correctness)
Access Method 3: WebHDFS/HttpFS or REST API
Setup steps (including endpoint URL, authentication)
Python code sample (using requests or webhdfs)
Access from PySpark
Configuration settings in Spark (fs.ozone. settings)
Python code sample for reading/writing data to Ozone
Troubleshooting Tips
Common issues (e.g., authentication failures, connection errors)
Suggested debugging techniques
References and Further Resources
Links to official Ozone documentation, PyArrow, Boto3, WebHDFS, PySpark
📌 *Markdown Format*:

Use proper headers (##, ###) for each section.
Include Python syntax highlighting in code blocks (```python).
Use clear formatting and spacing for readability.
Include warnings or notes where appropriate (e.g., > *Note:*).
If applicable, include a simple diagram showing connection flows.
📌 *Quality Checks*:

Validate all code samples for correctness.
Ensure the document is clear and concise.
Focus only on actionable instructions and setup information.
Generate the complete Markdown document in response. Include a Hugo header. Include an Apache License header.

The PyArrow recipe draft was generated with ChatGPT 4o using the prompt:

I personally verified the following steps using Ozone's Docker image. Please rewrite in a user tutorial format.

PyArrow to access Ozone

# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml

docker compose up -d --scale datanode=3


Connect to the SCM container:

docker exec -it weichiu-scm-1 bash

ozone sh volume create volume
ozone sh bucket create volume/bucket

pip install pyarrow


# For ARM (aarch64) hosts:
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
# Or, for x86_64 hosts:
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'

export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
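
ARROW_LIBHDFS_DIR tells PyArrow where to find the native libhdfs library, and CLASSPATH must contain the Ozone client jars. A minimal pre-flight check (a hypothetical sketch; the variable names simply match the exports above):

#!/usr/bin/python
import os

# Both variables must be set in the shell that launches Python,
# or HadoopFileSystem() will fail to load libhdfs.
for var in ("ARROW_LIBHDFS_DIR", "CLASSPATH"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set; export it before running this script")
    print(f"{var} is set")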

Add the following to /etc/hadoop/core-site.xml:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>ofs://om:9862</value>
                <description>Where the Ozone Manager can be found on the network</description>
        </property>
</configuration>
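
With fs.defaultFS set, passing "default" as the host lets libhdfs resolve the Ozone Manager address from core-site.xml. A quick connectivity check along those lines (a sketch, assuming libhdfs honors fs.defaultFS when the host is "default"):

#!/usr/bin/python
import pyarrow.fs as pafs

# "default" defers to fs.defaultFS in core-site.xml; port 0 lets the
# configured value take effect (assumption based on libhdfs behavior).
fs = pafs.HadoopFileSystem("default", 0)
print(fs.get_file_info("/"))  # expect a FileInfo entry for the Ozone root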


Code:

#!/usr/bin/python
import pyarrow.fs as pafs

# Create the Hadoop FileSystem object backed by libhdfs
fs = pafs.HadoopFileSystem("default", 9864)

# Create a directory (key prefix) inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write a small file into the bucket
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
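
To confirm the write, read the file back and list the bucket with the same filesystem object. A follow-up sketch using the standard pyarrow.fs API (continuing from the sample above):

# Read the file back
with fs.open_input_stream("volume/bucket/file1") as stream:
    print(stream.read())  # b'data'

# List everything under the bucket
selector = pafs.FileSelector("volume/bucket", recursive=True)
for info in fs.get_file_info(selector):
    print(info.path, info.type, info.size)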

The Boto3 recipe draft was generated with ChatGPT 4o using the prompt:

Following the PyArrow tutorial with the Ozone Docker image, create a similar one for Boto3 using the following instructions:

ozone sh bucket create s3v/bucket



Code:

#!/usr/bin/python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://weichiu-s3g-1:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print(response['Buckets'])

# Upload a file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')

# Download a file
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
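
A few follow-up calls can verify the upload. These are standard Boto3 client operations, continuing from the script above:

# List the objects in the bucket
response = s3.list_objects_v2(Bucket='bucket')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])

# Read the object body directly, without writing to disk
body = s3.get_object(Bucket='bucket', Key='file.txt')['Body'].read()
print(body.decode())

# Clean up
s3.delete_object(Bucket='bucket', Key='file.txt')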

The HttpFS recipe draft was generated with ChatGPT 4o using the prompt:

Use the following instructions to create a tutorial on accessing Ozone via the HttpFS REST API using the requests library

Ozone httpfs using Python requests



# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
Add to docker-compose.yaml (under the environment section):

   CORE-SITE.XML_fs.defaultFS: "ofs://om"
   CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
   CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"

docker compose up -d --scale datanode=3


Connect to the SCM container:



ozone sh volume create vol1 
ozone sh bucket create vol1/bucket1

pip install requests

#!/usr/bin/python
import sys

import requests

# Ozone HTTPFS endpoint and file path
host = "http://weichiu-httpfs-1:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)

if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    sys.exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"

resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    sys.exit(1)
print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
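
Other WebHDFS operations follow the same request pattern. A sketch reusing the host, volume, bucket, user, and path variables from the script above (LISTSTATUS and DELETE are standard WebHDFS operations, though these particular calls were not part of the verified steps):

# List the bucket contents
params_list = {"op": "LISTSTATUS", "user.name": user}
resp_list = requests.get(f"{host}/webhdfs/v1/{volume}/{bucket}", params=params_list)
resp_list.raise_for_status()
for status in resp_list.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"], status["length"])

# Delete the file when done
params_delete = {"op": "DELETE", "user.name": user}
resp_delete = requests.delete(host + path, params=params_delete)
print("Deleted:", resp_delete.json())  # {"boolean": true} on success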

Note: the draft initially included PySpark content, but due to its length I decided to leave it out. I will work on it in a follow-up task.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13165?filter=-1

How was this patch tested?

After Gemini/ChatGPT generated the user doc draft, I manually followed the code samples and verified the steps.

@jojochuang jojochuang requested a review from Copilot June 3, 2025 23:13
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds comprehensive documentation for Python client access to Apache Ozone. It includes three new tutorial documents—one each for HTTPFS via Python Requests, PyArrow, and Boto3—and an updated interface document summarizing the available Python access methods.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File | Description
docs/content/recipe/PythonRequestsOzoneHttpFS.md | New tutorial for accessing Ozone via the HTTPFS REST API using requests
docs/content/recipe/PyArrowTutorial.md | New tutorial for accessing Ozone via PyArrow in a Docker environment
docs/content/recipe/Boto3Tutorial.md | New tutorial for accessing Ozone via Boto3 using the S3 Gateway
docs/content/interface/Python.md | Updated overview of Python client interfaces with configuration and code examples

Member

@peterxcli peterxcli left a comment


Thanks @jojochuang for this docs

@adoroszlai added the "documentation" label (Improvements or additions to documentation) Jun 4, 2025
@jojochuang jojochuang marked this pull request as ready for review June 4, 2025 06:07
@jojochuang
Contributor Author

Opening this up for review.

Please also check out the review comments previously made by Gemini: jojochuang#566

@jojochuang
Contributor Author

Adding @adoroszlai for the parts regarding HttpFS.

Member

@peterxcli peterxcli left a comment


Thanks @jojochuang for working on this.

@peterxcli peterxcli merged commit 31d13de into apache:master Jun 8, 2025
14 of 25 checks passed
jojochuang added a commit to jojochuang/ozone that referenced this pull request Jun 8, 2025
aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 9, 2025
…239-container-reconciliation

sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Jun 16, 2025