239 changes: 239 additions & 0 deletions hadoop-hdds/docs/content/interface/Python.md
---
title: "Accessing Apache Ozone from Python"
date: "2025-06-02"
weight: 6
menu:
main:
parent: "Client Interfaces"
summary: Access Apache Ozone from Python using the PyArrow, Boto3, requests, and fsspec (WebHDFS) libraries
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

The Apache Ozone project does not provide Python client libraries of its own.
However, several third-party open source libraries can be used to build applications that access an Ozone cluster through
different interfaces: the OFS file system, the Ozone HTTPFS REST API, and the Ozone S3 Gateway.

This document outlines these approaches, providing concise setup instructions and validated code examples.

## Setup and Prerequisites

Before starting, ensure the following:

- Python installed (3.x recommended)
- Apache Ozone configured and accessible
- For PyArrow with libhdfs:
- PyArrow library (`pip install pyarrow`)
- Hadoop native libraries configured and Ozone classpath specified (see below for details)
- For S3 access:
- Boto3 (`pip install boto3`)
- Ozone S3 Gateway endpoint and bucket names
- Access credentials (AWS-like key and secret)
- For HTTPFS access:
- Requests (`pip install requests`) or fsspec (`pip install fsspec`)

## Method 1: Access Ozone via PyArrow and libhdfs

This approach leverages PyArrow's HadoopFileSystem API, which requires the native libhdfs.so library.
libhdfs.so is not packaged with PyArrow; you must obtain it separately from a Hadoop distribution.

### Configuration
Ensure Ozone configuration files (core-site.xml and ozone-site.xml) are available and `OZONE_CONF_DIR` is set.
Also ensure `ARROW_LIBHDFS_DIR` and `CLASSPATH` are set properly.

For example:
```shell
export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
```
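If you work from a notebook or a script rather than a shell, the same variables can be set from Python before the first connection. A minimal sketch, assuming a Hadoop install under `/opt/hadoop-3.4.0` (adjust to your layout) and the `ozone` CLI on the `PATH`:

```python
import os
import subprocess

# Set the native library dir and classpath before connecting
# (the paths here are assumptions; adjust to your installation).
os.environ["ARROW_LIBHDFS_DIR"] = "/opt/hadoop-3.4.0/lib/native"
os.environ["CLASSPATH"] = subprocess.check_output(
    ["ozone", "classpath", "ozone-tools"], text=True
).strip()
```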

### Code Example
```python
import pyarrow.fs as pafs

# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")

# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
```
> **Note:** Configure `fs.defaultFS` in core-site.xml to point to the Ozone cluster. For example:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
```
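To verify the write, the file can be read back and the bucket listed with the same `HadoopFileSystem` handle. A short sketch continuing the example above (same `volume/bucket` layout assumed):

```python
import pyarrow.fs as pafs

fs = pafs.HadoopFileSystem("default")

# Read the file written above back into memory
with fs.open_input_stream("volume/bucket/file1") as stream:
    print(stream.read())  # b'data'

# List everything under the bucket
infos = fs.get_file_info(pafs.FileSelector("volume/bucket", recursive=True))
for info in infos:
    print(info.path, info.type)
```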

Try it yourself! Check out [PyArrow Tutorial](../recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.

## Method 2: Access Ozone via Boto3 and S3 Gateway

### Configuration
- Identify your Ozone S3 Gateway endpoint (e.g., `http://s3g:9878`).
- Use AWS-compatible credentials (from Ozone).

### Code Example
```python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print("Buckets:", response['Buckets'])

# Upload the file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')
print("Uploaded 'localfile.txt' to 'bucket/file.txt'")

# Download the file back
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
print("Downloaded 'file.txt' as 'downloaded.txt'")
```
> **Note:** Replace endpoint URL, credentials, and bucket names with your setup.
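Beyond file transfer, objects can be listed and read directly into memory. A small sketch under the same endpoint, credential, and bucket assumptions as the example above:

```python
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List objects under the bucket
resp = s3.list_objects_v2(Bucket='bucket')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])

# Read an object's contents into memory without writing to disk
body = s3.get_object(Bucket='bucket', Key='file.txt')['Body'].read()
print(body.decode('utf-8'))
```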

Try it yourself! Check out [Boto3 Tutorial](../recipe/Boto3Tutorial.md) for a quick start using Ozone's Docker image.


## Method 3: Access Ozone via HTTPFS REST API

First, install the `requests` Python module:

```shell
pip install requests
```

### Configuration
- Use Ozone’s HTTPFS endpoint (e.g., `http://httpfs:14000`).

### Code Example (requests)
```python
#!/usr/bin/python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://httpfs:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)

if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"

resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    exit(1)
print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
```
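The same REST pattern covers metadata operations. A sketch of listing a bucket and fetching a single file's status with the standard WebHDFS operations `LISTSTATUS` and `GETFILESTATUS`, reusing the endpoint and names from the example above:

```python
import requests

host = "http://httpfs:14000"
user = "ozone"

# List the contents of the bucket
resp = requests.get(
    host + "/webhdfs/v1/vol1/bucket1",
    params={"op": "LISTSTATUS", "user.name": user},
)
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"], status["length"])

# Fetch the status of a single file
resp = requests.get(
    host + "/webhdfs/v1/vol1/bucket1/hello.txt",
    params={"op": "GETFILESTATUS", "user.name": user},
)
print(resp.json()["FileStatus"])
```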

Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](../recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.


### Code Example (fsspec WebHDFS)

First, install the `fsspec` Python module:

```shell
pip install fsspec
```

```python
from fsspec.implementations.webhdfs import WebHDFS

fs = WebHDFS(host='httpfs', port=14000, user='ozone')

# Read a file from /vol1/bucket1/hello.txt
file_path = "/vol1/bucket1/hello.txt"

with fs.open(file_path, mode='rb') as f:
    content = f.read()
print("File contents:")
print(content.decode('utf-8'))
```
> **Note:** Replace host, port, and path as per your setup.
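fsspec exposes the usual filesystem verbs on the same object. A short sketch of listing and writing under the same host, port, and path assumptions (whether writes work end-to-end depends on your HTTPFS setup):

```python
from fsspec.implementations.webhdfs import WebHDFS

fs = WebHDFS(host='httpfs', port=14000, user='ozone')

# List the bucket
print(fs.ls("/vol1/bucket1", detail=False))

# Write a new file through the same interface
with fs.open("/vol1/bucket1/hello2.txt", mode='wb') as f:
    f.write(b"Written via fsspec WebHDFS\n")
```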

## Troubleshooting Tips

- **Authentication Errors**: Verify credentials and Kerberos tokens (if used).
- **Connection Issues**: Check endpoint URLs, ports, and firewall rules.
- **FileSystem Errors**: Ensure correct Ozone configuration and appropriate permissions.
- **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests fsspec`).

## References and Further Resources

- [Apache Ozone Documentation](https://ozone.apache.org/docs/)
- [PyArrow Documentation](https://arrow.apache.org/docs/python/)
- [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- [fsspec WebHDFS Python API](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.webhdfs.WebHDFS)
123 changes: 123 additions & 0 deletions hadoop-hdds/docs/content/recipe/Boto3Tutorial.md
---
title: Access Ozone using Boto3 (Docker Quickstart)
linkTitle: Boto3 Access (Docker)
description: Step-by-step tutorial for accessing Ozone from Python using Boto3 and the S3 Gateway in a Docker environment.
weight: 12
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

This tutorial demonstrates how to access Apache Ozone from Python using **Boto3**, via Ozone's S3 Gateway, with Ozone running in Docker.

## Prerequisites

- Docker and Docker Compose installed.
- Python 3.x environment.

## Steps

### 1️⃣ Start Ozone in Docker

Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes:

```bash
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
docker compose up -d --scale datanode=3
```

### 2️⃣ Connect to the SCM Container

```bash
docker exec -it <your-scm-container-name-or-id> bash
```
> Replace `<your-scm-container-name-or-id>` with your actual SCM container name or ID (find it with `docker ps`).

The rest of the tutorial runs inside this container.

Create a **bucket** inside the volume **s3v**:

```bash
ozone sh bucket create s3v/bucket
```

### 3️⃣ Install Boto3 in Your Python Environment

```bash
pip install boto3
```

### 4️⃣ Access Ozone via Boto3 and the S3 Gateway

Create a Python script (`ozone_boto3_example.py`) with the following content:

```python
#!/usr/bin/python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print("Buckets:", response['Buckets'])

# Upload the file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')
print("Uploaded 'localfile.txt' to 'bucket/file.txt'")

# Download the file back
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
print("Downloaded 'file.txt' as 'downloaded.txt'")
```

Run the script:

```bash
python ozone_boto3_example.py
```

✅ You have now accessed Ozone from Python using Boto3 and verified both upload and download operations.

## Notes

- The S3 Gateway listens on port `9878` by default.
- The `Bucket` parameter in Boto3 calls (e.g., `'bucket'` in the script) is the name of the Ozone bucket created under the `s3v` volume (the `bucket` part of `s3v/bucket`); the `s3v` volume itself is handled implicitly by the S3 Gateway. Buckets can also be created from Python, as shown in the sketch below.
- Make sure the S3 Gateway container (`s3g`) is up and running. You can check using `docker ps`.
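Instead of the `ozone sh` CLI, a bucket under `s3v` can also be created through the S3 API. A minimal sketch using the same client settings as the script above (the bucket name `bucket2` is just an example):

```python
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# Creates s3v/bucket2 in Ozone; the s3v volume is implied by the gateway
s3.create_bucket(Bucket='bucket2')
print(s3.list_buckets()['Buckets'])
```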

## Troubleshooting Tips

- **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names).
- **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
- **Timeout or DNS Issues**: If you adapt this script to run from your host machine (outside Docker), you may need to replace `s3g:9878` with `localhost:9878` (assuming the default port mapping).

## References

- [Apache Ozone Docker](https://github.com/apache/ozone-docker)
- [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- [Ozone S3 Docs](https://ozone.apache.org/docs/edge/interface/s3.html)
- [Ozone Securing S3 Docs](https://ozone.apache.org/docs/edge/security/securings3.html)
- [Ozone Client Interfaces](https://ozone.apache.org/docs/edge/interface.html)