diff --git a/hadoop-hdds/docs/content/interface/Python.md b/hadoop-hdds/docs/content/interface/Python.md
new file mode 100644
index 000000000000..03237c54a2dc
--- /dev/null
+++ b/hadoop-hdds/docs/content/interface/Python.md
@@ -0,0 +1,239 @@
---
title: "Accessing Apache Ozone from Python"
date: "2025-06-02"
weight: 6
menu:
  main:
    parent: "Client Interfaces"
summary: Access Apache Ozone from Python using the PyArrow, Boto3, requests and fsspec WebHDFS libraries
---

The Apache Ozone project does not provide Python client libraries of its own.
However, several third-party open source libraries can be used to build applications that access an Ozone cluster via
different interfaces: the OFS file system, the Ozone HTTPFS REST API, and Ozone S3.

This document outlines these approaches, providing concise setup instructions and validated code examples.

## Setup and Prerequisites

Before starting, ensure the following:

- Python installed (3.x recommended)
- Apache Ozone configured and accessible
- For PyArrow with libhdfs:
  - PyArrow library (`pip install pyarrow`)
  - Hadoop native libraries configured and the Ozone classpath specified (see below for details)
- For S3 access:
  - Boto3 (`pip install boto3`)
  - Ozone S3 Gateway endpoint and bucket names
  - Access credentials (AWS-like key and secret)
- For HttpFS access:
  - Requests (`pip install requests`) or fsspec (`pip install fsspec`)

## Method 1: Access Ozone via PyArrow and libhdfs

This approach leverages PyArrow's HadoopFileSystem API, which requires the libhdfs.so native library.
libhdfs.so is not packaged with PyArrow; you must download it separately from a Hadoop release.

### Configuration

Ensure the Ozone configuration files (core-site.xml and ozone-site.xml) are available and `OZONE_CONF_DIR` is set.
Also ensure `ARROW_LIBHDFS_DIR` and `CLASSPATH` are set properly.
For example,

```shell
export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
```

### Code Example

```python
import pyarrow.fs as pafs

# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")

# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
```
> **Note:** Configure `fs.defaultFS` in core-site.xml to point to the Ozone cluster. For example,
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
```

Try it yourself! Check out the [PyArrow Tutorial](../recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image.

## Method 2: Access Ozone via Boto3 and S3 Gateway

### Configuration
- Identify your Ozone S3 Gateway endpoint (e.g., `http://s3g:9878`).
- Use AWS-compatible credentials issued by Ozone (on a secure cluster, obtain them with `ozone s3 getsecret`).

### Code Example
```python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print("Buckets:", response['Buckets'])

# Upload the file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')
print("Uploaded 'localfile.txt' to 'bucket/file.txt'")

# Download the file back
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
print("Downloaded 'file.txt' as 'downloaded.txt'")
```
> **Note:** Replace the endpoint URL, credentials, and bucket names with your setup.

Try it yourself! Check out the [Boto3 Tutorial](../recipe/Boto3Tutorial.md) for a quick start using Ozone's Docker image.
## Method 3: Access Ozone via HttpFS REST API

First, install the requests Python module:

```shell
pip install requests
```

### Configuration
- Use Ozone's HTTPFS endpoint (e.g., `http://httpfs:14000`).

### Code Example (requests)
```python
#!/usr/bin/python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://httpfs:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with a 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)

if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with the correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"

resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    exit(1)
print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
```

Try it yourself! Check out the [Access Ozone using HTTPFS REST API Tutorial](../recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image.
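The same URL pattern extends to the other WebHDFS operations. As a minimal sketch — assuming the same `http://httpfs:14000` endpoint and simple authentication as above, with `webhdfs_url` and `list_status` as hypothetical helper names — the snippet below builds the WebHDFS-style URL for a volume/bucket path and lists a bucket's contents via the `LISTSTATUS` operation:

```python
import requests

HOST = "http://httpfs:14000"  # assumed HTTPFS endpoint, as above
USER = "ozone"                # any value works in simple auth mode

def webhdfs_url(volume: str, bucket: str, key: str = "") -> str:
    """Build the /webhdfs/v1 URL for a volume/bucket (and optional key)."""
    suffix = f"/{key}" if key else ""
    return f"{HOST}/webhdfs/v1/{volume}/{bucket}{suffix}"

def list_status(volume: str, bucket: str) -> list:
    """Return the FileStatus entries for every key in the bucket."""
    resp = requests.get(webhdfs_url(volume, bucket),
                        params={"op": "LISTSTATUS", "user.name": USER})
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

# Example against a running cluster:
#   for entry in list_status("vol1", "bucket1"):
#       print(entry["pathSuffix"], entry["type"], entry["length"])
```

Unlike CREATE, LISTSTATUS returns its JSON payload directly, so no redirect handling is needed.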
### Code Example (webhdfs)

First, install the fsspec Python module:

```shell
pip install fsspec
```

```python
from fsspec.implementations.webhdfs import WebHDFS

fs = WebHDFS(host='httpfs', port=14000, user='ozone')

# Read a file from /vol1/bucket1/hello.txt
file_path = "/vol1/bucket1/hello.txt"

with fs.open(file_path, mode='rb') as f:
    content = f.read()
    print("File contents:")
    print(content.decode('utf-8'))
```
> **Note:** Replace the host, port, and path as per your setup.

## Troubleshooting Tips

- **Authentication Errors**: Verify credentials and Kerberos tokens (if used).
- **Connection Issues**: Check endpoint URLs, ports, and firewall rules.
- **FileSystem Errors**: Ensure correct Ozone configuration and appropriate permissions.
- **Missing Dependencies**: Install the required Python packages (`pip install pyarrow boto3 requests fsspec`).

## References and Further Resources

- [Apache Ozone Documentation](https://ozone.apache.org/docs/)
- [PyArrow Documentation](https://arrow.apache.org/docs/python/)
- [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- [fsspec WebHDFS Python API](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.webhdfs.WebHDFS)

diff --git a/hadoop-hdds/docs/content/recipe/Boto3Tutorial.md b/hadoop-hdds/docs/content/recipe/Boto3Tutorial.md
new file mode 100644
index 000000000000..ef1079732ddc
--- /dev/null
+++ b/hadoop-hdds/docs/content/recipe/Boto3Tutorial.md
@@ -0,0 +1,123 @@
---
title: Access Ozone using Boto3 (Docker Quickstart)
linkTitle: Boto3 Access (Docker)
description: Step-by-step tutorial for accessing Ozone from Python using Boto3 and the S3 Gateway in a Docker environment.
weight: 12
---

This tutorial demonstrates how to access Apache Ozone from Python using **Boto3**, via Ozone's S3 Gateway, with Ozone running in Docker.

## Prerequisites

- Docker and Docker Compose installed.
- Python 3.x environment.

## Steps

### 1️⃣ Start Ozone in Docker

Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes:

```bash
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
docker compose up -d --scale datanode=3
```

### 2️⃣ Connect to the SCM Container

```bash
docker exec -it <container-id> bash
```
> Change `<container-id>` to your actual SCM container id (find it with `docker ps`).

The rest of the tutorial runs inside this container.

Create a **bucket** inside the volume **s3v**:

```bash
ozone sh bucket create s3v/bucket
```

### 3️⃣ Install Boto3 in Your Python Environment

```bash
pip install boto3
```

### 4️⃣ Access Ozone via Boto3 and the S3 Gateway

Create a Python script (`ozone_boto3_example.py`) with the following content:

```python
#!/usr/bin/python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://s3g:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print("Buckets:", response['Buckets'])

# Upload the file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')
print("Uploaded 'localfile.txt' to 'bucket/file.txt'")

# Download the file back
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
print("Downloaded 'file.txt' as 'downloaded.txt'")
```

Run the script:

```bash
python ozone_boto3_example.py
```

✅ You have now accessed Ozone from Python using Boto3 and verified both upload and download operations.

## Notes

- The S3 Gateway listens on port `9878` by default.
- The `Bucket` parameter in Boto3 calls (e.g., `'bucket'` in the script) should be the name of the Ozone bucket created under the `s3v` volume (i.e., the `bucket` part of `s3v/bucket`).
  The `s3v` volume itself is implicitly handled by the S3 Gateway.
- Make sure the S3 Gateway container (`s3g`) is up and running. You can check using `docker ps`.

## Troubleshooting Tips

- **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (the Ozone S3 Gateway uses flat bucket names).
- **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint.
- **Timeout or DNS Issues**: If you adapt this script to run from your host machine (outside Docker), you might need to replace `s3g:9878` with `localhost:9878` (assuming the default port mapping).

## References

- [Apache Ozone Docker](https://github.com/apache/ozone-docker)
- [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- [Ozone S3 Docs](https://ozone.apache.org/docs/edge/interface/s3.html)
- [Ozone Securing S3 Docs](https://ozone.apache.org/docs/edge/security/securings3.html)
- [Ozone Client Interfaces](https://ozone.apache.org/docs/edge/interface.html)

diff --git a/hadoop-hdds/docs/content/recipe/PyArrowTutorial.md b/hadoop-hdds/docs/content/recipe/PyArrowTutorial.md
new file mode 100644
index 000000000000..d1a291376467
--- /dev/null
+++ b/hadoop-hdds/docs/content/recipe/PyArrowTutorial.md
@@ -0,0 +1,141 @@
---
title: Access Ozone using PyArrow (Docker Quickstart)
linkTitle: PyArrow Access (Docker)
summary: Step-by-step tutorial for accessing Ozone from Python using PyArrow in a Docker environment.
weight: 11
---

This tutorial demonstrates how to access Apache Ozone from Python using **PyArrow**, with Ozone running in Docker.

## Prerequisites

- Docker and Docker Compose installed.
- Python 3.x environment.
## Steps

### 1️⃣ Start Ozone in Docker

Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes:

```bash
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
docker compose up -d --scale datanode=3
```

### 2️⃣ Connect to the SCM Container

```bash
docker exec -it <container-id> bash
```
> Change `<container-id>` to your actual SCM container id (find it with `docker ps`).

The rest of the tutorial runs inside this container.

Create a volume and a bucket inside Ozone:

```bash
ozone sh volume create volume
ozone sh bucket create volume/bucket
```

### 3️⃣ Install PyArrow in Your Python Environment

```bash
pip install pyarrow
```

### 4️⃣ Download Hadoop Native Libraries for libhdfs Support

Depending on your system architecture, run one of the following:

For ARM64 (Apple Silicon, ARM servers):
```bash
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
```

For x86_64 (most desktops and servers):
```bash
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
```

Set environment variables to point to the native libraries and the Ozone classpath:

```bash
export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
```

### 5️⃣ Configure core-site.xml

Add the following to `/etc/hadoop/core-site.xml`:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Ozone Manager endpoint</description>
  </property>
</configuration>
```
> Note: the Docker container sets the environment variable `OZONE_CONF_DIR=/etc/hadoop/`, so it knows where to locate the configuration files.
### 6️⃣ Access Ozone Using PyArrow

Create a Python script (`ozone_pyarrow_example.py`) with the following code:

```python
#!/usr/bin/python
import pyarrow.fs as pafs

# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")

# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
```

Run the script:

```bash
python ozone_pyarrow_example.py
```

✅ Congratulations! You've successfully accessed Ozone from Python using PyArrow and Docker.

## Troubleshooting Tips

- **libhdfs Errors**: Ensure `ARROW_LIBHDFS_DIR` is set and points to the correct native library path.
- **Connection Issues**: Verify the Ozone Manager endpoint (`om:9862`) is correct and reachable.
- **Permissions**: Ensure your Ozone user has the correct permissions for the volume and bucket.

## References

- [Apache Ozone Docker](https://github.com/apache/ozone-docker)
- [PyArrow Documentation](https://arrow.apache.org/docs/python/)
- [PyArrow HadoopFileSystem Reference](https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html)
- [Ozone Client Interfaces](https://ozone.apache.org/docs/edge/interface.html)

diff --git a/hadoop-hdds/docs/content/recipe/PythonRequestsOzoneHttpFS.md b/hadoop-hdds/docs/content/recipe/PythonRequestsOzoneHttpFS.md
new file mode 100644
index 000000000000..b6428574e3e4
--- /dev/null
+++ b/hadoop-hdds/docs/content/recipe/PythonRequestsOzoneHttpFS.md
@@ -0,0 +1,170 @@
---
title: Access Ozone using HTTPFS REST API (Docker + Python Requests)
linkTitle: HTTPFS Access (Docker)
summary: Step-by-step tutorial for accessing Apache Ozone using the HTTPFS REST API via Python's requests library in a Docker-based environment.
weight: 13
---

This tutorial demonstrates how to access Apache Ozone using the HTTPFS REST API and Python's `requests` library. It covers writing and reading a file via simple authentication.

## Prerequisites

- Docker and Docker Compose installed
- Python 3.x environment

## Steps

### 1️⃣ Start Ozone in Docker

Download the latest Docker Compose configuration file:

```bash
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
```

Add the httpfs container configuration and environment variable overrides at the bottom of `docker-compose.yaml`:

```yaml
  httpfs:
    <<: *image
    ports:
      - 14000:14000
    environment:
      CORE-SITE.XML_fs.defaultFS: "ofs://om"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
      OZONE-SITE.XML_hdds.scm.safemode.min.datanode: ${OZONE_SAFEMODE_MIN_DATANODES:-1}
      <<: *common-config
    command: [ "ozone","httpfs" ]
```

Start the cluster:

```bash
docker compose up -d --scale datanode=3
```

### 2️⃣ Create a Volume and Bucket

Connect to the SCM container:

```bash
docker exec -it <container-id> bash
```
> Change `<container-id>` to your actual SCM container id (find it with `docker ps`).

The rest of the tutorial runs inside this container.
Create a volume and a bucket:

```bash
ozone sh volume create vol1
ozone sh bucket create vol1/bucket1
```

### 3️⃣ Install Required Python Library

Install the `requests` library:

```bash
pip install requests
```

### 4️⃣ Access Ozone HTTPFS via Python

Create a script (`ozone_httpfs_example.py`) with the following content:

```python
#!/usr/bin/python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://httpfs:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with a 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)

if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with the correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"

resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    exit(1)
print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
```

Run the script:

```bash
python ozone_httpfs_example.py
```

✅ If everything is configured correctly, this will
create a file in Ozone using the REST API and read it back.

## Troubleshooting Tips

- **401 Unauthorized**: Make sure `user.name` is passed as a query parameter and that the proxy user settings are correct in `core-site.xml`.
- **400 Bad Request**: Add `Content-Type: application/octet-stream` to the request headers.
- **Connection Refused**: Ensure the `httpfs` container is running and accessible at port 14000.
- **Volume or Bucket Not Found**: Confirm you created `vol1/bucket1` in step 2.

## References

- [Apache Ozone HTTPFS Docs](https://ozone.apache.org/docs/edge/interface/httpfs.html)
- [Python requests Documentation](https://requests.readthedocs.io/)
- [Ozone Client Interfaces](https://ozone.apache.org/docs/edge/interface.html)