HDDS-13165. [Docs] Python client developer guide. #566
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,239 @@ | ||
| --- | ||
| title: "Accessing Apache Ozone from Python" | ||
| date: "2025-06-02" | ||
| weight: 6 | ||
| menu: | ||
| main: | ||
| parent: "Client Interfaces" | ||
| summary: Access Apache Ozone from Python using PyArrow, Boto3, requests and fsspec WebHDFS libraries | ||
| --- | ||
|
|
||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| The Apache Ozone project itself does not provide Python client libraries. | ||
| However, several third-party open source libraries can be used to build applications to access an Ozone cluster via | ||
| different interfaces: OFS file system, Ozone HTTPFS REST API and Ozone S3. | ||
|
|
||
| This document outlines these approaches, providing concise setup instructions and validated code examples. | ||
|
|
||
| ## Setup and Prerequisites | ||
|
|
||
| Before starting, ensure the following: | ||
|
|
||
| - Python installed (3.x recommended) | ||
| - Apache Ozone configured and accessible | ||
| - For PyArrow with libhdfs: | ||
| - PyArrow library (`pip install pyarrow`) | ||
| - Hadoop native libraries configured and Ozone classpath specified (see below for details) | ||
| - For S3 access: | ||
| - Boto3 (`pip install boto3`) | ||
| - Ozone S3 Gateway endpoint and bucket names | ||
| - Access credentials (AWS-like key and secret) | ||
| - For HttpFS access: | ||
| - Requests (`pip install requests`) or fsspec (`pip install fsspec`) | ||
|
|
||
| ## Method 1: Access Ozone via PyArrow and libhdfs | ||
|
|
||
| This approach uses PyArrow's HadoopFileSystem API, which requires the `libhdfs.so` native library. | ||
| `libhdfs.so` is not packaged with PyArrow; you must obtain it separately from a Hadoop distribution. | ||
|
|
||
| ### Configuration | ||
| Ensure Ozone configuration files (core-site.xml and ozone-site.xml) are available and `OZONE_CONF_DIR` is set. | ||
| Also ensure `ARROW_LIBHDFS_DIR` and `CLASSPATH` are set properly. | ||
|
Review comment on lines +56 to +57:
The document mentions ensuring For users setting up PyArrow access outside of that specific Docker context (e.g., on a system where Hadoop is installed separately), would Could it be beneficial to add a note clarifying this, or mentioning that |
||
|
|
||
| For example, | ||
| ```shell | ||
| export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/ | ||
| export CLASSPATH=$(ozone classpath ozone-tools) | ||
| ``` | ||
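A minimal pre-flight sketch, assuming the three environment variables named above are the ones your setup relies on. It reports which of them are unset before `HadoopFileSystem` is constructed. The helper name `missing_env_vars` is hypothetical and is not part of PyArrow or Ozone.

```python
import os

# Environment variables this guide asks you to set for PyArrow + libhdfs.
REQUIRED_VARS = ("OZONE_CONF_DIR", "ARROW_LIBHDFS_DIR", "CLASSPATH")


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these before creating HadoopFileSystem:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running a check like this first can make a misconfigured shell easier to spot than an error raised from inside the native library.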
|
|
||
| ### Code Example | ||
| ```python | ||
| import pyarrow.fs as pafs | ||
|
|
||
| # Connect to Ozone using HadoopFileSystem | ||
| # "default" tells PyArrow to use the fs.defaultFS property from core-site.xml | ||
| fs = pafs.HadoopFileSystem("default") | ||
|
|
||
| # Create a directory inside the bucket | ||
| fs.create_dir("volume/bucket/aaa") | ||
|
|
||
| # Write data to a file | ||
| path = "volume/bucket/file1" | ||
| with fs.open_output_stream(path) as stream: | ||
|     stream.write(b'data') | ||
| ``` | ||
| > **Note:** Configure `fs.defaultFS` in `core-site.xml` to point to the Ozone cluster. For example: | ||
| ```xml | ||
| <configuration> | ||
|   <property> | ||
|     <name>fs.defaultFS</name> | ||
|     <value>ofs://om:9862</value> | ||
|     <description>Ozone Manager endpoint</description> | ||
|   </property> | ||
| </configuration> | ||
| ``` | ||
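Because `HadoopFileSystem("default")` depends on `fs.defaultFS`, it can help to confirm what value the configuration file actually contains. The following is a sketch using only the standard library; the function name `read_default_fs` and the inline sample are illustrative assumptions, and in practice you would read the text from the `core-site.xml` under `OZONE_CONF_DIR`.

```python
import xml.etree.ElementTree as ET


def read_default_fs(core_site_xml):
    """Return the fs.defaultFS value from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "fs.defaultFS":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


# Inline sample mirroring the configuration shown above.
SAMPLE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
  </property>
</configuration>
"""

print(read_default_fs(SAMPLE))  # prints ofs://om:9862
```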
|
|
||
| Try it yourself! Check out [PyArrow Tutorial](../recipe/PyArrowTutorial.md) for a quick start using Ozone's Docker image. | ||
|
|
||
| ## Method 2: Access Ozone via Boto3 and S3 Gateway | ||
|
|
||
| ### Configuration | ||
| - Identify your Ozone S3 Gateway endpoint (e.g., `http://s3g:9878`). | ||
| - Use AWS-compatible credentials (from Ozone). | ||
|
|
||
| ### Code Example | ||
| ```python | ||
| import boto3 | ||
|
|
||
| # Create a local file to upload | ||
| with open("localfile.txt", "w") as f: | ||
|     f.write("Hello from Ozone via Boto3!\n") | ||
|
|
||
| # Configure Boto3 client | ||
| s3 = boto3.client( | ||
|     's3', | ||
|     endpoint_url='http://s3g:9878', | ||
|     aws_access_key_id='ozone-access-key', | ||
|     aws_secret_access_key='ozone-secret-key' | ||
| ) | ||
|
|
||
| # List buckets | ||
| response = s3.list_buckets() | ||
| print("Buckets:", response['Buckets']) | ||
|
|
||
| # Upload the file | ||
| s3.upload_file('localfile.txt', 'bucket', 'file.txt') | ||
| print("Uploaded 'localfile.txt' to 'bucket/file.txt'") | ||
|
|
||
| # Download the file back | ||
| s3.download_file('bucket', 'file.txt', 'downloaded.txt') | ||
| print("Downloaded 'file.txt' as 'downloaded.txt'") | ||
| ``` | ||
| > **Note:** Replace endpoint URL, credentials, and bucket names with your setup. | ||
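One way to avoid hardcoding the endpoint and credentials is to read them from the environment. This is a sketch under stated assumptions: `OZONE_S3_ENDPOINT` is a hypothetical variable name chosen for illustration, while `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard names that Boto3 can also pick up on its own. The helper `s3_client_kwargs` is not part of Boto3.

```python
import os


def s3_client_kwargs(env=None):
    """Build keyword arguments for boto3.client('s3', ...) from the environment."""
    env = os.environ if env is None else env
    kwargs = {
        # OZONE_S3_ENDPOINT is a hypothetical variable name used only in this sketch.
        "endpoint_url": env.get("OZONE_S3_ENDPOINT", "http://s3g:9878"),
    }
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    # Pass credentials explicitly only when both are present.
    if access_key and secret_key:
        kwargs["aws_access_key_id"] = access_key
        kwargs["aws_secret_access_key"] = secret_key
    return kwargs


# Usage (requires boto3 and a reachable S3 Gateway):
#   import boto3
#   s3 = boto3.client("s3", **s3_client_kwargs())
```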
|
|
||
| Try it yourself! Check out [Boto3 Tutorial](../recipe/Boto3Tutorial.md) for a quick start using Ozone's Docker image. | ||
|
|
||
|
|
||
| ## Method 3: Access Ozone via HttpFS REST API | ||
| First, install the `requests` Python module: | ||
|
|
||
| ```shell | ||
| pip install requests | ||
| ``` | ||
|
|
||
| ### Configuration | ||
| - Use Ozone’s HTTPFS endpoint (e.g., `http://httpfs:14000`). | ||
|
|
||
| ### Code Example (requests) | ||
| ```python | ||
| #!/usr/bin/python | ||
| import requests | ||
|
|
||
| # Ozone HTTPFS endpoint and file path | ||
| host = "http://httpfs:14000" | ||
| volume = "vol1" | ||
| bucket = "bucket1" | ||
| filename = "hello.txt" | ||
| path = f"/webhdfs/v1/{volume}/{bucket}/{filename}" | ||
| user = "ozone" # can be any value in simple auth mode | ||
|
|
||
| # Step 1: Initiate file creation (responds with 307 redirect) | ||
| params_create = { | ||
|     "op": "CREATE", | ||
|     "overwrite": "true", | ||
|     "user.name": user | ||
| } | ||
|
|
||
| print("Creating file...") | ||
| resp_create = requests.put(host + path, params=params_create, allow_redirects=False) | ||
|
|
||
| if resp_create.status_code != 307: | ||
|     print(f"Unexpected response: {resp_create.status_code}") | ||
|     print(resp_create.text) | ||
|     raise SystemExit(1) | ||
|
|
||
| redirect_url = resp_create.headers['Location'] | ||
| print(f"Redirected to: {redirect_url}") | ||
|
|
||
| # Step 2: Write data to the redirected location with correct headers | ||
| headers = {"Content-Type": "application/octet-stream"} | ||
| content = b"Hello from Ozone HTTPFS!\n" | ||
|
|
||
| resp_upload = requests.put(redirect_url, data=content, headers=headers) | ||
| if resp_upload.status_code != 201: | ||
|     print(f"Upload failed: {resp_upload.status_code}") | ||
|     print(resp_upload.text) | ||
|     raise SystemExit(1) | ||
| print("File created successfully.") | ||
|
|
||
| # Step 3: Read the file back | ||
| params_open = { | ||
|     "op": "OPEN", | ||
|     "user.name": user | ||
| } | ||
|
|
||
| print("Reading file...") | ||
| resp_read = requests.get(host + path, params=params_open, allow_redirects=True) | ||
| if resp_read.ok: | ||
|     print("File contents:") | ||
|     print(resp_read.text) | ||
| else: | ||
|     print(f"Read failed: {resp_read.status_code}") | ||
|     print(resp_read.text) | ||
| ``` | ||
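The example above assembles each request from a host, a `/webhdfs/v1` path, and query parameters. A small sketch of that URL construction, using only the standard library, might look like the following. The helper name `webhdfs_url` is hypothetical, and escaping the path with `quote` is an addition for paths containing spaces or other special characters.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, user, **extra):
    """Build a WebHDFS-style URL like the ones used in the requests example."""
    params = {"op": op, "user.name": user}
    params.update(extra)
    # quote() keeps "/" intact but escapes spaces and other unsafe characters.
    return f"{host}/webhdfs/v1{quote(path)}?{urlencode(params)}"


print(webhdfs_url("http://httpfs:14000", "/vol1/bucket1/hello.txt", "OPEN", "ozone"))
# prints http://httpfs:14000/webhdfs/v1/vol1/bucket1/hello.txt?op=OPEN&user.name=ozone
```

The resulting string can be passed to `requests.get` or `requests.put` directly, instead of supplying `params=` separately.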
|
|
||
| Try it yourself! Check out [Access Ozone using HTTPFS REST API Tutorial](../recipe/PythonRequestsOzoneHttpFS.md) for a quick start using Ozone's Docker image. | ||
|
|
||
|
|
||
| ### Code Example (webhdfs) | ||
|
|
||
| First, install the `fsspec` Python module: | ||
|
|
||
| ```shell | ||
| pip install fsspec | ||
| ``` | ||
|
|
||
| ```python | ||
| from fsspec.implementations.webhdfs import WebHDFS | ||
|
|
||
| fs = WebHDFS(host='httpfs', port=14000, user='ozone') | ||
|
|
||
| # Read a file from /vol1/bucket1/hello.txt | ||
| file_path = "/vol1/bucket1/hello.txt" | ||
|
|
||
| with fs.open(file_path, mode='rb') as f: | ||
|     content = f.read() | ||
| print("File contents:") | ||
| print(content.decode('utf-8')) | ||
| ``` | ||
| > **Note:** Replace host, port, and path as per your setup. | ||
|
|
||
| ## Troubleshooting Tips | ||
|
|
||
| - **Authentication Errors**: Verify credentials and Kerberos tokens (if used). | ||
| - **Connection Issues**: Check endpoint URLs, ports, and firewall rules. | ||
| - **FileSystem Errors**: Ensure correct Ozone configuration and appropriate permissions. | ||
| - **Missing Dependencies**: Install required Python packages (`pip install pyarrow boto3 requests fsspec`). | ||
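For the missing-dependencies case, a quick check of which packages are importable can be sketched as follows. The function name `missing_packages` is hypothetical; it relies on the standard library's `importlib.util.find_spec` and is intended for top-level module names only.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["pyarrow", "boto3", "requests", "fsspec"])
    print("Missing:", ", ".join(missing) if missing else "none")
```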
|
|
||
| ## References and Further Resources | ||
|
|
||
| - [Apache Ozone Documentation](https://ozone.apache.org/docs/) | ||
| - [PyArrow Documentation](https://arrow.apache.org/docs/python/) | ||
| - [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) | ||
| - [fsspec WebHDFS Python API](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.webhdfs.WebHDFS) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| --- | ||
| title: Access Ozone using Boto3 (Docker Quickstart) | ||
| linkTitle: Boto3 Access (Docker) | ||
| description: Step-by-step tutorial for accessing Ozone from Python using Boto3 and the S3 Gateway in a Docker environment. | ||
| weight: 12 | ||
| --- | ||
|
|
||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| This tutorial demonstrates how to access Apache Ozone from Python using **Boto3**, via Ozone's S3 Gateway, with Ozone running in Docker. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Docker and Docker Compose installed. | ||
| - Python 3.x environment. | ||
|
|
||
| ## Steps | ||
|
|
||
| ### 1️⃣ Start Ozone in Docker | ||
|
|
||
| Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes: | ||
|
|
||
| ```bash | ||
| curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml | ||
| docker compose up -d --scale datanode=3 | ||
| ``` | ||
|
|
||
| ### 2️⃣ Connect to the SCM Container | ||
|
|
||
| ```bash | ||
| docker exec -it <your-scm-container-name-or-id> bash | ||
| ``` | ||
| > Replace `<your-scm-container-name-or-id>` with your actual container name or ID. | ||
|
|
||
| The rest of the tutorial will run on this container. | ||
|
|
||
| Create a **bucket** inside the volume **s3v**: | ||
|
|
||
| ```bash | ||
| ozone sh bucket create s3v/bucket | ||
| ``` | ||
|
|
||
| ### 3️⃣ Install Boto3 in Your Python Environment | ||
|
|
||
| ```bash | ||
| pip install boto3 | ||
| ``` | ||
|
|
||
| ### 4️⃣ Access Ozone via Boto3 and the S3 Gateway | ||
|
|
||
| Create a Python script (`ozone_boto3_example.py`) with the following content: | ||
|
|
||
| ```python | ||
| #!/usr/bin/python | ||
| import boto3 | ||
|
|
||
| # Create a local file to upload | ||
| with open("localfile.txt", "w") as f: | ||
|     f.write("Hello from Ozone via Boto3!\n") | ||
|
|
||
| # Configure Boto3 client | ||
| s3 = boto3.client( | ||
|     's3', | ||
|     endpoint_url='http://s3g:9878', | ||
|
Review comment: The Python script uses However, the tutorial implies installing Could you clarify where the user is expected to run this Python script? If on the host, the endpoint URL and the troubleshooting tip on line 113 (which mentions |
||
|     aws_access_key_id='ozone-access-key', | ||
|     aws_secret_access_key='ozone-secret-key' | ||
| ) | ||
|
|
||
| # List buckets | ||
| response = s3.list_buckets() | ||
| print("Buckets:", response['Buckets']) | ||
|
|
||
| # Upload the file | ||
| s3.upload_file('localfile.txt', 'bucket', 'file.txt') | ||
| print("Uploaded 'localfile.txt' to 'bucket/file.txt'") | ||
|
|
||
| # Download the file back | ||
| s3.download_file('bucket', 'file.txt', 'downloaded.txt') | ||
| print("Downloaded 'file.txt' as 'downloaded.txt'") | ||
| ``` | ||
|
|
||
| Run the script: | ||
|
|
||
| ```bash | ||
| python ozone_boto3_example.py | ||
| ``` | ||
|
|
||
| ✅ You have now accessed Ozone from Python using Boto3 and verified both upload and download operations. | ||
|
|
||
| ## Notes | ||
|
|
||
| - The S3 Gateway listens on port `9878` by default. | ||
| - The `Bucket` parameter in Boto3 calls (e.g., `'bucket'` in the script) should be the name of the Ozone bucket created under the `s3v` volume (i.e., the `bucket` part of `s3v/bucket`). The `s3v` volume itself is implicitly handled by the S3 Gateway. | ||
| - Make sure the S3 Gateway container (`s3g`) is up and running. You can check using `docker ps`. | ||
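The mapping described in the notes above, from an Ozone path under `s3v` to the `Bucket` and `Key` arguments Boto3 expects, can be sketched as a small helper. The function name `to_s3_bucket_and_key` is hypothetical and is not part of Boto3 or Ozone.

```python
def to_s3_bucket_and_key(ozone_path, s3_volume="s3v"):
    """Split 's3v/<bucket>/<key>' into the (Bucket, Key) pair Boto3 expects."""
    parts = ozone_path.strip("/").split("/", 2)
    if len(parts) < 2 or parts[0] != s3_volume:
        raise ValueError(f"expected a path under '{s3_volume}/', got {ozone_path!r}")
    bucket = parts[1]
    key = parts[2] if len(parts) == 3 else ""
    return bucket, key


print(to_s3_bucket_and_key("s3v/bucket/file.txt"))  # prints ('bucket', 'file.txt')
```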
|
|
||
| ## Troubleshooting Tips | ||
|
|
||
| - **Access Denied or Bucket Not Found**: Ensure that the bucket name exists and matches exactly (Ozone S3 Gateway uses flat bucket names). | ||
| - **Connection Refused**: Check that the S3 Gateway container is running and accessible at the specified endpoint. | ||
| - **Timeout or DNS Issues**: If you adapt this script to run from your host machine (outside Docker), you might need to replace `s3g:9878` with `localhost:9878` (assuming default port mapping). | ||
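To tell a connection problem apart from an S3-level error, a plain TCP check against the endpoint can help. This is a sketch using only the standard library; the helper names `host_and_port` and `endpoint_reachable` are hypothetical, and a successful connection shows only that the port is open, not that the S3 Gateway is healthy.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(endpoint_url):
    """Extract (host, port) from an endpoint URL, defaulting the port by scheme."""
    parts = urlsplit(endpoint_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def endpoint_reachable(endpoint_url, timeout=3.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    try:
        with socket.create_connection(host_and_port(endpoint_url), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Usage:
#   endpoint_reachable("http://s3g:9878")        # from inside the Docker network
#   endpoint_reachable("http://localhost:9878")  # from the host, default port mapping
```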
|
|
||
| ## References | ||
|
|
||
| - [Apache Ozone Docker](https://github.com/apache/ozone-docker) | ||
| - [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) | ||
| - [Ozone S3 Docs](https://ozone.apache.org/docs/edge/interface/s3.html) | ||
| - [Ozone Securing S3 Docs](https://ozone.apache.org/docs/edge/security/securings3.html) | ||
| - [Ozone Client Interfaces](https://ozone.apache.org/docs/edge/interface.html) | ||
Review comment: For accessing WebHDFS via `fsspec`, simply `pip install fsspec` might not be sufficient as `webhdfs` is often an optional dependency. Could you clarify if a more specific installation like `pip install fsspec[webhdfs]` or installing a separate `webhdfs3` (which `fsspec` often uses) package is required for the `fsspec.implementations.webhdfs` example to work out of the box? This would help users avoid potential `ModuleNotFoundError` or similar issues.