Swift Object Storage#
Cloud Storage Service
Object Storage from the UZH Science Cloud.
Accessing Object Storage#
Interaction with OpenStack Object Storage (Swift or Ceph RGW) happens through RESTful API calls. To construct these requests, two pieces of information must be retrieved from the OpenStack environment: the public endpoint URL of the storage gateway and a valid authentication token.
Endpoint Retrieval & Authentication#
Assuming the terminal session is authenticated via `source project-openrc.sh`, the required parameters can be extracted with the OpenStack CLI and a JSON parsing utility such as `jq`.
- Endpoint URL (`RGW_URL`): The storage service exposes several endpoints (public, internal, admin); the public URL is extracted from the service catalog.
- Authentication Token (`RGW_TOKEN`): A temporary Keystone token must be issued to authorize the HTTP requests.
```shell
# Extract the public endpoint URL specifically
export RGW_URL=$(openstack catalog show swift -f json -c endpoints \
  | jq --raw-output '.endpoints[] | select(.interface=="public") | .url' \
  | head -n 1)

# Extract the temporary token string
export RGW_TOKEN=$(openstack token issue -f value -c id)
```
Token Expiration
Authentication tokens issued by Keystone are temporary and typically expire after one hour. Once a token has expired, API requests return `401 Unauthorized` and a new token must be issued.
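For long-running jobs it can be worth checking the token's expiry timestamp up front instead of waiting for a 401. A minimal sketch, assuming GNU `date`; the expiry value below is a placeholder, where in practice it would come from `openstack token issue -f value -c expires`:

```shell
# Placeholder expiry timestamp; in practice:
#   token_expires=$(openstack token issue -f value -c expires)
token_expires="2030-01-01T00:00:00+0000"

now=$(date -u +%s)
exp=$(date -u -d "${token_expires}" +%s)   # GNU date syntax

if [ "${now}" -ge "${exp}" ]; then
    echo "token expired - issue a new one with 'openstack token issue'"
else
    echo "token still valid"
fi
```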
Interface Methodologies#
Once the endpoint and token are in place, there are two primary ways to interact with the storage: calling the native Swift API directly, or using OpenStack’s S3 compatibility layer with standard S3 tooling.
Native Swift API (curl)#
The most basic approach is to issue raw HTTP requests directly against the Swift API with curl, passing the temporary `X-Auth-Token` retrieved previously. This method is transparent and requires no additional software, but it becomes cumbersome for recursive operations or large multipart uploads.
```shell
# List all containers (buckets) within the project
curl -i -X GET "${RGW_URL}" \
  -H "X-Auth-Token: ${RGW_TOKEN}"

# Upload a local file to a specific container
curl -i -X PUT "${RGW_URL}/mybucket/myfile.txt" \
  -H "X-Auth-Token: ${RGW_TOKEN}" \
  --data-binary "@./myfile.txt"

# Remove an object from the storage
curl -i -X DELETE "${RGW_URL}/mybucket/myfile.txt" \
  -H "X-Auth-Token: ${RGW_TOKEN}"
```
S3 Compatibility API (s3cmd)#
Modern OpenStack deployments also provide an S3-compatible API layer, which makes it possible to use established S3 clients such as `s3cmd`.
However, the S3 protocol does not use temporary Keystone tokens; it relies on static Access and Secret keys. To use the S3 compatibility layer, AWS EC2-style credentials must first be generated within the OpenStack project.
```shell
# Generate standard S3-compatible access credentials
ACCESS_KEY=$(openstack ec2 credentials create -f value -c access)
SECRET_KEY=$(openstack ec2 credentials show "${ACCESS_KEY}" -f value -c secret)
```
These credentials, together with the endpoint's host name (the `RGW_URL` stripped of its `https://` prefix, denoted here as `RGW_HOST`), are then written into the standard s3cmd configuration file (`~/.s3cfg`).
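The host name can be derived from the endpoint URL with plain shell parameter expansion; the URL below is an illustrative placeholder, where in practice `RGW_URL` comes from the catalog lookup above:

```shell
# Illustrative value; in practice RGW_URL comes from the catalog lookup
RGW_URL="https://objects.example.org"

# Strip the scheme to obtain the bare host
export RGW_HOST="${RGW_URL#https://}"
echo "${RGW_HOST}"   # objects.example.org
```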
```ini
# ~/.s3cfg
[default]
host_base = ${RGW_HOST}
host_bucket = ${RGW_HOST}
access_key = ${ACCESS_KEY}
secret_key = ${SECRET_KEY}
use_https = True
```
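The variable substitution can be scripted with a heredoc so that the shell expands the placeholders when the file is written. A sketch, using example credential values in place of the ones generated above:

```shell
# Example values; in practice these come from the steps above
RGW_HOST="objects.example.org"
ACCESS_KEY="AKIAEXAMPLE"
SECRET_KEY="secretexample"

# An unquoted heredoc delimiter lets the shell expand ${...} into the file
cat > ~/.s3cfg <<EOF
[default]
host_base = ${RGW_HOST}
host_bucket = ${RGW_HOST}
access_key = ${ACCESS_KEY}
secret_key = ${SECRET_KEY}
use_https = True
EOF
```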
Once configured, the storage can be accessed with standard S3 commands (e.g., `s3cmd ls s3://mybucket`).
Configuration Overhead Mitigation
The prerequisite steps of generating EC2 credentials, parsing endpoints, and writing a `.s3cfg` file add noticeable setup friction, particularly in automated CI/CD pipelines or reproducible compute jobs. To reduce this overhead, the entire authentication and s3cmd execution sequence can be encapsulated in an orchestration container.
A reference implementation of this containerized abstraction is available here: GitHub: pSciComp/s3cmdContainer
Interfacing with OpenStack object storage is done via two primary protocols: the native OpenStack Swift REST API or the Amazon S3-compatible API (typically provided by the Ceph RADOS Gateway or Swift3 middleware). The choice of interface dictates the required tooling and authentication mechanism.
Available Interfacing Tools#
A variety of open-source utilities and libraries are available, ranging from low-level HTTP clients to high-level data synchronization frameworks:
Command-Line Interfaces (CLIs):
- `curl`: The universal standard for raw HTTP communication. It interacts directly with the native Swift REST API using temporary Keystone tokens. It requires no additional dependencies, making it ideal for minimal, reproducible shell scripts.
- `s3cmd` & `mc` (MinIO Client): Dedicated CLI utilities designed for S3-compatible endpoints. They require EC2-style static credentials (Access/Secret keys) rather than temporary tokens and use the standard `s3://` URI scheme.
- `rclone`: A robust synchronization utility that supports multiple cloud storage backends. It is optimized for large-scale data transfers, directory mirroring, and checksum-based integrity validation.
Programmatic Integration:
- `boto3`: The standard Python SDK for S3. It allows storage operations to be embedded directly within analytical pipelines or application code.
- Data science libraries (e.g., `pandas`): High-level libraries often support direct ingestion of remote data. For instance, a CSV or Parquet file can be read directly into a DataFrame from an S3 endpoint by passing the appropriate storage options and URI, bypassing local intermediate storage.
Exercise Execution Paths#
For execution and automation within reproducible environments, two distinct methodologies are highlighted:
1. Direct API Execution (curl)#
This approach uses the native Swift API. Operations are executed as standard HTTP requests (GET, PUT, DELETE), and authentication is handled by passing the temporary OpenStack token (`$OS_AUTH_TOKEN`) in the `X-Auth-Token` HTTP header.
The target resource is identified by appending the container name and object key directly to the storage endpoint URL (e.g., `"$OS_STORAGE_URL/mycontainer/myfile.txt"`). This method works on any standard POSIX system without installing specialized client software.
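The pattern can be wrapped in a small helper function so the token header is not repeated in every call. A sketch, where `swift_api` is an illustrative name and `$OS_STORAGE_URL`/`$OS_AUTH_TOKEN` are assumed to be set by the environment:

```shell
# Minimal wrapper around curl for the Swift API.
# Usage: swift_api METHOD /container/object [extra curl args...]
swift_api() {
    local method="$1" path="$2"
    shift 2
    curl -sS -X "${method}" "${OS_STORAGE_URL}${path}" \
        -H "X-Auth-Token: ${OS_AUTH_TOKEN}" "$@"
}

# Examples:
#   swift_api GET    /mycontainer
#   swift_api PUT    /mycontainer/myfile.txt --data-binary @./myfile.txt
#   swift_api DELETE /mycontainer/myfile.txt
```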
2. Containerized S3 Compatibility (s3cmdc)#
When the simpler S3 protocol is preferred, the `s3cmd` utility can be used. To keep the environment clean and avoid global package installations on the host system, a containerized wrapper around s3cmd (referred to here as `s3cmdc`) is recommended.
The container encapsulates the software dependencies and the `.s3cfg` configuration file, so standard S3 commands (e.g., `s3cmdc put local_file s3://bucket/remote_file`) can be executed on demand while keeping the workflow reproducible.
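One possible shape for such a wrapper is a shell function that delegates to a container runtime. This is a sketch only; the image name below is a placeholder, not necessarily the one used by the linked reference implementation:

```shell
# Hypothetical wrapper: runs s3cmd inside a container, mounting the
# config file read-only and the current directory as the working dir.
# "example/s3cmd" is a placeholder image name.
s3cmdc() {
    docker run --rm \
        -v "${HOME}/.s3cfg:/root/.s3cfg:ro" \
        -v "${PWD}:/work" -w /work \
        example/s3cmd "$@"
}

# Examples:
#   s3cmdc ls s3://mybucket
#   s3cmdc put myfile.txt s3://mybucket/myfile.txt
```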