Using Python

This page explains how to transfer data using Python scripting.

Data in Yoda is not directly accessible: you first have to download it to the machine that runs your analysis software. If you already do your analysis with Python scripts, for example on Snellius or Ada, it can be useful to script the data access and transfer as well.

Python iRODS Client

The Python iRODS Client (PRC) is the default way to access data in iRODS programmatically.

Install

pip install python-irodsclient

Setting up a session to access Yoda

The easiest way to set up a session to Yoda is to use the information in the iRODS environment file.

The code below sets up a session using all the correct settings for Yoda:

import json
from irods.session import iRODSSession
from pathlib import Path
from getpass import getpass
import ssl

def get_irods_environment(irods_environment_file):
    """Reads the irods_environment.json file, which contains the environment configuration."""

    print(
        f"Trying to retrieve connection settings from: {irods_environment_file}"
    )

    try:
        with open(irods_environment_file, "r") as f:
            return json.load(f)
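    except (OSError, json.JSONDecodeError) as err:
        # Stop with a non-zero exit status and say why the file could not be used.
        raise SystemExit(f"Could not read {irods_environment_file}: {err}")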

def setup_session(ca_file='/etc/ssl/certs/ca-certificates.crt'):
    """Use the iRODS environment file to configure an iRODSSession. The user is prompted for the password."""

    irods_env = get_irods_environment(f"{Path.home()}/.irods/irods_environment.json")

    password = getpass(f"Enter valid DAP for user {irods_env['irods_user_name']}: ")

    ssl_context = ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH, cafile=ca_file, capath=None, cadata=None
    )

    ssl_settings = {
        "client_server_negotiation": "request_server_negotiation",
        "client_server_policy": "CS_NEG_REQUIRE",
        "encryption_algorithm": "AES-256-CBC",
        "encryption_key_size": 32,
        "encryption_num_hash_rounds": 16,
        "encryption_salt_size": 8,
        "ssl_context": ssl_context,
    }

    session = iRODSSession(
        host=irods_env["irods_host"],
        port=irods_env["irods_port"],
        user=irods_env["irods_user_name"],
        password=password,
        zone=irods_env["irods_zone_name"],
        authentication_scheme="pam_password",
        **ssl_settings,
    )

    return session

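session = setup_session()

# workload: list the collections under /<zone>/home
home = session.collections.get(f"/{session.zone}/home")
for subcollection in home.subcollections:
    print(subcollection.name)

# release the connections held by the session
session.cleanup()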
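
The session code above reads four keys from the environment file: irods_host, irods_port, irods_user_name and irods_zone_name. As a minimal sketch, the check below reports which of these keys are missing before any connection is attempted. The helper name missing_keys and all values in the demonstration file are placeholders made up for this example, not part of the PRC.

```python
import json
import tempfile
from pathlib import Path

# Keys that the setup_session() code above reads from the environment file.
REQUIRED_KEYS = ("irods_host", "irods_port", "irods_user_name", "irods_zone_name")


def missing_keys(env_file):
    """Return the required keys that are absent from the given environment file."""
    with open(env_file, "r", encoding="utf-8") as f:
        env = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in env]


# Demonstration with a temporary file; every value below is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "irods_environment.json"
    demo_file.write_text(
        json.dumps(
            {
                "irods_host": "yoda.example.org",
                "irods_port": 1247,
                "irods_user_name": "user@example.org",
            }
        ),
        encoding="utf-8",
    )
    demo_missing = missing_keys(demo_file)

print(demo_missing)
```

In a real script you would call missing_keys on your own environment file and stop with a clear message if the returned list is not empty.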

More information

You can find more information on using the iRODS client in the README on GitHub.

iBridges

The PRC can be hard to use, because it requires some prior knowledge of the structure and terminology used in iRODS. For this reason, developers at Utrecht University created iBridges, which makes it easier to do basic file and metadata manipulation in iRODS.

Installation

Installation is again as simple as:

pip install ibridges

Connecting

To connect, you will need the iRODS environment file. iBridges expects the file to be in ~/.irods/irods_environment.json, but you can point it to a different location.

from ibridges import Session
from pathlib import Path
from getpass import getpass

password = getpass("Enter valid DAP: ")
session = Session(irods_env_path=Path.home() / ".irods" / "irods_environment.json", password=password)
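
If you keep environment files in more than one place, it can help to decide in one spot which file is used. The sketch below is one possible way to do that, not something iBridges does for you: the helper resolve_env_path is hypothetical, and reading the IRODS_ENVIRONMENT_FILE environment variable here is an assumption of this example.

```python
import os
from pathlib import Path

# Default location mentioned above.
DEFAULT_ENV_PATH = Path.home() / ".irods" / "irods_environment.json"


def resolve_env_path(explicit_path=None):
    """Pick the environment file: explicit argument, then env variable, then default."""
    if explicit_path is not None:
        return Path(explicit_path).expanduser()
    from_env = os.environ.get("IRODS_ENVIRONMENT_FILE")  # assumption of this sketch
    if from_env:
        return Path(from_env).expanduser()
    return DEFAULT_ENV_PATH


# The path below is a placeholder.
env_path = resolve_env_path("/data/project/irods_environment.json")
print(env_path)
```

The resulting path can then be passed to Session in the same way as the path in the example above.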

Upload data

You can easily upload your data with the previously created session:

from ibridges import upload

upload(session, "/your/local/path", "/irods/path")

This upload function can upload both directories (collections in iRODS) and files (data objects in iRODS).

Add iRODS metadata

One of the powerful features of iRODS is its ability to store metadata with your data in a consistent manner. Let’s add some metadata to a collection or data object:

from ibridges import IrodsPath

ipath = IrodsPath(session, "/irods/path")
ipath.meta.add("some_key", "some_value", "some_units")

We have used the IrodsPath class here, which is another class central to the iBridges API. Besides the metadata access shown above, it offers many other convenient features, such as getting the size of a collection or data object. A detailed description of these features can be found in another part of the documentation.
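
When several metadata entries belong to the same object, the add call shown above can be repeated in a loop. The sketch below assumes only that call, meta.add(key, value, units). The helper add_metadata and the two stand-in classes are made up for this example so that it can be tried without a connection, and the keys and values are placeholders.

```python
def add_metadata(ipath, entries):
    """Add (key, value, units) triples to the object behind ipath; return how many were added."""
    count = 0
    for key, value, units in entries:
        # Same call as in the example above.
        ipath.meta.add(key, value, units)
        count += 1
    return count


class _RecordingMeta:
    """Stand-in for ipath.meta that only records the calls it receives."""

    def __init__(self):
        self.added = []

    def add(self, key, value, units):
        self.added.append((key, value, units))


class _FakePath:
    """Stand-in for an IrodsPath; in real use pass IrodsPath(session, "/irods/path")."""

    def __init__(self):
        self.meta = _RecordingMeta()


fake_path = _FakePath()
n_added = add_metadata(
    fake_path,
    [("sample_count", "12", "samples"), ("temperature", "21", "celsius")],
)
print(n_added, fake_path.meta.added)
```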

Download data

Naturally, we also want to download the data back to our local machine. This is done with the download function:

from ibridges import download

download(session, "/irods/path", "/other/local/path")

Closing the session

When you are done, close the session so that its connection to the server is released:

session.close()
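
If a transfer fails halfway, the close call may never be reached. One way to guard against that, sketched here with the standard library only, is contextlib.closing, which calls close() on leaving the block even when an exception is raised. The _DemoSession class is a stand-in made up for this example; the only thing it shares with a real session is the close() method shown above.

```python
from contextlib import closing


class _DemoSession:
    """Stand-in that offers the same close() method as the session above."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


demo_session = _DemoSession()
try:
    with closing(demo_session) as active_session:
        # Any upload or download work would go here.
        raise RuntimeError("simulated failure during transfer")
except RuntimeError:
    pass

# close() was called although the block raised an exception.
print(demo_session.closed)
```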

More information

More information on using iBridges can be found in the online documentation.

Streaming

The python-irodsclient, on which iBridges is built, lets us open a data object as a stream and process its content without first downloading it to a local file. This is especially useful when the data is stored in large files. For textual data this works without any problems.

from ibridges import IrodsPath

obj_path = IrodsPath(session, "path", "to", "object")
with obj_path.open('r') as stream:
    content = stream.read().decode()
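
Calling read() without arguments loads the whole object into memory, which may not be what you want for large files. As a minimal sketch, the function below reads a stream in fixed-size chunks and computes a SHA-256 digest along the way. It assumes the stream behaves like a binary file object with a read(size) method; io.BytesIO stands in for the iRODS stream here so the example runs without a connection, and the helper name sha256_of_stream is made up for this example.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary file-like object chunk by chunk; return (hex digest, bytes read)."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty result means the end of the stream
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


# Stand-in data; with iBridges the stream would come from obj_path.open('r').
demo_bytes = b"line one\nline two\n"
with io.BytesIO(demo_bytes) as demo_stream:
    demo_digest, demo_size = sha256_of_stream(demo_stream, chunk_size=4)

print(demo_size, demo_digest)
```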

Some Python libraries can read directly from such a stream. Examples are pandas and polars for data files, and whisper for transcription and translation of audio files.

import pandas as pd

with obj_path.open('r') as stream:
    df = pd.read_csv(stream)

print(df)
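
When a file is too large to hold in a single DataFrame, the rows can also be processed one at a time with the standard library. The sketch below wraps a binary stream in io.TextIOWrapper and sums one column with csv.DictReader. It assumes the iRODS stream is a binary file-like object that TextIOWrapper accepts; io.BytesIO is used as a stand-in, and the helper sum_column and the column names are made up for this example.

```python
import csv
import io


def sum_column(binary_stream, column):
    """Sum a numeric CSV column row by row from a binary file-like object."""
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    total = 0.0
    try:
        for row in csv.DictReader(text_stream):
            total += float(row[column])
    finally:
        # Detach so that the caller stays in charge of closing the binary stream.
        text_stream.detach()
    return total


# Stand-in data; with iBridges the stream would come from obj_path.open('r').
demo_csv = io.BytesIO(b"sample,weight\na,1.5\nb,2.5\n")
demo_total = sum_column(demo_csv, "weight")
print(demo_total)
```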