Hydra Cluster: Conda¶
This document was contributed by Lisa Bonheme.
Using specific versions of Python, TensorFlow and PyTorch¶
Here are the steps to set up a Conda configuration for specific versions of Python, CUDA and cuDNN. This is quite useful when using a specific version of TensorFlow or PyTorch that requires CUDA and cuDNN versions that are not installed on the cluster, or for running scripts that are not compatible with the version of Python installed on the cluster.
Any scripts created following these steps should work across different nodes and Python versions, as they will rely on the versions specified inside the Conda environment.
Step 1: Prepare the miniconda venv¶
Install miniconda3¶
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
This will start the conda installer. You'll need to accept the license agreement and choose a location to install in. At the end, when prompted, say yes to running conda init.
Log out and back in for conda to be activated.
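Alternatively, if you prefer not to log out, you can load conda into the current shell by sourcing its setup script (this assumes Miniconda was installed in the default location, ~/miniconda3, as used later in this guide):
source ~/miniconda3/etc/profile.d/conda.sh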
Update the Conda installation¶
Whether you have an existing installation or a fresh one, it's worth updating Conda before proceeding. This will download any newer packages that have been made available.
conda update conda
Create a Conda environment¶
Create a conda environment with Python 3.9. A different Python version can be used instead.
conda create --name myenv python=3.9
Activate the Conda environment¶
This makes the environment active for the current session.
conda activate myenv
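To confirm that the environment is active and that its own interpreter is being picked up, you can check which Python is in use:
which python
python --version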
Step 2: Install packages¶
TensorFlow 2.12¶
Install CUDA and cuDNN for TensorFlow 2.12. The required packages may differ for other TensorFlow versions; check https://www.tensorflow.org/install/pip for more details.
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
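Once TensorFlow is installed, a quick way to check that it can see a GPU is the snippet below (as used in the standard TensorFlow install checks). Note that on the cluster it will only report a GPU when run on a GPU node, e.g. via sbatch as described in Step 3:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"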
PyTorch 2¶
To use PyTorch 2 instead, install CUDA with the command below. The required packages may differ for other PyTorch versions; check https://pytorch.org/get-started/locally/ for more details.
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
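Similarly, you can check that PyTorch detects CUDA with a one-liner; again, this will only report True when executed on a GPU node:
python3 -c "import torch; print(torch.cuda.is_available())"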
Install ncurses¶
This needs to be installed to prevent the error: libtinfo.so.6:
no version information available (required by /bin/bash)
when
launching scripts on the cluster.
conda install -c conda-forge ncurses
Other packages¶
If you want access to additional non-Python packages, they can be installed from the Conda package manager.
For example, to install Octave:
conda install -c conda-forge octave
Using requirements.txt¶
Install packages, e.g. from requirements.txt
pip install -r requirements.txt
Check packages¶
Check that everything is installed properly. This should list all the installed packages.
pip freeze
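If you only want to confirm that a particular package is present rather than scanning the whole list, you can filter the output, for example:
pip freeze | grep -i tensorflow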
Set up custom paths¶
Create a directory where any custom config will be stored:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
Export the environment variables that should be set whenever the environment is activated:
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib" >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
You can add any other environment variables that will always be used in this conda environment to this env_vars.sh file.
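For example, to add a variable of your own (the name and value here are just placeholders; the single quotes mean it is expanded when the environment is activated):
echo 'export MY_DATA_DIR=$HOME/data' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh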
Step 3: Use conda with scripts run on the cluster¶
Create a script¶
Create a bash script, for example test.sh, containing the following. Here the SBATCH flags will export any environment variables that you have and send you an email if your script fails.
#!/bin/bash
#SBATCH --mail-type=FAIL
#SBATCH --export=ALL
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv
# Launch your Python command below; this example forwards the script's arguments to pip
pip "$@"
Make sure the script is executable:
chmod +x test.sh
Launch the script with sbatch¶
This will execute pip freeze on the gpu partition with 1 GPU and 15G of RAM allocated.
sbatch -p gpu --gres=gpu:1 --mem=15G test.sh freeze
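You can then monitor the job and inspect its output; by default Slurm writes the job's output to a slurm-<jobid>.out file in the directory you submitted from:
squeue -u $USER
cat slurm-<jobid>.out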
Development¶
You can now work on your Python code, using a variation on test.sh to run it on the cluster, whilst making use of the Python version and tools installed within the environment that you've created.
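As a minimal sketch of such a variation, assuming your code lives in a (hypothetical) my_script.py, the script could look like this:
#!/bin/bash
#SBATCH --mail-type=FAIL
#SBATCH --export=ALL
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv
# Run your own Python code instead of pip; my_script.py is just a placeholder
python my_script.py "$@"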
As usual, please contact us with any queries.