Hydra Cluster: Containers
Introduction
Containers let you run a pre-built environment within the cluster. Docker is one of the most popular container platforms, but for security reasons it isn't suitable for running on a multi-user HPC system. Instead, we've installed Apptainer (formerly Singularity), which is designed for this use case and can also import Docker containers.
You can run Apptainer directly on the Hydra login node for setup and testing, or through srun/sbatch like any other job.
For further information on getting started, please consult the Apptainer Quick Start Guide.
Using Docker containers
Apptainer can download and convert Docker containers for local use. Here's a simple example:
tdb@hydra:~$ apptainer run docker://hello-world
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob 2db29710123e done
Copying config 811f3caa88 done
Writing manifest to image destination
Storing signatures
2022/12/08 14:58:37 info unpack layer: sha256:2db29710123e3e53a794f2694094b9b4338aa9ee5c40b930cb8063a1be392c54
INFO: Creating SIF file...
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
tdb@hydra:~$
Be aware that these images are cached in your home directory under .apptainer/cache, so they will take up space if you don't clear them out when they are no longer needed.
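As a rough sketch of how you might keep an eye on the cache, the commands below report how much space it uses and list what it holds. This assumes the default cache location given above, unless the `APPTAINER_CACHEDIR` environment variable points elsewhere; the variable name `cache_dir` is just for illustration.

```shell
# Locate the cache: the default is ~/.apptainer/cache unless
# APPTAINER_CACHEDIR has been set to something else.
cache_dir="${APPTAINER_CACHEDIR:-$HOME/.apptainer/cache}"

# Show the total disk space used by the cache, if it exists yet.
if [ -d "$cache_dir" ]; then
    du -sh "$cache_dir"
else
    echo "No cache directory at $cache_dir"
fi

# List the cached images when Apptainer is available on this machine.
if command -v apptainer >/dev/null 2>&1; then
    apptainer cache list
    # To remove cached images once you no longer need them:
    # apptainer cache clean
fi
```

The clean step is left commented out so that nothing is deleted by accident. Images removed from the cache are simply downloaded and converted again the next time they are needed.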
NVIDIA Containers
NVIDIA provides containers for TensorFlow and PyTorch that may be useful. These are built and tuned by NVIDIA to include the libraries and tools needed for each of those applications. They can be found in the NGC Catalog.
If you've followed the example on the TensorFlow page, you can run the following to test it using the NVIDIA TensorFlow container. Note that the first run will take a while to download the image and will use up a bit of disk space. Adjust the version numbers as needed. The output is trimmed here for brevity.
tdb@hydra:~$ srun -p gpu --mem 10g --gres gpu:1 apptainer run docker://nvcr.io/nvidia/tensorflow:22.11-tf2-py3 python3 tftest.py
srun: job 254067 queued and waiting for resources
srun: job 254067 has been allocated resources
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:6b08e0981273a337a2162ece566ce3f03b653a83d2b4478c4a493e7b731d3e11
...
Copying blob sha256:e953878b44f04be77d59a6b40ecc020caffe9e18fe5212c786623bf8c4c75cad
Copying config sha256:1fc9923bcb83fa7a115228046cde58997338419927f3da79822a0a98cc5a5435
Writing manifest to image destination
Storing signatures
2022/12/08 15:32:30 info unpack layer: sha256:eaead16dc43bb8811d4ff450935d607f9ba4baffda4fc110cc402fa43f601d83
...
2022/12/08 15:35:36 info unpack layer: sha256:e953878b44f04be77d59a6b40ecc020caffe9e18fe5212c786623bf8c4c75cad
INFO: Creating SIF file...
13:4: not a valid test operator: (
13:4: not a valid test operator: 515.48.07
================
== TensorFlow ==
================
NVIDIA Release 22.11-tf2 (build 48527487)
TensorFlow Version 2.10.0
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
Failed to detect NVIDIA driver version.
2022-12-08 15:48:48.767079: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-08 15:48:50.650381: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-12-08 15:49:01.579561: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-08 15:49:01.784920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78889 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
tf.Tensor(-48.660248, shape=(), dtype=float32)
tdb@hydra:~$
Subsequent runs will be much quicker because the cached container will be used.
This only works on the enterprise-grade cards, so you should use either the pascal or ampere GPU types.
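For longer runs, the same job could be submitted through sbatch instead of srun. The following is a minimal sketch that writes a batch script mirroring the srun options shown above. The file name `tf-container.sbatch` and the output file pattern are illustrative choices, not site conventions, and the script does not pick a particular GPU type.

```shell
# Write a batch script equivalent to the interactive srun command.
# The quoted 'EOF' stops the shell from expanding anything in the script.
cat > tf-container.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --mem=10g
#SBATCH --gres=gpu:1
#SBATCH --output=tf-container-%j.out

# Same container image and test script as the srun example
apptainer run docker://nvcr.io/nvidia/tensorflow:22.11-tf2-py3 python3 tftest.py
EOF

# Submit it when ready:
# sbatch tf-container.sbatch
```

Slurm replaces `%j` in the output file name with the job ID, so each run writes to its own log file.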
Accessing your files within a container
Your files are automatically mounted within the container, and you have the same user ID as you normally do on Hydra.
tdb@hydra:~$ apptainer run docker://ubuntu
INFO: Using cached SIF image
INFO: underlay of /etc/localtime required more than 50 (68) bind mounts
Apptainer> whoami
tdb
Apptainer> ls -l tftest.py
-rwxr-xr-x 1 tdb cur 356 Dec 8 15:10 tftest.py
Apptainer> exit
tdb@hydra:~$
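The same checks can be run without an interactive shell, which may be handier inside a batch job. This sketch uses `apptainer exec` to run single commands in the same Ubuntu image; the `image` variable is only there for illustration, and the fallback message is for machines where Apptainer is not installed.

```shell
# Image to run the commands in (the same one as the interactive session).
image="docker://ubuntu"

if command -v apptainer >/dev/null 2>&1; then
    # Run one command per invocation instead of opening a shell.
    apptainer exec "$image" whoami
    # The home directory is visible inside the container.
    apptainer exec "$image" ls -l "$HOME"
else
    echo "apptainer not found on this machine"
fi
```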
If you have any questions, or useful examples that we can add, please feel free to contact us.