Installing and Running TensorFlow on the HPC Clusters

TensorFlow is one of the most popular software packages for training deep learning models. This tutorial explains how to install TensorFlow on the HPC clusters (TigerGpu and Adroit) and run TensorFlow jobs using the Slurm scheduler.


Installing and Running ‘mpi4py’ on the Cluster

Installing mpi4py

First, don’t use conda install mpi4py. This will install its own version of MPI instead of using one of the versions that already exist on the cluster. It will work, but it will be cripplingly slow.

The proper way to install mpi4py is to use pip together with one of the MPI versions that already exist on the cluster.
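
Once mpi4py is installed, it can be useful to confirm which MPI library it was actually built against. The following is a minimal sketch, not part of the original instructions: the helper name describe_mpi is made up for illustration, and it relies only on mpi4py's MPI.Get_library_version() call. If the output names the cluster's MPI rather than one bundled by conda, the install most likely picked up the intended library.

```python
def describe_mpi():
    """Return a one-line description of the MPI library mpi4py is linked against."""
    try:
        # Importing mpi4py.MPI loads (and initializes) the underlying MPI library.
        from mpi4py import MPI
    except ImportError:
        return "mpi4py could not be imported in this environment"
    # Get_library_version() reports the MPI implementation and its version.
    version = MPI.Get_library_version().strip()
    first_line = version.splitlines()[0] if version else "unknown MPI library"
    return "mpi4py is linked against: " + first_line


if __name__ == "__main__":
    print(describe_mpi())
```

Run the script inside the same environment, and with the same MPI module loaded, that you used for the pip install.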

What follows are step-by-step instructions on how to set up mpi4py on the Adroit cluster. The steps are similar for any other cluster that uses modules (e.g. Della or Tigercpu).

Continue reading Installing and Running ‘mpi4py’ on the Cluster

How to Run Multiple Serial Programs as a Single SLURM Job

Job arrays are a feature that automates the process of submitting and managing a collection of similar jobs in SLURM. Using a job array, we can put a whole set of jobs in the queue with a single SLURM sbatch command. For example, if we have two similar jobs that we want to run, we can submit them to the queue using a job array, and the SLURM queue will look as follows:
Continue reading How to Run Multiple Serial Programs as a Single SLURM Job
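
As a rough illustration of how the tasks of a job array can each pick their own piece of work, here is a small Python sketch. It assumes the array was submitted with something like sbatch --array=0-1, in which case SLURM sets the SLURM_ARRAY_TASK_ID environment variable for each task. The function name pick_input and the file names are hypothetical and chosen only for this example.

```python
import os


def pick_input(inputs, env=None):
    """Return the item that the current array task should process."""
    env = os.environ if env is None else env
    # SLURM sets SLURM_ARRAY_TASK_ID for every task of a job array.
    # Default to 0 so the script also runs outside of SLURM.
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(inputs):
        raise IndexError(f"task id {task_id} has no matching input")
    return inputs[task_id]


if __name__ == "__main__":
    # Hypothetical input files, one per array task (e.g. --array=0-1).
    files = ["input_0.dat", "input_1.dat"]
    print("Processing", pick_input(files))
```

Every task runs the same script, and the task ID alone decides which input it handles, so the array indices should match the positions in the list.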

Jupyter on the Cluster

Running Jupyter on your local machine is straightforward, but sometimes you need more computational resources, which might mean hosting your work on a remote computer. This article will show you how to run Jupyter on a remote machine through an ssh tunnel so that you can interact with it in your local web browser.
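
To make the idea of the tunnel concrete, the sketch below builds (but does not run) an ssh port-forwarding command. It is only an illustration under several assumptions: Jupyter is already listening on the remote machine on port 8889, the user and host names are placeholders, and the helper tunnel_command is invented for this example. The -N and -L flags are standard OpenSSH options.

```python
import shlex


def tunnel_command(user, host, local_port=8889, remote_port=8889):
    """Build an ssh command that forwards local_port to remote_port on host."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Placeholder user and host names; replace them with your own.
    cmd = tunnel_command("netid", "cluster.example.edu")
    print(shlex.join(cmd))
```

With such a tunnel open, pointing the local browser at localhost on the chosen local port should reach the remote Jupyter server.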

Continue reading Jupyter on the Cluster