HPC Machine Learning Tutorial

Objective

The objective of this tutorial is to familiarize the user with some of the nuances of the WAVE HPC using a basic machine learning example, including:

  • Working with external datasets
  • Working with pre-installed modules
  • Using Slurm to access compute nodes

This tutorial is an adaptation of a tutorial from TensorFlow.org.

To run this tutorial, it is assumed that you already have access to the WAVE HPC with a user account and the ability to open a terminal session on one of the login nodes in the WAVE cluster. See WAVE HPC User Guide - Accessing the HPC if you require help on accessing the HPC.

Download the dataset into a local filesystem

Let's start by downloading the MNIST dataset to a local directory where you have access.

mkdir -p /WAVE/<path to your own dataset directory>/mnist
wget -O /WAVE/<path to your own dataset directory>/mnist/mnist.npz https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
ls -l /WAVE/<path to your own dataset directory>/mnist/

Note: In this tutorial, you will need to substitute the appropriate project and dataset subdirectories. A possible workaround, if you are just getting started, is to replace the dataset path directory with a directory where you have access to store both your dataset and project. More information can be found in the section Managing Files.

At this point, you should have a local copy of the external data.

[<username>@login2 ~]$ ls -l /WAVE/datasets/<your dataset directory>/mnist/
total 11224
-rw-rw-r--. 1 <username> <group> 11490434 May 30  2018 mnist.npz
[<username>@login2 ~]$
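Beyond checking the file size, the archive can be opened with NumPy to confirm that it is readable and holds the expected arrays. The following is a minimal sketch, not part of the original tutorial: the helper name `summarize_npz` is hypothetical, and it assumes NumPy is available in your Python environment (for example, after loading the TensorFlow module described later in this tutorial).

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    # np.load opens the archive lazily; the context manager closes the file.
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Example call (placeholder path, substitute your own dataset directory):
# print(summarize_npz('/WAVE/<path to your own dataset directory>/mnist/mnist.npz'))
```

For the MNIST archive, the result should list four arrays named x_train, y_train, x_test and y_test. An error at this step would suggest a truncated or corrupted download.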

Establish a projects folder

While we are at it, let's establish a subdirectory within our projects folder to hold our working files and switch to that folder.

mkdir -p /WAVE/<path to your own dataset directory>/mnist-tutorial
cd /WAVE/<path to your own dataset directory>/mnist-tutorial
# At this point we should be working out of the “mnist-tutorial” projects folder
[<username>@login2 mnist-tutorial]$ pwd
/WAVE/<path to your own dataset directory>/mnist-tutorial
[<username>@login2 mnist-tutorial]$

Working with pre-installed modules

The WAVE HPC has pre-installed software covering many parallel and scientific computing needs, which are available via modules. Use the following command to see which modules are available:

module available

or

module avail

In this tutorial, we will be using TensorFlow, which is an open-source platform for machine learning.

Load TensorFlow module

There are several versions of TensorFlow available on the WAVE HPC. We will use the latest, default version. Use the following command to load the TensorFlow software:

module load TensorFlow

At this point, we should have TensorFlow, plus the modules it depends on, loaded and ready for our use. Let's check that, first by listing which modules are loaded. We could do that with the module list command, which lists all the software packages that were loaded along with the TensorFlow module. That is interesting, but what matters more is whether the specific packages required by our program are present and compatible. Let's write a quick Python script that checks the installation for those specific packages.

In the Python code below, we are interested in importing TensorFlow and NumPy. Let's use the following code to check the software installation.

# check versions
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
import numpy
print('numpy: %s' % numpy.__version__)

Using an editor, we will add the code to a file called versions.py. Note that we are executing from the /WAVE/<path to your own dataset directory>/mnist-tutorial directory.

The following command will execute our file:

python versions.py

Below is the result of that command:

[<username>@login2 mnist-tutorial]$ python versions.py
2021-10-13 13:08:24.307690: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
tensorflow: 2.3.2
numpy: 1.17.3
[<username>@login2 mnist-tutorial]$
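If a required package is missing, versions.py stops at the first failed import. A slightly more forgiving variant is sketched below; it is an illustration rather than part of the original tutorial, and the function name `report_versions` is hypothetical. It reports every requested package, marking those that cannot be imported.

```python
import importlib


def report_versions(module_names):
    """Return {name: version string, or None if the module cannot be imported}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is not installed in the current environment.
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to "unknown".
        versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling `print(report_versions(["tensorflow", "numpy"]))` after `module load TensorFlow` should show a version string for each package; a value of None means that package could not be imported.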

At this point, we have a local copy of the data and we are able to load the required modules. As an alternative to creating a script file, you could open a terminal (e.g. WAVE Shell Access), type "python", and enter the same statements at the interactive prompt.

Now we turn our attention to our sample Python model.

Sample Python program

Using an editor, we'll add the following code to a file called mnist-tutorial.py. This is our sample Python program. This program is slightly different from the tutorial presented on TensorFlow.org; the main difference lies in how the data is loaded. The TensorFlow.org tutorial relies on a Keras utility that loads data from an external URL. In the HPC, however, we will be running this program from a compute node, which does not have external internet access. For that reason, we downloaded the data to a local copy stored in the datasets filesystem, and the program reads it from there.

Note: It is not the intention of this tutorial to teach the user how to use TensorFlow for machine learning. Instead, we are focused specifically on the nuances of running something like TensorFlow within the HPC. If you are interested in understanding more about how this code does machine learning, refer back to the TensorFlow.org tutorial.

# Set up
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist

# Load data from the local .npz file (compute nodes have no internet access)
data_dir = '/WAVE/datasets/<your dataset directory>/mnist/'
(train_examples, train_labels), (test_examples, test_labels) = mnist.load_data(path=os.path.join(data_dir, 'mnist.npz'))

# Load NumPy arrays with tf.data.Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

# Use the datasets

# Shuffle and batch the datasets
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

# Build and train a model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])
model.fit(train_dataset, epochs=10)
model.evaluate(test_dataset)
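When the program runs, Keras reports progress in steps per epoch rather than in examples. The arithmetic behind that step count, given the BATCH_SIZE used above, can be sketched as follows. This is an illustrative helper only: the name `batch_plan` is hypothetical, and it assumes the final partial batch is kept, which is the default behavior of `Dataset.batch`.

```python
import math


def batch_plan(num_examples, batch_size):
    """Return (number of batches, size of the final batch) when no examples are dropped."""
    if num_examples == 0:
        return 0, 0
    num_batches = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch


# MNIST ships with 60,000 training images and 10,000 test images.
print(batch_plan(60000, 64))  # (938, 32)
print(batch_plan(10000, 64))  # (157, 16)
```

With a batch size of 64, each training epoch should therefore show 938 steps, and the evaluation pass 157 steps.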

We are now ready to run our tutorial program. Because this is such a simple program, we could run it from a login node, but we do not want to do that. The login nodes in the WAVE HPC are for setting up and configuring our environment. They are not appropriate for compute-intensive tasks. Instead, the login nodes are our gateway to the compute resources in the WAVE cluster. We will use a resource scheduling program called Slurm to gain access to those compute resources.
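One way to guard against accidentally running a heavy script on a login node is to check for the environment variables that Slurm sets inside a job. The sketch below is an assumption-based illustration, not a WAVE-specific recommendation: it relies on `SLURM_JOB_ID` being present inside Slurm jobs, and the function name `running_under_slurm` is hypothetical.

```python
import os


def running_under_slurm(environ=None):
    """Return True if a Slurm job ID is present in the given environment."""
    # Default to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    return bool(environ.get("SLURM_JOB_ID"))


if not running_under_slurm():
    print("Warning: no SLURM_JOB_ID found; this may be a login node.")
```

A check like this could be placed at the top of mnist-tutorial.py so that the script warns, or exits, when it is started outside of a Slurm allocation.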

Using Slurm to access compute nodes

Slurm provides us with the ability to execute a job on the backend compute nodes from either an interactive or batch perspective. We will look at both approaches here, but in general a batch approach is more appropriate for longer, compute-intensive tasks. Here you can find links to both approaches: