Practical session 4


The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it. - Steve Jobs


Development Tools for Scientific Computing 2024/2025

Pasquale Claudio Africa, Dario Coscia
21 Feb 2025

Part 1: From Binary to Multi-Class Classification

In the previous two practical sessions, we explored binary classification, which involves categorizing data into two distinct classes. However, many real-world problems require us to classify data into more than two categories, a task known as multi-class classification. In today's session, we will extend our understanding from binary to multi-class classification.

For this practical, we will work with AlexNet, a convolutional neural network architecture that made significant contributions to deep learning, especially in image classification tasks. The model contains M of parameters, just as reference modern Large Language Models (LLMs) have ~ B of parameters! We will also use the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 different classes, including airplanes, cars, birds, cats, and more. CIFAR-10 is a widely used benchmark dataset for training image classification models. The goal here is to classify images into one of these ten classes.

To efficiently train our model on larger datasets and scale it across multiple GPUs, we will utilize PyTorch Lightning, a high-level framework built on top of PyTorch. PyTorch Lightning simplifies the process of distributed and multi-GPU training, making it easier to implement, scale, and manage training workloads without having to manage the underlying complexities.

Install Required Tools and structure the package:

  • Activate the devtools_scicomp, install PyTorch Lightning and add it to the requirements.txt file. Do the same for tensorboard and torchvision.
  • Inside the devtools_scicomp_project_2025 repository create a new branch starting from the main one called deep_classifier.
  • Inside the src/pyclassify/ create model.py, module.py, datamodule.py.

Part 2: Build a Classifier with Lightining

  1. Create models and modules:
  • Inside the src/pyclassify/model.py create the AlexNet. You might want to use nn.Conv2d, nn.ReLU, nn.MaxPool2d, nn.Dropout in your implementation:

    class AlexNet(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.num_classes = num_classes
            self.features = nn.Sequential(
                # here insert convolutional blocks
            )
            self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
            self.classifier = nn.Sequential(
                # here insert the linear + dropout blocks
            )
        def forward(self, x):
            x = self.avgpool(self.features(x)).flatten(start_dim=1)
            logits = self.classifier(x)
            return logits
    
  • Inside the src/pyclassify/module.py create the Classifier module using PyTorch Lightning:
    1. In the __init__ method instantiate the AlexNet model, create the class attributes train_accuracy by:

      self.train_accuracy = torchmetrics.classification.Accuracy(task="multiclass", num_classes=???)
      

      choose the right number of classes. Do it also for val_accuracy and test_accuracy.

    2. Write a private method called _classifier_step which takes as input the batch and: (1) extract features, true_labels from it, (2) computes the logits by performing a forward pass (just call self(features)), (3) compute the CrossEntropyLoss, (4) returns the predicted label (the most probable), the true label and the loss.

  • Set up the training_step, validation_step, test_step. You should call the _classifier_step given a batch, log the accuracy, and for the training_step only return the loss as well (otherwise the module does not know what to optimize). You can log by

    self.log('train_accuracy', self.train_accuracy, on_step=True, on_epoch=False)
    

    For more on logging read here.

  • Finally, add the following class method to configure the optimizer

    def forward(self, x):
        return self.model(x)
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        return optimizer
    
  1. Create the CIFAR10DataModule:
  • Inside src/pyclassify/datamodule.py create the Lightning Datamodule:

    class CIFAR10DataModule(pl.LightningDataModule):
        def __init__(self, data_path=WHERE_TO_SAVE, batch_size=64):
            super().__init__()
            self.data_path = data_path
            self.batch_size = batch_size
        def prepare_data(self):
            datasets.CIFAR10(root=self.data_path, download=True)
            self.transform = transforms.Compose(
                [transforms.Resize((70, 70)), transforms.RandomCrop((64, 64)),
                 transforms.ToTensor()])
        def setup(self, stage=None):
            train = datasets.CIFAR10(
                root=self.data_path,
                train=True,
                transform=self.transform,
                download=False,
            )
            self.train, self.valid = random_split(train, lengths=[45000, 5000])
            self.test = datasets.CIFAR10(
                root=self.data_path,
                train=False,
                transform=self.transform,
                download=False,
            )
        def train_dataloader(self):
            # build train dataloader
        def val_dataloader(self):
            # build val dataloader
        def test_dataloader(self):
            # build test dataloader
    
  1. Setup the LightningCLI

Implementing a command line interface (CLI) makes it possible to execute an experiment from a shell terminal. By having a CLI, there is a clear separation between the Python source code and what hyperparameters are used for a particular experiment. If the CLI corresponds to a stable version of the code, reproducing an experiment can be achieved by installing the same version of the code plus dependencies and running with the same configuration.

Lightning projects usually begin with one model and one dataset. As the project grows in complexity and you introduce more models and more datasets, it becomes desirable to mix any model with any dataset directly from the command line without changing your code.

# Mix and match anything
$ python main.py fit --model=GAN --data=MNIST
$ python main.py fit --model=Transformer --data=MNIST

LightningCLI makes this very simple. Otherwise, this kind of configuration requires a significant amount of boilerplate that often looks like this:


    # choose model
    if args.model == "gan":
        model = GAN(args.feat_dim)
    elif args.model == "transformer":
        model = Transformer(args.feat_dim)
    ...
    # choose datamodule
    if args.data == "MNIST":
        datamodule = MNIST()
    elif args.data == "imagenet":
        datamodule = Imagenet()
    ...
    # mix them!
    trainer.fit(model, datamodule)

It is highly recommended that you avoid writing this kind of boilerplate and use LightningCLI instead. Let's build one! Inside scripts/run.py setup the CLI. This is very easy, just write the following:

from lightning.pytorch.cli import LightningCLI
import pyclassify.model
import pyclassify.module
import pyclassify.datamodule

cli = LightningCLI(subclass_mode_data=True, subclass_mode_model=True)

Part 3: Train, Test, Train, Test, Train, ... SCALE!

  1. Train the model on 1 GPU
    Using the Lightning CLI, set up the config file by:
python scripts/run.py fit --print_config > experiments/config.yaml

Update the config file to handle module kwargs, and run the CLI on CPU for 2 epoch
using:

python scripts/run.py fit --config=experiments/config.yaml

If everything works fine, push your changes to origin/deep_classifier.

  1. Train the model on 2 GPUs
    Can we scale and train in a multigpu fashion? Training in multi-gpus stricly depends on your application. For example, if you have a relatively small model (as AlexNet) but a lot of data you might want to use Distributed Data Parallel training; while if your model does not fit in a single GPU (like moder LLM) you might want to use Distributed Model Parallel training. Today we will see Distributed Data Parallel training.

Distributed Data Parallel training (DDP) works as follows:

  1. Each GPU across each node gets its own process.

  2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.

  3. Each process inits the model.

  4. Each process performs a full forward and backward pass in parallel.

  5. The gradients are synced and averaged across all processes.

  1. Each process updates its optimizer.

Here is an example on how it should be used:

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")

# train on 32 GPUs (4 nodes)
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp", num_nodes=4)

Now for the last task, go to Ulysses cluster, clone the repository and checkout the deep_classifier branch. Install all the requirements and the package. Update the config file to train on 1 node with 2 GPUs, and run on Ulysses the application. In order to run, build a submit.sbatch file inside shell/. You can use the following:

#!/bin/bash

# SLURM job options
#SBATCH --partition=gpu2
#SBATCH --job-name=YOUR NAME here
#SBATCH --nodes=HOW MANY NODES?
#SBATCH --gpus=HOW MANY GPUS?
#SBATCH --ntasks-per-node=HOW MANY TAKS TO SOLVE IN EACH NODE?
#SBATCH --gpus-per-task=HOW MANY GPUS FOR EACH TASK?
#SBATCH --mem=YOUR MEMORY
#SBATCH --time=00:10:00
#SBATCH --output=%x.o%j.%N
#SBATCH --error=%x.e%j.%N

# Print job details
NOW=`date +%H:%M-%a-%d/%b/%Y`
echo '------------------------------------------------------'
echo 'This job is allocated on '$SLURM_JOB_CPUS_PER_NODE' cpu(s)'
echo 'Job is running on node(s): '
echo  $SLURM_JOB_NODELIST
echo '------------------------------------------------------'
#
# ==== End of Info part (say things) ===== #
#

cd $SLURM_SUBMIT_DIR            # here we go into the submission directory
export SLURM_NTASKS_PER_NODE=2  # need to export this, not for all clusters but Ulysses has a bug :/

module load cuda/12.1           # Loading cuda
conda activate devtools_scicomp # activate the environment

# Run the script
srun python ....

Solutions

The repository with the right structure and commits is reported here: GitHub repo