To efficiently train our model on larger datasets and scale it across multiple GPUs, we will use PyTorch Lightning, a high-level framework built on top of PyTorch. PyTorch Lightning simplifies distributed and multi-GPU training, making it easier to implement, scale, and manage training workloads without dealing with the underlying complexity yourself.
Install the required tools and structure the package:

- In the `devtools_scicomp` environment, install PyTorch Lightning and add it to the `requirements.txt` file. Do the same for `tensorboard` and `torchvision`.
- In the `devtools_scicomp_project_2025` repository, create a new branch starting from the `main` one, called `deep_classifier`.
- Inside `src/pyclassify/`, create `model.py`, `module.py`, and `datamodule.py`.

Inside `src/pyclassify/model.py`, create the AlexNet. You might want to use `nn.Conv2d`, `nn.ReLU`, `nn.MaxPool2d`, and `nn.Dropout` in your implementation:
```python
import torch.nn as nn


class AlexNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes
        self.features = nn.Sequential(
            # here insert convolutional blocks
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            # here insert the linear + dropout blocks
        )

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(start_dim=1)
        logits = self.classifier(x)
        return logits
```
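If you want a starting point, here is a sketch of what could go inside the two containers, following the layer sizes of the classic (torchvision) AlexNet; other choices are equally valid:

```python
# Possible contents of the two nn.Sequential containers (torchvision AlexNet sizes):
self.features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
self.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, num_classes),
)
```

Note that the `nn.AdaptiveAvgPool2d((6, 6))` in the skeleton always produces a `256 * 6 * 6` feature vector, so the classifier sizes above work for the 64x64 crops produced by the data module below.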
Inside `src/pyclassify/module.py`, create the `Classifier` module using PyTorch Lightning.

In the `__init__` method, instantiate the `AlexNet` model and create the class attribute `train_accuracy` by:

```python
self.train_accuracy = torchmetrics.classification.Accuracy(task="multiclass", num_classes=???)
```

choosing the right number of classes. Do the same for `val_accuracy` and `test_accuracy`.
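Put together, a minimal sketch of the `__init__` (assuming CIFAR-10, hence 10 classes) could look like this:

```python
import lightning.pytorch as pl
import torch
import torchmetrics

from pyclassify.model import AlexNet


class Classifier(pl.LightningModule):
    def __init__(self, num_classes=10):
        super().__init__()
        self.model = AlexNet(num_classes=num_classes)
        # one metric object per stage, so their internal states stay separate
        self.train_accuracy = torchmetrics.classification.Accuracy(
            task="multiclass", num_classes=num_classes)
        self.val_accuracy = torchmetrics.classification.Accuracy(
            task="multiclass", num_classes=num_classes)
        self.test_accuracy = torchmetrics.classification.Accuracy(
            task="multiclass", num_classes=num_classes)
```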
Write a private method called `_classifier_step` which takes the batch as input and: (1) extracts `features` and `true_labels` from it, (2) computes the logits by performing a forward pass (just call `self(features)`), (3) computes the cross-entropy loss, (4) returns the predicted labels (the most probable class), the true labels, and the loss.
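A sketch of `_classifier_step`, assuming each batch is a `(features, true_labels)` tuple as produced by the CIFAR-10 dataloaders below:

```python
import torch.nn.functional as F  # at the top of module.py

def _classifier_step(self, batch):
    features, true_labels = batch                    # (1) unpack the batch
    logits = self(features)                          # (2) forward pass
    loss = F.cross_entropy(logits, true_labels)      # (3) cross-entropy on the logits
    predicted_labels = torch.argmax(logits, dim=1)   # (4) most probable class
    return predicted_labels, true_labels, loss
```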
Set up the `training_step`, `validation_step`, and `test_step`. Each should call `_classifier_step` on the given batch and log the accuracy; the `training_step` must also return the loss (otherwise the module does not know what to optimize). You can log with:

```python
self.log('train_accuracy', self.train_accuracy, on_step=True, on_epoch=False)
```

For more on logging, see the PyTorch Lightning documentation.
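Under the same assumptions, the three steps might look like this:

```python
def training_step(self, batch, batch_idx):
    predicted_labels, true_labels, loss = self._classifier_step(batch)
    self.train_accuracy(predicted_labels, true_labels)
    self.log("train_accuracy", self.train_accuracy, on_step=True, on_epoch=False)
    return loss  # returned so that Lightning knows what to optimize

def validation_step(self, batch, batch_idx):
    predicted_labels, true_labels, loss = self._classifier_step(batch)
    self.val_accuracy(predicted_labels, true_labels)
    self.log("val_accuracy", self.val_accuracy, on_step=False, on_epoch=True)

def test_step(self, batch, batch_idx):
    predicted_labels, true_labels, loss = self._classifier_step(batch)
    self.test_accuracy(predicted_labels, true_labels)
    self.log("test_accuracy", self.test_accuracy, on_step=False, on_epoch=True)
```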
Finally, add the `forward` method and the `configure_optimizers` method, which tells Lightning which optimizer to use:

```python
def forward(self, x):
    return self.model(x)

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
    return optimizer
```
Inside `src/pyclassify/datamodule.py`, create the Lightning DataModule. Note that the transforms are defined in `__init__` rather than in `prepare_data`: `prepare_data` is called on a single process only, so attributes set there would not exist in the other processes during distributed training.

```python
import lightning.pytorch as pl
from torch.utils.data import random_split
from torchvision import datasets, transforms


class CIFAR10DataModule(pl.LightningDataModule):
    def __init__(self, data_path=WHERE_TO_SAVE, batch_size=64):
        super().__init__()
        self.data_path = data_path
        self.batch_size = batch_size
        # Define the transforms here (not in prepare_data, which runs on one process only).
        self.transform = transforms.Compose(
            [transforms.Resize((70, 70)), transforms.RandomCrop((64, 64)),
             transforms.ToTensor()])

    def prepare_data(self):
        # Download once; setup() then loads the data with download=False.
        datasets.CIFAR10(root=self.data_path, download=True)

    def setup(self, stage=None):
        train = datasets.CIFAR10(
            root=self.data_path,
            train=True,
            transform=self.transform,
            download=False,
        )
        self.train, self.valid = random_split(train, lengths=[45000, 5000])
        self.test = datasets.CIFAR10(
            root=self.data_path,
            train=False,
            transform=self.transform,
            download=False,
        )

    def train_dataloader(self):
        # build train dataloader

    def val_dataloader(self):
        # build val dataloader

    def test_dataloader(self):
        # build test dataloader
```
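A sketch of the three dataloader methods (`num_workers=4` is an arbitrary choice; tune it for your machine):

```python
from torch.utils.data import DataLoader  # at the top of datamodule.py

def train_dataloader(self):
    return DataLoader(self.train, batch_size=self.batch_size,
                      shuffle=True, num_workers=4, drop_last=True)

def val_dataloader(self):
    return DataLoader(self.valid, batch_size=self.batch_size,
                      shuffle=False, num_workers=4)

def test_dataloader(self):
    return DataLoader(self.test, batch_size=self.batch_size,
                      shuffle=False, num_workers=4)
```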
Implementing a command line interface (CLI) makes it possible to execute an experiment from a shell terminal. By having a CLI, there is a clear separation between the Python source code and what hyperparameters are used for a particular experiment. If the CLI corresponds to a stable version of the code, reproducing an experiment can be achieved by installing the same version of the code plus dependencies and running with the same configuration.
Lightning projects usually begin with one model and one dataset. As the project grows in complexity and you introduce more models and more datasets, it becomes desirable to mix any model with any dataset directly from the command line without changing your code.
```bash
# Mix and match anything
$ python main.py fit --model=GAN --data=MNIST
$ python main.py fit --model=Transformer --data=MNIST
```
`LightningCLI` makes this very simple. Otherwise, this kind of configuration requires a significant amount of boilerplate that often looks like this:
```python
# choose model
if args.model == "gan":
    model = GAN(args.feat_dim)
elif args.model == "transformer":
    model = Transformer(args.feat_dim)
...

# choose datamodule
if args.data == "MNIST":
    datamodule = MNIST()
elif args.data == "imagenet":
    datamodule = Imagenet()
...

# mix them!
trainer.fit(model, datamodule)
```
It is highly recommended that you avoid writing this kind of boilerplate and use `LightningCLI` instead. Let's build one! Inside `scripts/run.py`, set up the CLI. This is very easy; just write the following:
```python
from lightning.pytorch.cli import LightningCLI

import pyclassify.model
import pyclassify.module
import pyclassify.datamodule

cli = LightningCLI(subclass_mode_data=True, subclass_mode_model=True)
```
Then generate a template configuration file with:

```bash
python scripts/run.py fit --print_config > experiments/config.yaml
```
Update the config file to handle the module kwargs, and run the CLI on CPU for 2 epochs using:

```bash
python scripts/run.py fit --config=experiments/config.yaml
```
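With `subclass_mode_model=True` and `subclass_mode_data=True`, the model and data sections of the generated config take a `class_path`/`init_args` form. The relevant parts of `experiments/config.yaml` might look roughly like this (the exact fields depend on your `__init__` signatures; the values shown are assumptions):

```yaml
trainer:
  accelerator: cpu
  max_epochs: 2
model:
  class_path: pyclassify.module.Classifier
  init_args:
    num_classes: 10
data:
  class_path: pyclassify.datamodule.CIFAR10DataModule
  init_args:
    data_path: ./data
    batch_size: 64
```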
If everything works fine, push your changes to `origin/deep_classifier`.
Distributed Data Parallel (DDP) training works as follows:

- Each GPU across each node gets its own process.
- Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
- Each process initializes the model.
- Each process performs a full forward and backward pass in parallel.
- The gradients are synced and averaged across all processes.
Here is an example of how it should be used:

```python
# train on 8 GPUs (same machine, i.e. one node)
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")

# train on 32 GPUs (4 nodes)
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp", num_nodes=4)
```
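When using `LightningCLI`, the same choice lives under the `trainer` section of the config file; for the run below (1 node, 2 GPUs) it might look like this:

```yaml
trainer:
  accelerator: gpu
  devices: 2
  num_nodes: 1
  strategy: ddp
```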
Now, for the last task, go to the Ulysses cluster, clone the repository, and check out the `deep_classifier` branch. Install all the requirements and the package. Update the config file to train on 1 node with 2 GPUs, and run the application on Ulysses. In order to run, build a `submit.sbatch` file inside `shell/`. You can use the following:
```bash
#!/bin/bash

# SLURM job options
#SBATCH --partition=gpu2
#SBATCH --job-name=YOUR NAME here
#SBATCH --nodes=HOW MANY NODES?
#SBATCH --gpus=HOW MANY GPUS?
#SBATCH --ntasks-per-node=HOW MANY TASKS TO RUN ON EACH NODE?
#SBATCH --gpus-per-task=HOW MANY GPUS FOR EACH TASK?
#SBATCH --mem=YOUR MEMORY
#SBATCH --time=00:10:00
#SBATCH --output=%x.o%j.%N
#SBATCH --error=%x.e%j.%N

# Print job details
NOW=`date +%H:%M-%a-%d/%b/%Y`
echo '------------------------------------------------------'
echo 'This job is allocated on '$SLURM_JOB_CPUS_PER_NODE' cpu(s)'
echo 'Job is running on node(s): '
echo $SLURM_JOB_NODELIST
echo '------------------------------------------------------'
#
# ==== End of Info part (say things) ===== #
#
cd $SLURM_SUBMIT_DIR               # go into the submission directory
export SLURM_NTASKS_PER_NODE=2     # not needed on every cluster, but Ulysses has a bug :/
module load cuda/12.1              # load CUDA
conda activate devtools_scicomp    # activate the environment

# Run the script
srun python ....
```
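Once the placeholders are filled in, the job can be submitted from the repository root with `sbatch shell/submit.sbatch`; with the `--output`/`--error` patterns above, the log files will appear in the directory you submitted from.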
The repository with the right structure and commits is available here: GitHub repo