Resolving review comments
SundarRajan98 committed Mar 26, 2024
1 parent fa0ee0f commit c254b2b
Showing 4 changed files with 11 additions and 93 deletions.
20 changes: 10 additions & 10 deletions rocAL_pybind/examples/rocAL_training_example/README.md
@@ -1,20 +1,13 @@
# ImageNet training in PyTorch

This implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.
This version has been modified to use rocAL. It assumes that the dataset is raw JPEGs from the ImageNet dataset. If offers CPU and GPU based pipeline for rocAL - use rocal-cpu switch to enable CPU one and use rocal-gpu switch to enable GPU one.

To run use the following command
```bash
rm *.pth.tar # Remove older checkpoints saved in the folder if the example has been run before
python3 main.py -a resnet50 --dist-url='tcp://127.0.0.1:4321' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 -j$(nproc) --batch-size 1024 --rocal-cpu --epochs 91 /media/imageNetCompleteDataset/
```
This example implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.
This version has been modified to use rocAL. It assumes that the dataset is raw JPEGs from the ImageNet dataset. If offers CPU and GPU based pipeline for rocAL - use `rocal-cpu` switch to enable CPU and use `rocal-gpu` switch to enable GPU.

## Requirements

- Install PyTorch for [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html)
- Install rocAL for running rocAL trainings
- Download the ImageNet dataset from http://www.image-net.org/
- Then, move and extract the training and validation images to labeled subfolders, using [the following shell script](extract_ILSVRC.sh)
- Download the ImageNet dataset from http://www.image-net.org/ and use [the following shell script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) to move and extract the training and validation images to labeled subfolders

## Training

@@ -29,6 +22,13 @@ The default learning rate schedule starts at 0.1 and decays by a factor of 10 ev
```bash
python main.py -a alexnet --lr 0.01 [imagenet-folder with train and val folders]
```
To run a rocAL integrated training, use `rocal-cpu` or `rocal-gpu`

```bash
python3 main.py -a resnet50 -j$(nproc) --batch-size 1024 --rocal-cpu [imagenet-folder with train and val folders]
```

Make sure to remove older checkpoints (`rm *.pth.tar`) saved in the folder if the example has been run before

## Use Dummy Data

80 changes: 0 additions & 80 deletions rocAL_pybind/examples/rocAL_training_example/extract_ILSVRC.sh

This file was deleted.

2 changes: 1 addition & 1 deletion rocAL_pybind/examples/rocAL_training_example/main.py
@@ -219,7 +219,7 @@ def main():
    if torch.cuda.is_available():
        ngpus_per_node = torch.cuda.device_count()
        if ngpus_per_node == 1 and args.dist_backend == "nccl":
            warnings.warn("nccl backend >=2.5 requires GPU count>1, see https://github.com/NVIDIA/nccl/issues/103 perhaps use 'gloo'")
            warnings.warn("nccl backend >=2.5 requires GPU count>1, perhaps use 'gloo'")
    else:
        ngpus_per_node = 1
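
The hunk above only warns when `nccl` is requested on a single GPU; it does not change the backend. As a rough sketch of what acting on that hint could look like, the helper below falls back to `gloo` in that case. The function name `choose_dist_backend` and the fallback behaviour are assumptions made for illustration and are not part of this commit or of `main.py`.

```python
import warnings


def choose_dist_backend(requested: str, gpu_count: int) -> str:
    """Hypothetical helper: pick a distributed backend for the given GPU count.

    Mirrors the condition in main.py (one GPU and the "nccl" backend), but
    instead of only warning it returns "gloo" as the fallback.
    """
    if requested == "nccl" and gpu_count == 1:
        warnings.warn(
            "nccl backend >=2.5 requires GPU count>1, falling back to 'gloo'"
        )
        return "gloo"
    # Any other combination is passed through unchanged.
    return requested


if __name__ == "__main__":
    # Single GPU with nccl requested: the sketch switches to gloo.
    print(choose_dist_backend("nccl", 1))
    # Multiple GPUs: the requested backend is kept.
    print(choose_dist_backend("nccl", 4))
```

Whether a silent fallback is preferable to the warning-only behaviour in the commit is a design choice; keeping the warning leaves the decision with the user.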

2 changes: 0 additions & 2 deletions rocAL_pybind/examples/rocAL_training_example/requirements.txt

This file was deleted.
