[Hands-on] Fast ImageNet Training on On-Demand EC2 GPU Instances with Horovod

Goal

This document is for people who need distributed GPU training with Horovod for experimental purposes. Many steps are similar to those described in Julien Simon’s article (https://medium.com/@julsimon/imagenet-part-1-going-on-an-adventure-c0a62976dc72) and the AWS documentation (https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html), so I recommend reading those first. If you run into issues they do not cover (e.g., the dataset download does not work, you are not sure how to convert the raw data to TFRecord feature sets, or you hit the error ModuleNotFoundError: No module named 'cv2'), please refer to this document.

Introduction

For data preparation and data transformation, we do not need a GPU instance such as p2 or p3. Instead, we can use a much cheaper instance, such as a t2.large with a 1.0TB EBS volume.

For distributed training, we need multiple GPU instances, such as p2, p3, g3, or g4.

You can skip Step 1 if you do not want to reinvent the wheel, because I have stored everything in my S3 bucket:

  • s3://dataset-image/imagenet/raw (raw jpeg)

  • s3://dataset-image/imagenet/tfrecord (TFRecord before resizing)

  • s3://dataset-image/imagenet/tfrecord-resized (TFRecord after resizing to 224x224)

  • s3://dataset-image/imagenet/recordio (RecordIO after resizing to 256x256)

Please let me know if you want to access the bucket, because I have not granted any public access.

Step 1. Downloading and Transformation

Setting up an EC2 instance for Data Transformation

  • Create an EC2 instance for storing the ImageNet dataset (Ubuntu 18.04 or 16.04; Amazon Linux is also fine). A t2.micro will work, but t2.large is recommended because of its larger memory. Note that we do not need a large root volume, since we will create a separate EBS volume and attach it to the EC2 instance.
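
    • For example (the AMI id, key pair, security group, and subnet below are placeholders for your own values):

      $ aws ec2 run-instances \
      --image-id ami-[YOUR_UBUNTU_AMI_ID] \
      --instance-type t2.large \
      --key-name [YOUR_KEY_PAIR] \
      --security-group-ids sg-[YOUR_SG_ID] \
      --subnet-id subnet-[YOUR_SUBNET_ID]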

  • Create a 1.0TB EBS volume for the ImageNet dataset and attach it to your EC2 instance. ImageNet takes 138GB for the training set and 6.3GB for the validation set, but we need additional space because we have to extract the tar files and then transform them into feature sets such as TFRecord and RecordIO. Here is an example command using the AWS CLI.

      $ aws ec2 create-volume \
      --size 1000 \
      --region [YOUR_AWS_REGION] \
      --availability-zone [YOUR_AZ] \
      --volume-type sc1 \
      --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=ImageNet}]'
    
      $ aws ec2 attach-volume \
      --volume-id vol-[YOUR_EC2_volume_id] \
      --instance-id i-[YOUR_EC2_instance_id] \
      --device /dev/sdf
  • Format the EBS volume, mount it on /data, and then change the owner to ec2-user:ec2-user. You may refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html if you do not know how to mount it.
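
    • For example, a minimal sketch (the attached volume may show up as /dev/xvdf or /dev/nvme1n1 depending on the instance type, so check lsblk first; on Ubuntu AMIs the default user is ubuntu rather than ec2-user):

      $ lsblk                                  # find the device name of the attached volume
      $ sudo mkfs -t ext4 /dev/xvdf            # format the new volume (this destroys any existing data)
      $ sudo mkdir /data
      $ sudo mount /dev/xvdf /data
      $ sudo chown ec2-user:ec2-user /data     # use ubuntu:ubuntu on Ubuntu AMIs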

  • Clone the TensorFlow models repository and the MXNet repository.

      $ cd /data
      $ git clone https://github.com/tensorflow/models.git
      $ git clone https://github.com/apache/incubator-mxnet.git 
      # or, you can just type `pip install mxnet`
    • [Optional] For your convenience, create a symbolic link, for example:
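
      # For example (the link name and location are arbitrary): expose /data under the home directory
      $ ln -s /data ~/data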

Downloading ImageNet

Please note that the ImageNet download server is sometimes unstable and download speeds can be slow; downloading the full dataset can take 4 to 5 days.

Method 1

Method 2 (Alternative method if Method 1 does not work)

  • Download ImageNet dataset manually.

      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train.tar &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_bbox_train_v2.tar.gz &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_val.tar &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_bbox_val_v3.tgz &
  • Extract the validation set

      $ mkdir validation
      $ mv ILSVRC2012_img_val.tar validation
      $ cd validation
      $ tar xf ILSVRC2012_img_val.tar
  • Extract the training set

      $ mkdir train
      $ mv ILSVRC2012_img_train.tar train
      $ cd train
      $ tar xf ILSVRC2012_img_train.tar
      $ rm -f ILSVRC2012_img_train.tar   # remove the outer tar so the loop below only processes the per-class tars
      $ find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
    • After extracting the training set, check that the number of directories is 1,000 (class 1 is n01728572 and class 1000 is n15075141), for example:
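
      # Quick check: count the class directories; this should print 1000
      $ ls -d n*/ | wc -l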

  • Extract bounding boxes

      $ mkdir bounding_boxes
      $ mv ILSVRC2012_bbox_train_v2.tar.gz bounding_boxes
      $ mv ILSVRC2012_bbox_val_v3.tgz bounding_boxes
      $ cd bounding_boxes
      $ tar xzf ILSVRC2012_bbox_val_v3.tgz
      $ mkdir train
      $ mv ILSVRC2012_bbox_train_v2.tar.gz train
      $ cd train
      $ tar xzf ILSVRC2012_bbox_train_v2.tar.gz

Data Transformation

RecordIO format
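
    • A minimal sketch using MXNet's im2rec.py from the incubator-mxnet repository cloned above (paths and the thread count are placeholders; adjust them to your setup). Note that the validation images are extracted into a single flat directory, so they must first be grouped into per-class directories before a labeled list can be built for them in the same way.

      # Build a .lst file that maps each training image to a class index (one subdirectory per class)
      $ python /data/incubator-mxnet/tools/im2rec.py --list --recursive train /data/train

      # Pack the images into RecordIO, resizing the shorter edge to 256 (add --center-crop for square 256x256 images)
      $ python /data/incubator-mxnet/tools/im2rec.py --resize 256 --quality 90 --num-thread 16 train /data/train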

TFRecord format

  • Please refer to https://github.com/aws-samples/deep-learning-models/blob/master/utils/tensorflow/preprocess_imagenet.py (code) and https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html (document).

      $ python preprocess_imagenet.py \
      --local_scratch_dir=[YOUR DIRECTORY] \
      --imagenet_username=[imagenet account] \
      --imagenet_access_key=[imagenet access key]

      $ python tensorflow_image_resizer.py \
      -d imagenet \
      -i [PATH TO TFRECORD TRAINING DATASET] \
      -o [PATH TO RESIZED TFRECORD TRAINING DATASET] \
      --subset_name train \
      --num_preprocess_threads 60 \
      --num_intra_threads 2 \
      --num_inter_threads 2

  • [Additional Notes] The original document uses small values for intra-op parallelism (multiple threads within one op; for example, a matrix multiplication can be split across several threads) and inter-op parallelism (the thread-pool size per executor), i.e., --num_intra_threads 2 and --num_inter_threads 2, but you can set both to higher values.

Backing up and Copying to S3

  • After data transformation, create a new bucket and sync or copy feature sets to the bucket.
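
    • For example (the bucket name and key prefixes are placeholders):

      $ aws s3 mb s3://[YOUR_BUCKET]
      $ aws s3 sync /data/tfrecord-resized s3://[YOUR_BUCKET]/imagenet/tfrecord-resized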

  • Create a snapshot of the EBS volume.
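
    • For example (the volume id is a placeholder):

      $ aws ec2 create-snapshot \
      --volume-id vol-[YOUR_EC2_volume_id] \
      --description "ImageNet dataset and feature sets"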

Step 2. Training ResNet-50 Model with Horovod

[Before getting started] If you just want to train on a single machine, you may refer to https://medium.com/@julsimon/imagenet-part-2-the-road-goes-ever-on-and-on-578f09a749f9 (RecordIO) and https://github.com/tensorflow/models/tree/master/official/r1/resnet (TFRecord).

  • Create an EC2 instance for training (Deep Learning AMI (Ubuntu 16.04) or Deep Learning AMI (Amazon Linux)). p3.16xlarge or p3dn.24xlarge is recommended if you need distributed GPU training with Uber's Horovod or TensorFlow's DistributionStrategy. Please also note that the default root volume size is 75GB; I recommend increasing it to 100GB, because training logs and model checkpoints are stored on the root volume unless you change the training configuration. If you do not want to increase the volume size, you can instead delete some conda environments, such as Theano, Chainer, Caffe, and Caffe2, after logging in to the EC2 instance.

  • If you want to train on distributed GPUs, you need to create multiple GPU instances with the same settings. For example, the figure below shows 8 p3dn.24xlarge instances. (A sketch of a multi-node launch command is shown after the training logs below.)

  • hvd_train_log (32 GPUs; 4 p3dn.24xlarge instances)

      Step   Epoch  Speed    Loss   FinLoss  LR
      0      0.0      952.2  6.923  8.262    0.00100
      1      0.0     2686.6  6.928  8.267    0.00305
      50     0.3    22243.7  6.586  7.919    0.10353
      ...
      14000  89.5   21021.1  0.750  1.152    0.00012
      14050  89.8   21818.7  0.583  0.985    0.00002
      Finished in 5289.161954164505
  • eval_hvd_train.log (32 GPUs; 4 p3dn.24xlarge instances)

  • hvd_train_log (64 GPUs; 8 p3dn.24xlarge instances)

      Step   Epoch  Speed    Loss   FinLoss  LR
      0      0.0     1907.3  6.920  8.259    0.00100
      1      0.0     5164.9  6.935  8.274    0.00920
      50     0.6    43926.5  6.206  7.522    0.41119
      ...
      6950   88.9   43552.2  0.783  1.185    0.00125
      7000   89.5   41958.4  0.624  1.027    0.00023
      Finished in 2685.1825189590454
  • eval_hvd_train.log (64 GPUs; 8 p3dn.24xlarge instances)
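
  • For reference, here is a minimal sketch of a multi-node launch. The host names, slot counts, training script, and its flags are placeholders; the logs above come from a ResNet-50 Horovod training script such as train_imagenet_resnet_hvd.py in the aws-samples/deep-learning-models repository, so check your script's --help for the exact flags.

      # Activate the DLAMI conda environment that ships TensorFlow with Horovod
      $ source activate tensorflow_p36
      # 4 p3dn.24xlarge instances x 8 GPUs each = 32 workers; all hosts must be reachable via passwordless SSH
      $ horovodrun -np 32 -H host1:8,host2:8,host3:8,host4:8 \
      python train_imagenet_resnet_hvd.py \
      --data_dir=/data/tfrecord-resized \
      --num_epochs=90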

