[Hands-on] Fast Training ImageNet on on-demand EC2 GPU instances with Horovod

Goal

This document is for people who need to run distributed GPU training with Horovod for experimental purposes. Many steps are similar to those described in Julien Simon's ImageNet articles (https://medium.com/@julsimon/imagenet-part-1-going-on-an-adventure-c0a62976dc72 and https://medium.com/@julsimon/imagenet-part-2-the-road-goes-ever-on-and-on-578f09a749f9) and in the AWS documentation (https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html), so I recommend reading those first. If something does not go well (e.g., downloading the dataset does not work, you are not sure how to convert the raw data to the TFRecord feature set, or you hit the error ModuleNotFoundError: No module named 'cv2'), please refer to this document.

Introduction

For data preparation and data transformation, we do not need a GPU instance such as p2 or p3. Instead, we can start a much cheaper instance such as t2.large with a 1.0TB EBS volume.

For distributed training, we need to use multiple GPU instances like p2, p3, g3 and g4.

You can skip Step 1 if you do not want to reinvent the wheel, because I have stored everything in my S3 bucket.

  • s3://dataset-image/imagenet/raw (raw jpeg)

  • s3://dataset-image/imagenet/tfrecord (TFRecord before resizing)

  • s3://dataset-image/imagenet/tfrecord-resized (TFRecord after resizing to 224x224)

  • s3://dataset-image/imagenet/recordio (RecordIO after resizing to 256x256)

    • The reason I did not resize to 224x224 is that this fast.ai forum thread (https://forums.fast.ai/t/impact-of-image-resizing-on-model-training-time-and-performance/1980) reports different validation accuracy depending on the resizing strategy.

Please let me know if you want access to the bucket, because I did not grant any public access.
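If you do get access to the bucket, pulling one of the feature sets down to your own instance is a single AWS CLI command. The destination path below is just an example and assumes the EBS volume mounted on /data that is set up in Step 1:

      $ aws s3 sync s3://dataset-image/imagenet/tfrecord-resized /data/imagenet/tfrecord-resized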

Step 1. Downloading and Transformation

Setting up an EC2 instance for Data Transformation

  • Create an EC2 instance for storing the ImageNet dataset (Ubuntu 18.04 or 16.04; Amazon Linux is also fine). t2.micro also works, but t2.large is recommended because of its memory size. Note that we do not need a large root volume, since we will create another EBS volume and attach it to the EC2 instance.

  • Create an EBS volume (1.0TB) for the ImageNet dataset and then attach it to your EC2 instance. ImageNet itself is 138GB for the training set and 6.3GB for the validation set, but we need additional space because we have to extract the tar files and transform them into feature sets such as TFRecord and RecordIO. Here is an example using the AWS CLI.

      $ aws ec2 create-volume \
      --size 1000 \
      --region [YOUR_AWS_REGION] \
      --availability-zone [YOUR_AZ] \
      --volume-type sc1 \
      --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=ImageNet}]'
    
      $ aws ec2 attach-volume \
      --volume-id vol-[YOUR_EC2_volume_id] \
      --instance-id i-[YOUR_EC2_instance_id] \
      --device /dev/sdf
  • Format the EBS volume, mount it on /data, and then change the owner to ec2-user:ec2-user. If you do not know how to mount a volume, refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html.
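    • A minimal sketch of the formatting and mounting steps, assuming the attached volume shows up as /dev/xvdf on the instance (check the device name with lsblk first; on Nitro-based instances it may appear as an NVMe device such as /dev/nvme1n1, and on Ubuntu the default user is ubuntu rather than ec2-user):

      $ lsblk                                    # confirm the device name of the newly attached volume
      $ sudo mkfs -t ext4 /dev/xvdf              # format the volume (this erases any existing data on it)
      $ sudo mkdir -p /data
      $ sudo mount /dev/xvdf /data               # mount it on /data
      $ sudo chown -R ec2-user:ec2-user /data    # change the owner as described above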
  • Download the MXNet repository and the TensorFlow models repository onto the mounted volume.

      $ cd /data
      $ git clone https://github.com/tensorflow/models.git
      $ git clone https://github.com/apache/incubator-mxnet.git
      # or, you can just type `pip install mxnet`
    • [Optional] For your convenience, you can create symbolic links to these repositories.

  • [Caution] I strongly recommend using Python 2 instead of Python 3, because much of the code in the TensorFlow models repository does not work on Python 3. Please refer to https://stackoverflow.com/questions/38546672/inception-build-imagenet-data-py-typeerror-rgb-has-type-class-str-but-ex.

  • [Important Step] You also need to install OpenCV (both 3.x and 4.x work well). If you do not install OpenCV, you cannot convert the ImageNet raw data to RecordIO files, because im2rec.py uses some OpenCV functions. You may refer to https://www.pyimagesearch.com/2018/08/15/how-to-install-opencv-4-on-ubuntu/.
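    • To quickly check whether OpenCV is visible to the Python environment you will use for im2rec.py, you can run the lines below; pip install opencv-python is just one way to get the cv2 bindings, while the guide linked above builds OpenCV from source:

      $ pip install opencv-python                       # one option if cv2 is missing
      $ python -c "import cv2; print(cv2.__version__)"  # should print 3.x or 4.x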

Downloading ImageNet

Please note that the ImageNet server is sometimes unstable and the download speed is not fast; downloading can take 4 to 5 days. Before you start, go to http://www.image-net.org/, sign up, and get your own username and access key.

Method 1

  • You can use TensorFlow's download script, download_imagenet.sh (https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_imagenet.sh), by exporting your username and access key:

      $ export IMAGENET_USERNAME=[YOUR_USERNAME]
      $ export IMAGENET_ACCESS_KEY=[YOUR_ACCESS_KEY]
      $ cd imagenet/data
      $ mv imagenet_2012_validation_synset_labels.txt synsets.txt
      $ nohup bash download_imagenet.sh . synsets.txt >& download.log &
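    • Because the script runs for days under nohup, it is worth checking its progress and the free disk space from time to time; for example (paths follow the commands above):

      $ tail -f download.log    # follow the download script's output
      $ df -h /data             # make sure the 1.0TB volume is not filling up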

Method 2 (Alternative method if Method 1 does not work)

  • Download ImageNet dataset manually.

      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train.tar &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_bbox_train_v2.tar.gz &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_val.tar &
      $ nohup wget http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_bbox_val_v3.tgz &
  • Extract the validation set

      $ mkdir validation
      $ mv ILSVRC2012_img_val.tar validation
      $ cd validation
      $ tar xf ILSVRC2012_img_val.tar
    • After extracting the validation set, move the jpeg files (ILSVRC2012_val_00000001.JPEG, ..., ILSVRC2012_val_00050000.JPEG) into 1,000 directories using Julien Simon's build_validation_tree.sh script (https://github.com/juliensimon/aws/blob/master/mxnet/imagenet/build_validation_tree.sh). Each directory corresponds to a unique synset (category) such as n01728572.

  • Extract the training set

      $ mkdir train
      $ mv ILSVRC2012_img_train.tar train
      $ cd train
      $ tar xf ILSVRC2012_img_train.tar
      $ find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
    • After extracting the training set, check that the number of directories is 1,000 (class 1 is n01728572 and class 1000 is n15075141); see the quick check after this list.

  • Extract bounding boxes

      $ mkdir bounding_boxes
      $ mv ILSVRC2012_bbox_train_v2.tar.gz bounding_boxes
      $ mv ILSVRC2012_bbox_val_v3.tgz bounding_boxes
      $ cd bounding_boxes
      $ tar xzf ILSVRC2012_bbox_val_v3.tgz
      $ mkdir train
      $ mv ILSVRC2012_bbox_train_v2.tar.gz train
      $ cd train
      $ tar xzf ILSVRC2012_bbox_train_v2.tar.gz
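    • A quick way to run the sanity checks mentioned above, from the directory that contains train/ and validation/:

      $ find train -maxdepth 1 -type d | wc -l    # expect 1001 lines: the 1,000 synset directories plus train/ itself
      $ ls train | head                           # directory names should be synsets such as n01728572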

Data Transformation

RecordIO format

  • Use im2rec.py the same way Julien Simon does in his ImageNet series (see References). It takes about 1.5 days on the t2.large instance. I think he made a typo there: the ImageNet baseline usually uses 224x224 images, but he uses 480x480.
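    • A minimal sketch of the im2rec.py workflow, assuming the train/ and validation/ directories from Step 1 live under /data and using the 256x256 resize mentioned for the RecordIO feature set above; the quality and thread values are only examples, so check python im2rec.py --help for the exact options in your MXNet version:

      $ cd /data
      $ python incubator-mxnet/tools/im2rec.py --list --recursive imagenet_train train/
      $ python incubator-mxnet/tools/im2rec.py --resize 256 --quality 90 --num-thread 16 imagenet_train train/
      $ python incubator-mxnet/tools/im2rec.py --list --recursive imagenet_val validation/
      $ python incubator-mxnet/tools/im2rec.py --resize 256 --quality 90 --num-thread 16 imagenet_val validation/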

TFRecord format

  • Convert the raw images to TFRecord files with preprocess_imagenet.py (https://github.com/aws-samples/deep-learning-models/blob/master/utils/tensorflow/preprocess_imagenet.py); the AWS DLAMI Horovod tutorial (see References) also documents this step.

      $ python preprocess_imagenet.py \
      --local_scratch_dir=[YOUR DIRECTORY] \
      --imagenet_username=[imagenet account] \
      --imagenet_access_key=[imagenet access key]

  • Resize the resulting TFRecord training dataset with tensorflow_image_resizer.py.

      $ python tensorflow_image_resizer.py \
      -d imagenet \
      -i [PATH TO TFRECORD TRAINING DATASET] \
      -o [PATH TO RESIZED TFRECORD TRAINING DATASET] \
      --subset_name train \
      --num_preprocess_threads 60 \
      --num_intra_threads 2 \
      --num_inter_threads 2

  • [Additional Notes] The original document uses small values for the intra-op threads (multiple threads within one op; for example, a matrix multiplication can be split across multiple threads) and the inter-op threads (the thread-pool size per executor), i.e., --num_intra_threads 2 --num_inter_threads 2, but you can set both to higher values.
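    • A quick sanity check on the conversion output (the path placeholder is the same one used above):

      $ ls [PATH TO RESIZED TFRECORD TRAINING DATASET] | head    # TFRecord shards for the training subset
      $ ls [PATH TO RESIZED TFRECORD TRAINING DATASET] | wc -l   # number of shards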

Backing up and Copying to S3

  • After data transformation, create a new S3 bucket and sync or copy the feature sets to the bucket (see the example after this list).

  • Create a snapshot of the EBS volume.
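    • A minimal example using the AWS CLI; the bucket name, local paths, and volume id are placeholders:

      $ aws s3 sync /data/imagenet/tfrecord-resized s3://[YOUR_BUCKET]/imagenet/tfrecord-resized
      $ aws s3 sync /data/imagenet/recordio s3://[YOUR_BUCKET]/imagenet/recordio
      $ aws ec2 create-snapshot \
      --volume-id vol-[YOUR_EC2_volume_id] \
      --description "ImageNet feature sets"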

Step 2. Training ResNet-50 Model with Horovod

  • [Before getting started] If you just want to train on a single machine, you may refer to Julien Simon's ImageNet series (see References) for RecordIO and to the TensorFlow official ResNet model (https://github.com/tensorflow/models/tree/master/official/r1/resnet) for TFRecord.

  • Create an EC2 instance for training (Deep Learning AMI (Ubuntu 16.04) or Deep Learning AMI (Amazon Linux)). p3.16xlarge or p3dn.24xlarge is recommended if you need to do distributed GPU training with Uber's Horovod or TensorFlow's distribution strategies. Please also note that the default root volume size is 75GB; I recommend increasing it to 100GB, since training logs and model checkpoints are stored on the root volume unless you modify the training configuration. If you do not want to increase the volume size, you can instead delete some conda environments (such as Theano, Chainer, Caffe, and Caffe2) after logging in to the EC2 instance.

  • If you want to train on distributed GPUs, you need to create multiple GPU instances with the same settings, for example 8 p3dn.24xlarge instances. Please refer to the AWS DLAMI Horovod tutorial (https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html) for the remaining steps, and note that all code and all feature sets (TFRecord and RecordIO) must be on the same path on each server. A generic launch sketch is shown at the end of this section.

  • After training, check the training log and evaluation log in the imagenet_resnet folder:

  • hvd_train_log (32 GPUs; 4 p3dn.24xlarge instances)

      Step  Epoch Speed   Loss  FinLoss LR
      0     0.0   952.2   6.923 8.262   0.00100
      1     0.0   2686.6  6.928 8.267   0.00305
      50    0.3   22243.7 6.586 7.919   0.10353
      ...
      14000 89.5  21021.1 0.750 1.152   0.00012
      14050 89.8  21818.7 0.583 0.985   0.00002
      Finished in 5289.161954164505

  • eval_hvd_train.log (32 GPUs; 4 p3dn.24xlarge instances)

  • hvd_train_log (64 GPUs; 8 p3dn.24xlarge instances)

      Step  Epoch Speed   Loss  FinLoss LR
      0     0.0   1907.3  6.920 8.259   0.00100
      1     0.0   5164.9  6.935 8.274   0.00920
      50    0.6   43926.5 6.206 7.522   0.41119
      ...
      6950  88.9  43552.2 0.783 1.185   0.00125
      7000  89.5  41958.4 0.624 1.027   0.00023
      Finished in 2685.1825189590454

  • eval_hvd_train.log (64 GPUs; 8 p3dn.24xlarge instances)

    • Assuming the Finished in figures are in seconds, the 32-GPU run took roughly 88 minutes and the 64-GPU run roughly 45 minutes.
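    • For reference, a generic Horovod launch across the instances looks like the sketch below. This is not the exact command from the AWS tutorial linked above; the script name train.py and the host names are placeholders, and the DLAMI ships its own example scripts and options, so follow the tutorial for the real invocation.

      $ # 64 workers: 8 hosts x 8 GPUs each (host names and training script are placeholders)
      $ horovodrun -np 64 \
      -H server1:8,server2:8,server3:8,server4:8,server5:8,server6:8,server7:8,server8:8 \
      python train.py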

References

https://medium.com/@julsimon/imagenet-part-1-going-on-an-adventure-c0a62976dc72

https://medium.com/@julsimon/imagenet-part-2-the-road-goes-ever-on-and-on-578f09a749f9

https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html

https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html

https://aws.amazon.com/ko/getting-started/tutorials/get-started-dlami/

https://forums.fast.ai/t/impact-of-image-resizing-on-model-training-time-and-performance/1980

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html

https://www.pyimagesearch.com/2018/08/15/how-to-install-opencv-4-on-ubuntu/

https://stackoverflow.com/questions/38546672/inception-build-imagenet-data-py-typeerror-rgb-has-type-class-str-but-ex

http://www.image-net.org/

https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_imagenet.sh

https://github.com/juliensimon/aws/blob/master/mxnet/imagenet/build_validation_tree.sh

https://github.com/aws-samples/deep-learning-models/blob/master/utils/tensorflow/preprocess_imagenet.py

https://github.com/tensorflow/models/tree/master/official/r1/resnet

http://hwengineer.blogspot.com/2017/12/tensorflow-ilsvrc2012-dataset-resnet101.html (written in Korean, but very helpful)