[Hands-on] Fast Training ImageNet on on-demand EC2 GPU instances with Horovod
This document is for people who need distributed GPU training with Horovod for experimental purposes. Many steps are similar to those described in Julien Simon's article and the AWS documentation, so I recommend reading those first. If something is not going well (e.g., downloading the dataset does not work, you are not sure how to convert the raw data to a TFRecord feature set, or you hit the error ModuleNotFoundError: No module named 'cv2'), please refer to this document.
For data preparation and data transformation, we do not need a GPU instance such as p2 or p3. Instead, we can start a much cheaper instance such as t2.large with a 1.0TB EBS volume.
For distributed training, we need to use multiple GPU instances like p2, p3, g3 and g4.
You can skip step 1 if you do not want to reinvent the wheel, because I have stored everything in my S3 bucket:
s3://dataset-image/imagenet/raw (raw JPEG)
s3://dataset-image/imagenet/tfrecord (TFRecord before resizing)
s3://dataset-image/imagenet/tfrecord-resized (TFRecord after resizing to 224x224)
s3://dataset-image/imagenet/recordio (RecordIO after resizing to 256x256)
The reason I did not resize to 224x224 is that the article below reports different validation accuracy depending on the resizing strategy.
Please let me know if you want to access the bucket because I did not grant any public access.
Create an EC2 instance for storing the ImageNet dataset (Ubuntu 18.04 or 16.04; Amazon Linux also works). A t2.micro works, but a t2.large is recommended because of its memory size. Note that we do not need a large root volume, since we will create a separate EBS volume and attach it to the EC2 instance.
Create an EBS volume (1.0TB) for the ImageNet dataset and then attach the volume to your EC2 instance. ImageNet consists of 138GB for the training set and 6.3GB for the validation set, but we need additional space because we have to extract the tar files and also transform the data into feature sets such as TFRecord and RecordIO. Here is an example command using the AWS CLI.
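A minimal sketch, assuming the Availability Zone, volume ID, instance ID, and device name below (replace them with your own values):

# Create a 1.0TB gp2 volume in the same AZ as the instance, then attach it
aws ec2 create-volume --size 1000 --volume-type gp2 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf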
Download the MXNet repository and the TensorFlow models repository.
[Optional] For your convenience, create symbolic links such that:
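A minimal sketch with assumed clone locations and link targets (adjust the paths to your setup):

# Clone both repositories into the home directory
git clone https://github.com/apache/mxnet.git ~/mxnet
git clone https://github.com/tensorflow/models.git ~/models
# Optional: link them under /data so code and data share the same root
ln -s ~/mxnet /data/mxnet
ln -s ~/models /data/models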
Please note that the ImageNet server is sometimes unstable and the download speed is not fast, so downloading can take 4 to 5 days.
Download ImageNet dataset manually.
Extract the validation set. (The validation images will later be sorted into 1,000 synset directories such as n01728572; see the step below.)
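A minimal sketch, assuming the archive was downloaded to /data/imagenet/raw (path assumed):

# Extract the 50,000 validation JPEGs into a flat directory
mkdir -p /data/imagenet/raw/validation
tar -xf /data/imagenet/raw/ILSVRC2012_img_val.tar -C /data/imagenet/raw/validation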
Extract the training set
After extracting the training set, check if the number of directories is 1,000 (class 1 is n01728572 and class 1000 is n15075141).
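A minimal sketch of the extraction and the directory-count check described above (paths assumed). The outer archive contains one tar file per class, so each inner tar is extracted into its own synset directory:

mkdir -p /data/imagenet/raw/train
tar -xf /data/imagenet/raw/ILSVRC2012_img_train.tar -C /data/imagenet/raw/train
cd /data/imagenet/raw/train
# Unpack each per-class tar into a directory named after its synset, then remove the tar
for f in *.tar; do d="${f%.tar}"; mkdir -p "$d"; tar -xf "$f" -C "$d"; rm "$f"; done
ls | wc -l   # should print 1000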
Extract bounding boxes
python preprocess_imagenet.py \
    --local_scratch_dir=[YOUR DIRECTORY] \
    --imagenet_username=[imagenet account] \
    --imagenet_access_key=[imagenet access key]
python tensorflow_image_resizer.py \
    -d imagenet \
    -i [PATH TO TFRECORD TRAINING DATASET] \
    -o [PATH TO RESIZED TFRECORD TRAINING DATASET] \
    --subset_name train \
    --num_preprocess_threads 60 \
    --num_intra_threads 2 \
    --num_inter_threads 2
[Additional Notes] The original document uses a small number of intra-op threads (multiple threads within one op; for example, a matrix multiplication can be split across several threads) and inter-op threads (the thread-pool size per executor), namely --num_intra_threads 2 --num_inter_threads 2. However, you can specify higher numbers of intra-op and inter-op threads.
After the data transformation, create a new bucket and sync or copy the feature sets to the bucket.
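A minimal sketch with an assumed bucket name and local path:

aws s3 mb s3://my-imagenet-bucket
aws s3 sync /data/imagenet/tfrecord-resized s3://my-imagenet-bucket/imagenet/tfrecord-resized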
Create a snapshot of the EBS volume.
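For example (the volume ID below is a placeholder):

aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "imagenet-dataset"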
Create an EC2 instance for training (Deep Learning AMI (Ubuntu 16.04) or Deep Learning AMI (Amazon Linux)). p3.16xlarge or p3dn.24xlarge is recommended if you need to do distributed GPU training using Uber's Horovod or TensorFlow's Distribution Strategy. Please also note that the default root volume size is 75GB, but I recommend increasing it to 100GB since training logs and model checkpoints are stored in the root volume unless you modify the training configuration. If you do not want to increase the volume size, you can delete some conda environments such as Theano, Chainer, Caffe, and Caffe2 after logging in to the EC2 instance.
If you want to train on distributed GPUs, you need to create multiple GPU instances with the same settings. For example, the figure below shows 8 p3dn.24xlarge instances.
After training, please check the training log and evaluation log in the imagenet_resnet folder:
hvd_train.log (32 GPUs; 4 p3dn.24xlarge instances)
eval_hvd_train.log (32 GPUs; 4 p3dn.24xlarge instances)
hvd_train.log (64 GPUs; 8 p3dn.24xlarge instances)
eval_hvd_train.log (64 GPUs; 8 p3dn.24xlarge instances)
Format the EBS volume, mount it on /data, and then change the owner to ec2-user:ec2-user. You may refer to the AWS documentation if you do not know how to mount it.
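A minimal sketch, assuming the volume shows up as /dev/xvdf (device name assumed) and you want an ext4 filesystem:

sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /data
sudo mount /dev/xvdf /data
sudo chown -R ec2-user:ec2-user /data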
[Important Step] You also need to install OpenCV (both 3.x and 4.x work well). If you do not install OpenCV, you cannot convert the ImageNet raw data to RecordIO files, since im2rec.py uses some OpenCV functions. You may refer to the linked guide.
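A minimal sketch: the simplest route is usually the prebuilt wheel, which provides the cv2 module that im2rec.py imports:

pip install opencv-python
python -c "import cv2; print(cv2.__version__)"   # verifies that ModuleNotFoundError is gone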
[Caution] I strongly recommend using Python 2 instead of Python 3 because much of the code in the TensorFlow models repository does not work on Python 3. Please refer to the linked reference.
Go to the ImageNet website, sign up, and get your own username and access key.
You can use TensorFlow's download script by exporting your username and access key.
After extracting the validation set, move the JPEG files (ILSVRC2012_val_00000001.JPEG, ..., ILSVRC2012_val_00050000.JPEG) into 1,000 directories using the following script. (Each directory corresponds to a unique category, i.e., a synset such as n01728572.)
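A minimal sketch of such a script, assuming a hypothetical file val_synsets.txt that lists the synset ID of each validation image in order (the actual mapping file and paths depend on your setup):

cd /data/imagenet/raw/validation
i=1
# Move each validation image into the directory named after its synset
while read synset; do
  mkdir -p "$synset"
  mv "$(printf 'ILSVRC2012_val_%08d.JPEG' "$i")" "$synset/"
  i=$((i + 1))
done < val_synsets.txt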
Use im2rec.py the same way Simon did. It takes about 1.5 days on the t2.large instance. I think he made some typos (the ImageNet baseline usually uses 224x224 images, but he uses 480x480).
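A minimal sketch of the usual two-pass im2rec.py workflow (list generation, then packing), with assumed paths and a 256x256 resize to match the recordio bucket above; the exact flags Simon used may differ:

# Pass 1: build the image list file (train.lst) from the synset directories
python ~/mxnet/tools/im2rec.py --list --recursive train /data/imagenet/raw/train
# Pass 2: pack the images into RecordIO, resizing the shorter side to 256
python ~/mxnet/tools/im2rec.py --resize 256 --quality 90 --num-thread 16 train /data/imagenet/raw/train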
Please refer to the linked code and document.
[Before getting started] If you just want to train on a single machine, you may refer to the linked examples (RecordIO and TFRecord).
Please refer to the linked website for the remaining steps. Note that all code and all feature sets (TFRecord and RecordIO) must be on the same path on each server.
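As a rough illustration of how a multi-node Horovod job is usually launched (the hostnames, slot counts, and training script name below are assumptions, not the exact command from the linked guide):

# 4 hosts x 8 GPUs = 32 workers; each host name must resolve and share the same paths
horovodrun -np 32 -H server1:8,server2:8,server3:8,server4:8 python train.py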
(The linked article is written in Korean, but it is very helpful.)