This document is for people who need to fine-tune the KoBERT model. Currently, the original author's PyTorch fine-tuning code works if you follow the Colab tutorial on the official website as is, but the MXNet code does not work properly with the latest GluonNLP version (0.9.1). We therefore modified the MXNet code, referring to the GluonNLP tutorial, so that fine-tuning can be performed.
Notes
Because the Korean vocabulary is large, fine-tuning takes 2-3 times longer than for English, so it is not recommended as a real-time hands-on lab. A live lab is still possible if you sample the dataset and train for only 1 epoch. In either case, GPU-family (p, g) instances are recommended.
The following website is strongly recommended as a tutorial for BERT fine-tuning. Fine-tuning takes less than 10 minutes.
The Naver Movie Review data is publicly available and consists of 150,000 training examples and 50,000 test examples. In Korea, it is often used for NLP benchmarking, much like the IMDB review data. Sample data is shown below.
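The sampling approach mentioned in the notes can be sketched as follows. This is a minimal illustration, assuming the data is a tab-separated file with a header row and the columns `id`, `document`, and `label` (the layout of the public release, stated here as an assumption). The rows in the sketch are invented placeholders, not actual records from the dataset, and `read_reviews` and `subsample` are hypothetical helper names.

```python
import csv
import io
import random

# Invented placeholder rows in the assumed layout (id, document, label).
# They are NOT actual records from the dataset.
SAMPLE_TSV = (
    "id\tdocument\tlabel\n"
    "1\t정말 재미있는 영화였어요\t1\n"
    "2\t시간이 아까웠습니다\t0\n"
    "3\t배우들의 연기가 훌륭합니다\t1\n"
    "4\t이야기가 지루해요\t0\n"
)


def read_reviews(handle):
    """Return (document, label) pairs from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["document"], int(row["label"])) for row in reader]


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset holding roughly `fraction` of the rows."""
    count = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, count)


reviews = read_reviews(io.StringIO(SAMPLE_TSV))
small_train = subsample(reviews, fraction=0.5)
print(len(reviews), len(small_train))
```

With the real training file, passing an open file handle instead of the `io.StringIO` object and a small `fraction` should shorten an epoch enough for a live session.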
Prerequisites
SageMaker GPU notebook instances or EC2 DLAMI
MXNet 1.6.0 is required to work with the latest version of GluonNLP, but MXNet 1.6.0 does not support CUDA 10.0. Since CUDA 10.0 is the default, you must upgrade to CUDA 10.1 or 10.2. Alternatively, you can use a CUDA 10.1 DLAMI from the marketplace.
Installation
$ wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
$ sudo sh cuda_10.2.89_440.33.01_linux.run
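After the upgrade, the MXNet build has to match the installed CUDA version. The sketch below derives a pip requirement string from the text printed by `nvcc --version`. It assumes the CUDA builds of MXNet are published as `mxnet-cu101` and `mxnet-cu102`; `mxnet_requirement` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: CUDA builds of MXNet are published on PyPI as mxnet-cu<major><minor>,
# for example mxnet-cu101 and mxnet-cu102.
SUPPORTED_CUDA = {"10.1", "10.2"}  # versions this guide says work with MXNet 1.6.0


def mxnet_requirement(nvcc_output, mxnet_version="1.6.0"):
    """Build a pip requirement string from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release number")
    major, minor = match.groups()
    cuda = "{}.{}".format(major, minor)
    if cuda not in SUPPORTED_CUDA:
        raise ValueError("CUDA {} is not usable here; upgrade to 10.1 or 10.2".format(cuda))
    return "mxnet-cu{}{}=={}".format(major, minor, mxnet_version)


example = "Cuda compilation tools, release 10.2, V10.2.89"
requirement = mxnet_requirement(example)
print(requirement)
```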
Step 1. Setting
Download the latest version of KoBERT. As of 2020/5/12, the latest version is 0.1.1.
Add one layer for classifier training. You can use the original source as it is, but it is more convenient to use the model.BERTClassifier class provided by GluonNLP. The BERT model's first output is the sequence embedding and its second output is the class embedding, so the second output is used for fine-tuning.
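Conceptually, the added layer is a single dense layer applied to the class embedding, followed by a softmax when probabilities are needed. The toy sketch below shows that computation in plain Python with made-up sizes and random values; it is not GluonNLP code, and `dense` and `softmax` are hypothetical helpers written for this illustration.

```python
import math
import random


def dense(vector, weights, bias):
    """Apply one fully connected layer; `weights` holds one row per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden_size, num_classes = 8, 2  # toy sizes; BERT-base uses a hidden size of 768

# Stand-in for the second BERT output (the class embedding of one review).
pooled = [rng.uniform(-1, 1) for _ in range(hidden_size)]

# The newly added layer: the only parameters that start from random initialization.
weights = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = softmax(dense(pooled, weights, bias))
print(probs)
```

During fine-tuning, both this new layer and the pretrained BERT weights receive gradient updates.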
Total training time is approximately 40 minutes when using p3.2xlarge instances.
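The training code in this section calls a `format_time` helper that is not defined in the text. A minimal stand-in, assuming it only needs to render elapsed seconds as h:mm:ss, could look like this:

```python
import datetime


def format_time(elapsed):
    """Render a duration given in seconds as h:mm:ss, rounded to whole seconds."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


print(format_time(2400.4))
```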
import os
import time

output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

for epoch_id in range(num_epochs):

    # === Training phase ===

    # Measure how long the training epoch takes.
    t0 = time.time()

    metric.reset()
    step_loss = 0
    total_loss = 0

    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(tqdm(train_dataloader)):
        with mx.autograd.record():

            # Load the data to the GPU.
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # Perform a backward pass to calculate the gradients
        ls.backward()

        # Gradient clipping
        # step() can be used for normal parameter updates, but to apply gradient clipping
        # you need to manually call allreduce_grads() and update() separately.
        trainer.allreduce_grads()
        nlp.utils.clip_grad_global_norm(params, max_grad_norm)
        trainer.update(1)

        step_loss += ls.asscalar()
        total_loss += ls.asscalar()
        metric.update([label], [out])

        # Printing vital information
        if (batch_id + 1) % (log_interval) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                  .format(epoch_id, batch_id + 1, len(train_dataloader),
                          step_loss / log_interval,
                          trainer.learning_rate, metric.get()[1]))
            step_loss = 0

    train_avg_acc = metric.get()[1]
    # batch_id is zero-based, so the number of batches is batch_id + 1.
    train_avg_loss = total_loss / (batch_id + 1)
    total_loss = 0

    # Measure how long this epoch took.
    train_time = format_time(time.time() - t0)

    # === Validation phase ===

    # Measure how long the validation epoch takes.
    t0 = time.time()
    valid_avg_acc, valid_avg_loss = evaluate_accuracy(bert_classifier, test_dataloader, ctx)

    # Measure how long the validation run took.
    valid_time = format_time(time.time() - t0)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_id + 1,
            'train_acc': train_avg_acc,
            'train_loss': train_avg_loss,
            'train_time': train_time,
            'valid_acc': valid_avg_acc,
            'valid_loss': valid_avg_loss,
            'valid_time': valid_time
        }
    )

    # === Save Model Parameters ===
    bert_classifier.save_parameters(os.path.join(output_dir, 'net_epoch{}.params'.format(epoch_id)))
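The gradient-clipping step in the loop rescales all gradients together whenever their joint L2 norm exceeds `max_grad_norm`. The plain-Python sketch below illustrates that idea on small lists. It is a conceptual stand-in under that assumption, not the GluonNLP implementation, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = max_norm / (total_norm + 1e-8)
    if scale >= 1.0:
        # Already within the limit: leave the gradients unchanged.
        return [list(grad) for grad in gradients], total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm


grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm is sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.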
Running the training loop gives the following result.
Total training time is approximately 12 minutes when using p3.8xlarge instances, where each batch is split across multiple GPUs.
import os
import time

output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

training_stats = []
step_num = 0

# Measure the total training time for the whole run.
total_t0 = time.time()

num_epochs = 1
for epoch_id in range(num_epochs):

    # === Training phase ===

    # Measure how long the training epoch takes.
    t0 = time.time()

    metric.reset()
    step_loss = 0
    total_loss = 0

    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(train_dataloader):

        # Load the data to the GPUs
        token_ids_ = gluon.utils.split_and_load(token_ids, ctx, even_split=False)
        valid_length_ = gluon.utils.split_and_load(valid_length, ctx, even_split=False)
        segment_ids_ = gluon.utils.split_and_load(segment_ids, ctx, even_split=False)
        label_ = gluon.utils.split_and_load(label, ctx, even_split=False)

        losses = []
        with autograd.record():
            for t, v, s, l in zip(token_ids_, valid_length_, segment_ids_, label_):
                # Forward computation
                out = bert_classifier(t, s, v.astype('float32'))
                ls = loss_function(out, l).mean()
                losses.append(ls)
                metric.update([l], [out])

        # Perform a backward pass to calculate the gradients
        for ls in losses:
            ls.backward()

        trainer.step(1)

        # Sum losses over all devices
        step_loss += sum([l.sum().asscalar() for l in losses])
        total_loss += sum([l.sum().asscalar() for l in losses])

        # Printing vital information
        if (batch_id + 1) % (log_interval) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                  .format(epoch_id, batch_id + 1, len(train_dataloader),
                          step_loss / log_interval,
                          trainer.learning_rate, metric.get()[1]))
            step_loss = 0

    train_avg_acc = metric.get()[1]
    # batch_id is zero-based, so the number of batches is batch_id + 1.
    train_avg_loss = total_loss / (batch_id + 1)
    total_loss = 0

    # Measure how long this epoch took.
    train_time = format_time(time.time() - t0)

    # === Validation phase ===

    # Measure how long the validation epoch takes.
    t0 = time.time()
    valid_avg_acc, valid_avg_loss = evaluate_accuracy(bert_classifier, test_dataloader, ctx)

    # Measure how long the validation run took.
    valid_time = format_time(time.time() - t0)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_id + 1,
            'train_acc': train_avg_acc,
            'train_loss': train_avg_loss,
            'train_time': train_time,
            'valid_acc': valid_avg_acc,
            'valid_loss': valid_avg_loss,
            'valid_time': valid_time
        }
    )

    # === Save Model Parameters ===
    bert_classifier.save_parameters(os.path.join(output_dir, 'net_epoch{}.params'.format(epoch_id)))
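The multi-GPU loop cuts each batch into one slice per device before the forward pass, and `even_split=False` allows slices of unequal size when the batch does not divide evenly. The sketch below illustrates that kind of uneven split on a plain list. How the remainder is distributed here (the last slice takes it) is an assumption made for illustration, and `split_batch` is a hypothetical helper rather than the MXNet function.

```python
def split_batch(batch, num_devices):
    """Split a batch into contiguous slices; the last slice takes the remainder."""
    if len(batch) < num_devices:
        # Fewer samples than devices: one sample per slice, some devices stay idle.
        return [batch[i:i + 1] for i in range(len(batch))]
    step = len(batch) // num_devices
    slices = [batch[i * step:(i + 1) * step] for i in range(num_devices - 1)]
    slices.append(batch[(num_devices - 1) * step:])
    return slices


slices = split_batch(list(range(10)), 4)
print(slices)
```

Each slice then produces its own loss, which is why the loop collects a list of losses and calls `backward()` on every one of them before the single `trainer.step(1)`.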
Running the training loop gives the following result.
Overfitting usually sets in from the 4th epoch: the validation loss starts to increase while the training loss keeps decreasing. We therefore keep the results of the 3rd epoch as the final model parameters. The validation accuracy was fairly good at 89.6-89.7%. This is lower than the 90.1% accuracy reported on the official site, but no hyperparameter tuning was done.
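Picking the checkpoint to keep can be done from the `training_stats` list built by the training loop. The sketch below selects the epoch with the lowest validation loss. The numbers are hypothetical values chosen only to show the pattern and are not results from this run; `best_epoch` and `checkpoint_name` are hypothetical helpers.

```python
# Hypothetical numbers for illustration only; they are NOT results from this run.
training_stats = [
    {'epoch': 1, 'train_loss': 0.40, 'valid_loss': 0.30},
    {'epoch': 2, 'train_loss': 0.28, 'valid_loss': 0.27},
    {'epoch': 3, 'train_loss': 0.21, 'valid_loss': 0.26},
    {'epoch': 4, 'train_loss': 0.15, 'valid_loss': 0.29},
    {'epoch': 5, 'train_loss': 0.10, 'valid_loss': 0.33},
]


def best_epoch(stats):
    """Return the stats entry with the lowest validation loss."""
    return min(stats, key=lambda entry: entry['valid_loss'])


def checkpoint_name(entry):
    """Map a stats entry to its parameter file.

    The training loop stores 'epoch' as epoch_id + 1 but names files with epoch_id,
    so the file index is one lower than the recorded epoch number.
    """
    return 'net_epoch{}.params'.format(entry['epoch'] - 1)


best = best_epoch(training_stats)
print(best['epoch'], checkpoint_name(best))
```

Note the off-by-one between the recorded epoch number and the file name: the parameters for the 3rd epoch are in `net_epoch2.params`.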
Even training for only 1 epoch yields 88% accuracy, so we plan to convert this into a SageMaker hands-on lab in the future.
After training completes, compress the vocab file (.spiece) and the model file into model.tar.gz and upload it to Amazon S3 so that the SageMaker endpoint can be created from it.
$ cp ~/kobert/kobert_news_wiki_ko_cased-1087f8699e.spiece ./model_save/.
$ cd model_save
$ tar cvfz model.tar.gz ./*.params ./*.spiece
$ aws s3 cp ./model.tar.gz s3://your-bucket-name/kobert-model/model.tar.gz
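The packaging step can also be done from Python, for example at the end of a notebook. The sketch below uses only the standard library and mirrors the `tar` command above by storing the files at the top level of the archive. `make_model_archive` is a hypothetical helper, and the demo runs on placeholder files in a temporary directory rather than on real model files.

```python
import glob
import os
import tarfile
import tempfile


def make_model_archive(model_dir, archive_name="model.tar.gz"):
    """Pack every .params and .spiece file in model_dir into a gzip-compressed tar."""
    archive_path = os.path.join(model_dir, archive_name)
    members = sorted(
        glob.glob(os.path.join(model_dir, "*.params"))
        + glob.glob(os.path.join(model_dir, "*.spiece"))
    )
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in members:
            # Store each file at the top level of the archive, without directories.
            archive.add(path, arcname=os.path.basename(path))
    return archive_path, [os.path.basename(path) for path in members]


# Demo on placeholder files; real use would point at ./model_save instead.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("net_epoch2.params", "vocab.spiece"):
        with open(os.path.join(demo_dir, name), "wb") as handle:
            handle.write(b"placeholder")
    archive_path, packed = make_model_archive(demo_dir)
    with tarfile.open(archive_path) as archive:
        archive_names = sorted(archive.getnames())

print(archive_names)
```

Like the shell command, this packs every `.params` file in the directory, so remove the checkpoints you do not want to deploy first.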
Step 3. Deploying to SageMaker Endpoint to perform Inference
A great tutorial has already been published on the AWS Korea AIML blog. Following that method, endpoint deployment requires only minor modifications.
Modify Dockerfile
The basic steps are the same as in the blog post. When editing the Dockerfile (based on ./docker/1.6.0/py3/Dockerfile.gpu), make the following changes. (If you do not use KoGPT2, you can delete the 4 lines below the # For KoGPT2 installation comment.)
RUN ${PIP} install --no-cache-dir \
    ${MX_URL} \
    git+https://github.com/dmlc/gluon-nlp.git@v0.9.0 \
    gluoncv==0.6.0 \
    mxnet-model-server==$MMS_VERSION \
    keras-mxnet==2.2.4.1 \
    numpy==1.17.4 \
    onnx==1.4.1 \
    "sagemaker-mxnet-inference>2"

# For KoBERT installation
RUN git clone https://github.com/SKTBrain/KoBERT.git \
    && cd KoBERT \
    && ${PIP} install -r requirements.txt \
    && ${PIP} install .

# For KoGPT2 installation
RUN git clone https://github.com/SKT-AI/KoGPT2.git \
    && cd KoGPT2 \
    && ${PIP} install -r requirements.txt \
    && ${PIP} install .

RUN ${PIP} uninstall -y mxnet ${MX_URL}
RUN ${PIP} install ${MX_URL}
Next, the process of building the Docker image, testing it, and pushing it to ECR is the same as in the blog post.
SageMaker
Now you can paste the script code below in the SageMaker notebook instance and then create the endpoint by specifying the script code as the entrypoint. The code example is shown below.
Note that the endpoint deployment time is about 9-11 minutes when using the GPU and about 7-8 minutes when using the CPU.
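As a rough, hypothetical sketch of the shape such an entry-point script can take (it is not the script used for the deployment described here), the example below defines a `model_fn` that loads the model and a `transform_fn` that handles one request. The handler names follow the SageMaker MXNet serving convention; the exact signatures, the JSON field `text`, the label mapping, and the dummy scorer are all assumptions made for illustration.

```python
import json

LABELS = ["negative", "positive"]  # assumed mapping: 0 = negative, 1 = positive


def model_fn(model_dir):
    """Load the model once when the container starts.

    A real script would rebuild the KoBERT classifier here and load the .params
    and .spiece files from model_dir. This placeholder returns a dummy scorer.
    """
    def dummy_scorer(text):
        # Placeholder logic, NOT a real model: fixed scores regardless of input.
        return [0.25, 0.75]
    return dummy_scorer


def transform_fn(model, request_body, content_type, accept_type):
    """Decode a JSON request, run the model, and encode a JSON response."""
    if content_type != "application/json":
        raise ValueError("unsupported content type: {}".format(content_type))
    payload = json.loads(request_body)
    scores = model(payload["text"])
    best = max(range(len(scores)), key=scores.__getitem__)
    response = {"label": LABELS[best], "scores": scores}
    return json.dumps(response, ensure_ascii=False), accept_type


# Local smoke test of the two handlers, outside of SageMaker.
model = model_fn("./model_save")
body, returned_type = transform_fn(
    model, json.dumps({"text": "정말 재미있는 영화였어요"}),
    "application/json", "application/json")
print(body)
```

Calling the handlers locally like this, before building the container, is a quick way to catch request-parsing errors without waiting for an endpoint deployment.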