[Hands-on] Fine Tuning Naver Movie Review Sentiment Classification with KoBERT using GluonNLP
Author: Daekeun Kim (daekeun@amazon.com)
Goal
This document is for anyone who needs to fine-tune the KoBERT model. The original author's PyTorch fine-tuning code works if you follow the Colab tutorial on the official website as is, but the MXNet code does not work properly with the latest GluonNLP version (0.9.1). We therefore modified the MXNet code, following the GluonNLP tutorial (see https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html), so that fine-tuning works again.
Notes
Because Korean has a large vocabulary, fine-tuning takes 2-3 times longer than for English, so the full run is not recommended as a real-time hands-on lab. A lab is feasible if you sample the dataset and train for only 1 epoch. In either case, GPU-family (p, g) instances are recommended.
The following website is strongly recommended as a tutorial for BERT fine-tuning. Fine-tuning takes less than 10 minutes.
Naver Movie Review data is publicly available at https://github.com/e9t/nsmc/ and consists of 150,000 training examples and 50,000 test examples. In Korea it is often used for NLP benchmarking, much like the IMDB review data. Sample data is shown below.
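As a rough illustration of the sampling idea mentioned in the notes, the sketch below parses NSMC-style tab-separated text (a header line with id, document and label columns) and draws a reproducible random subset. The helper names read_nsmc and sample_rows are hypothetical, and the rows in SAMPLE_TSV are made-up placeholders, not real reviews.

```python
import csv
import io
import random


def read_nsmc(stream):
    """Parse NSMC-style TSV text into a list of (document, label) pairs."""
    # QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        document = (record.get("document") or "").strip()
        if not document:
            # Skip rows whose review text is empty, in case any are present.
            continue
        rows.append((document, int(record["label"])))
    return rows


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset containing about `fraction` of rows."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


# Made-up placeholder rows that only mimic the file layout.
SAMPLE_TSV = (
    "id\tdocument\tlabel\n"
    "1\tplaceholder positive review\t1\n"
    "2\tplaceholder negative review\t0\n"
    "3\t\t0\n"
    "4\tanother placeholder review\t1\n"
)

rows = read_nsmc(io.StringIO(SAMPLE_TSV))
subset = sample_rows(rows, fraction=0.5, seed=42)
print(len(rows), len(subset))
```

With the real files you would pass an open file handle (UTF-8) instead of the in-memory string, and choose the fraction so that one epoch fits the time available for the lab.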
Prerequisites
SageMaker GPU notebook instances or EC2 DLAMI
The latest version of GluonNLP requires MXNet 1.6.0, but MXNet 1.6.0 does not support CUDA 10.0. Since CUDA 10.0 is the default, you must upgrade to CUDA 10.1 or 10.2. Alternatively, you can use a CUDA 10.1 DLAMI from the marketplace.
Download the latest version of KoBERT. As of 2020/5/12, the latest version is 0.1.1.
$ git clone https://github.com/SKTBrain/KoBERT.git
Modify kobert/mxnet_kobert.py as follows:
$ vim KoBERT/kobert/mxnet_kobert.py
# coding=utf-8
# Copyright 2019 SK T-Brain Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import requests
import hashlib

import mxnet as mx
import gluonnlp as nlp
from gluonnlp.model import BERTModel, BERTEncoder

from .utils import download as _download
from .utils import tokenizer

mxnet_kobert = {
    'url': 'https://kobert.blob.core.windows.net/models/kobert/mxnet/mxnet_kobert_45b6957552.params',
    'fname': 'mxnet_kobert_45b6957552.params',
    'chksum': '45b6957552'
}


def get_mxnet_kobert_model(use_pooler=True,
                           use_decoder=True,
                           use_classifier=True,
                           ctx=mx.cpu(0),
                           cachedir='~/kobert/'):
    # download model
    model_info = mxnet_kobert
    model_path = _download(model_info['url'],
                           model_info['fname'],
                           model_info['chksum'],
                           cachedir=cachedir)
    # download vocab
    vocab_info = tokenizer
    vocab_path = _download(vocab_info['url'],
                           vocab_info['fname'],
                           vocab_info['chksum'],
                           cachedir=cachedir)
    return get_kobert_model(model_path, vocab_path, use_pooler, use_decoder,
                            use_classifier, ctx)


def initialize_model(vocab_file,
                     use_pooler,
                     use_decoder,
                     use_classifier,
                     ctx=mx.cpu(0)):
    vocab_b_obj = nlp.vocab.BERTVocab.from_sentencepiece(vocab_file,
                                                         padding_token='[PAD]')

    predefined_args = {
        'num_layers': 12,
        'units': 768,
        'hidden_size': 3072,
        'max_length': 512,
        'num_heads': 12,
        'dropout': 0.1,
        'embed_size': 768,
        'token_type_vocab_size': 2,
        'word_embed': None,
    }

    encoder = BERTEncoder(num_layers=predefined_args['num_layers'],
                          units=predefined_args['units'],
                          hidden_size=predefined_args['hidden_size'],
                          max_length=predefined_args['max_length'],
                          num_heads=predefined_args['num_heads'],
                          dropout=predefined_args['dropout'],
                          output_attention=False,
                          output_all_encodings=False)

    # BERT
    net = BERTModel(
        encoder,
        len(vocab_b_obj.idx_to_token),
        token_type_vocab_size=predefined_args['token_type_vocab_size'],
        units=predefined_args['units'],
        embed_size=predefined_args['embed_size'],
        word_embed=predefined_args['word_embed'],
        use_pooler=use_pooler,
        use_decoder=use_decoder,
        use_classifier=use_classifier)

    net.initialize(ctx=ctx)
    return vocab_b_obj, net


def get_kobert_pretrained_model(model_file,
                                vocab_file,
                                use_pooler=True,
                                use_decoder=False,
                                use_classifier=False,
                                num_classes=2,
                                ctx=mx.cpu(0)):
    vocab_b_obj, net = initialize_model(vocab_file, use_pooler, use_decoder,
                                        use_classifier, ctx)

    # Load fine-tuning model
    classifier = nlp.model.BERTClassifier(net, num_classes=num_classes, dropout=0.5)
    classifier.classifier.initialize(ctx=ctx)
    classifier.hybridize(static_alloc=True)
    classifier.load_parameters(model_file)
    return (classifier, vocab_b_obj)


def get_kobert_model(model_file,
                     vocab_file,
                     use_pooler=True,
                     use_decoder=True,
                     use_classifier=True,
                     ctx=mx.cpu(0)):
    vocab_b_obj, net = initialize_model(vocab_file, use_pooler, use_decoder,
                                        use_classifier, ctx)
    net.load_parameters(model_file, ctx, ignore_extra=True)
    return (net, vocab_b_obj)
Add one layer on top of BERT for classifier training. You can use the original source as it is, but it is more convenient to use the BERTClassifier class provided by GluonNLP (nlp.model.BERTClassifier). The BERT model returns two outputs: the first is the sequence of per-token embeddings and the second is the pooled class embedding. The classifier is trained on the second output.
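To make the "one extra layer" concrete, here is a conceptual sketch in plain Python of what a classification head does with the pooled class embedding: a single linear layer followed by a softmax. This is not GluonNLP's implementation. The function names are hypothetical, the input vector and weights are random stand-ins, and dropout is omitted as it would be at inference time. The sizes (768 units, 2 classes) match the values used in mxnet_kobert.py above.

```python
import math
import random


def linear(vector, weights, bias):
    """Dense layer: one dot product plus bias per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Map a pooled embedding to class probabilities."""
    return softmax(linear(pooled, weights, bias))


rng = random.Random(0)
units, num_classes = 768, 2

# Random stand-in for BERT's second output (the pooled class embedding).
pooled = [rng.uniform(-1.0, 1.0) for _ in range(units)]

# Randomly initialized head; fine-tuning would learn these values.
weights = [[rng.gauss(0.0, 0.02) for _ in range(units)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = classify(pooled, weights, bias)
print(probs)
```

During fine-tuning both this head and the underlying BERT weights are updated, which is why only a small layer needs to be added on top of the pretrained model.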
Total training time is approximately 40 minutes when using p3.2xlarge instances.
import os
import time

import mxnet as mx
import gluonnlp as nlp
from tqdm import tqdm

# bert_classifier, loss_function, metric, trainer, params, train_dataloader,
# test_dataloader, ctx, num_epochs, log_interval, max_grad_norm, format_time()
# and evaluate_accuracy() are assumed to be defined in earlier notebook cells.

output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

for epoch_id in range(num_epochs):

    # === Training phase ===

    # Measure how long the training epoch takes.
    t0 = time.time()

    metric.reset()
    step_loss = 0
    total_loss = 0

    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(tqdm(train_dataloader)):
        with mx.autograd.record():
            # Load the data to the GPU.
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # Perform a backward pass to calculate the gradients
        ls.backward()

        # Gradient clipping
        # step() can be used for normal parameter updates, but when gradient
        # clipping is applied you need to manually call allreduce_grads()
        # and update() separately.
        trainer.allreduce_grads()
        nlp.utils.clip_grad_global_norm(params, max_grad_norm)
        trainer.update(1)

        step_loss += ls.asscalar()
        total_loss += ls.asscalar()
        metric.update([label], [out])

        # Printing vital information
        if (batch_id + 1) % log_interval == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                  .format(epoch_id, batch_id + 1, len(train_dataloader),
                          step_loss / log_interval,
                          trainer.learning_rate, metric.get()[1]))
            step_loss = 0

    train_avg_acc = metric.get()[1]
    # batch_id is zero-based, so the number of batches is batch_id + 1.
    train_avg_loss = total_loss / (batch_id + 1)
    total_loss = 0

    # Measure how long this epoch took.
    train_time = format_time(time.time() - t0)

    # === Validation phase ===

    # Measure how long the validation epoch takes.
    t0 = time.time()
    valid_avg_acc, valid_avg_loss = evaluate_accuracy(bert_classifier, test_dataloader, ctx)

    # Measure how long the validation run took.
    valid_time = format_time(time.time() - t0)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_id + 1,
            'train_acc': train_avg_acc,
            'train_loss': train_avg_loss,
            'train_time': train_time,
            'valid_acc': valid_avg_acc,
            'valid_loss': valid_avg_loss,
            'valid_time': valid_time
        }
    )

    # === Save Model Parameters ===
    bert_classifier.save_parameters(
        os.path.join(output_dir, 'net_epoch{}.params'.format(epoch_id)))
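The training loop clips gradients by their global norm before each update. As a minimal sketch of that idea in plain Python (gradients held as lists of floats, with a hypothetical helper name), the function below computes one L2 norm over all gradients together and, if it exceeds the limit, scales every gradient by the same factor. The library routine may differ in details such as a small epsilon in the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the global norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i, g in enumerate(grad):
                grad[i] = g * scale
    return total_norm


# Two toy "parameter gradients"; their joint norm is sqrt(3^2 + 4^2) = 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length is capped, which is why this form of clipping is generally preferred over clipping each value independently.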
If you run the code, you will get the following result.
Total training time is approximately 12 minutes when using p3.8xlarge instances.
import os
import time

from mxnet import autograd, gluon

# bert_classifier, loss_function, metric, trainer, train_dataloader,
# test_dataloader, log_interval, format_time() and evaluate_accuracy() are
# assumed to be defined in earlier notebook cells. Here ctx is a list of GPU
# contexts, for example [mx.gpu(0), mx.gpu(1), mx.gpu(2), mx.gpu(3)].

output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

training_stats = []
step_num = 0

# Measure the total training time for the whole run.
total_t0 = time.time()

num_epochs = 1
for epoch_id in range(num_epochs):

    # === Training phase ===

    # Measure how long the training epoch takes.
    t0 = time.time()

    metric.reset()
    step_loss = 0
    total_loss = 0

    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(train_dataloader):

        # Load the data to the GPUs
        token_ids_ = gluon.utils.split_and_load(token_ids, ctx, even_split=False)
        valid_length_ = gluon.utils.split_and_load(valid_length, ctx, even_split=False)
        segment_ids_ = gluon.utils.split_and_load(segment_ids, ctx, even_split=False)
        label_ = gluon.utils.split_and_load(label, ctx, even_split=False)

        losses = []
        with autograd.record():
            for t, v, s, l in zip(token_ids_, valid_length_, segment_ids_, label_):
                # Forward computation
                out = bert_classifier(t, s, v.astype('float32'))
                ls = loss_function(out, l).mean()

                losses.append(ls)
                metric.update([l], [out])

        # Perform a backward pass to calculate the gradients
        for ls in losses:
            ls.backward()

        trainer.step(1)

        # Average the per-device mean losses so that the logged value is
        # comparable to the single-GPU run.
        batch_loss = sum(l.asscalar() for l in losses) / len(losses)
        step_loss += batch_loss
        total_loss += batch_loss

        # Printing vital information
        if (batch_id + 1) % log_interval == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                  .format(epoch_id, batch_id + 1, len(train_dataloader),
                          step_loss / log_interval,
                          trainer.learning_rate, metric.get()[1]))
            step_loss = 0

    train_avg_acc = metric.get()[1]
    # batch_id is zero-based, so the number of batches is batch_id + 1.
    train_avg_loss = total_loss / (batch_id + 1)
    total_loss = 0

    # Measure how long this epoch took.
    train_time = format_time(time.time() - t0)

    # === Validation phase ===

    # Measure how long the validation epoch takes.
    t0 = time.time()
    valid_avg_acc, valid_avg_loss = evaluate_accuracy(bert_classifier, test_dataloader, ctx)

    # Measure how long the validation run took.
    valid_time = format_time(time.time() - t0)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_id + 1,
            'train_acc': train_avg_acc,
            'train_loss': train_avg_loss,
            'train_time': train_time,
            'valid_acc': valid_avg_acc,
            'valid_loss': valid_avg_loss,
            'valid_time': valid_time
        }
    )

    # === Save Model Parameters ===
    bert_classifier.save_parameters(
        os.path.join(output_dir, 'net_epoch{}.params'.format(epoch_id)))
If you run the code, you will get the following result.
Overfitting usually sets in from the 4th epoch: the validation loss starts to rise while the training loss keeps falling. We therefore keep the parameters from the 3rd epoch as the final model. Validation accuracy was 89.6-89.7%, slightly below the 90.1% reported on the official site, but no hyperparameter tuning was done.
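One simple way to pick the final checkpoint is to take the epoch with the lowest validation loss from the training_stats list that the training loop builds. The sketch below assumes that list structure. The numbers are made up purely to illustrate the shape of an overfitting curve and are not results from this experiment, and best_epoch is a hypothetical helper name.

```python
def best_epoch(stats):
    """Return the stats entry with the lowest validation loss.

    Ties are resolved in favour of the earlier epoch.
    """
    return min(stats, key=lambda s: (s["valid_loss"], s["epoch"]))


# Made-up illustrative values, not measurements.
training_stats = [
    {"epoch": 1, "train_loss": 0.40, "valid_loss": 0.30},
    {"epoch": 2, "train_loss": 0.28, "valid_loss": 0.27},
    {"epoch": 3, "train_loss": 0.20, "valid_loss": 0.26},
    {"epoch": 4, "train_loss": 0.14, "valid_loss": 0.29},
    {"epoch": 5, "train_loss": 0.09, "valid_loss": 0.33},
]

best = best_epoch(training_stats)

# The loop saves files with the zero-based epoch_id, while 'epoch' is one-based.
params_file = "net_epoch{}.params".format(best["epoch"] - 1)
print(best["epoch"], params_file)
```

Note the off-by-one between the two numbering schemes: the statistics record epoch_id + 1, but the parameter files are named with epoch_id itself.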
Even a single epoch of training reaches about 88% accuracy, so we plan to turn this into a SageMaker hands-on lab in the future.
After training completes, compress the vocab file (.spiece) and the model file into model.tar.gz and upload it to Amazon S3 so that it can be used to create the SageMaker endpoint.
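The packaging step can be sketched with Python's standard tarfile module. The file names below are hypothetical placeholders (the demo writes dummy files into a temporary directory so that it runs anywhere), and the upload to Amazon S3 is left out because it depends on your bucket and credentials.

```python
import os
import tarfile
import tempfile


def make_model_archive(file_paths, archive_path):
    """Pack the given files into a gzip-compressed tar archive, flat at the root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in file_paths:
            # arcname drops the directory part so files sit at the archive root.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


# Demo with dummy files; replace these with your real parameter and vocab files.
workdir = tempfile.mkdtemp()
placeholder_names = ["net_epoch2.params", "kobert_vocab.spiece"]  # hypothetical names
paths = []
for name in placeholder_names:
    path = os.path.join(workdir, name)
    with open(path, "wb") as handle:
        handle.write(b"placeholder")
    paths.append(path)

archive = make_model_archive(paths, os.path.join(workdir, "model.tar.gz"))

with tarfile.open(archive, "r:gz") as tar:
    archived_names = sorted(tar.getnames())
print(archived_names)
```

Keeping the files at the root of the archive means they appear directly in the model directory after extraction, which makes them easy to locate from a model-loading function.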
Step 3. Deploying to SageMaker Endpoint to perform Inference
A great tutorial on this has already been published on the AWS Korea AIML blog. Following that method, endpoint deployment needs only minor modifications.
The basic steps are the same as in the blog post. When editing the Dockerfile (based on ./docker/1.6.0/py3/Dockerfile.gpu), make the following changes. (If you do not use KoGPT2, you can delete the 4 lines below #For KoGPT2 installation.)
Next, building the Docker images, testing them, and pushing them to ECR work the same way as in the blog post.
SageMaker
Now you can paste the script code below in the SageMaker notebook instance and then create the endpoint by specifying the script code as the entrypoint. The code example is shown below.
Note that the endpoint deployment time is about 9-11 minutes when using the GPU and about 7-8 minutes when using the CPU.
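For orientation, the request-handling half of an entrypoint script might look like the sketch below. It assumes the handler naming convention of the SageMaker MXNet serving container (a transform_fn that receives the loaded model, the request body and the content types, alongside a model_fn that loads the model), so check the container documentation for the exact signature. The JSON payload schema ({"text": ...}) and the idea that the model is a callable returning class probabilities are assumptions made for illustration. A stub stands in for the real classifier so the sketch runs without MXNet.

```python
import json


def transform_fn(model, request_body, content_type, accept_type):
    """Handle one inference request (hypothetical payload: {"text": "..."}).

    `model` is assumed to be whatever the model-loading function returned;
    here it is treated as a callable mapping a string to class probabilities.
    """
    if content_type != "application/json":
        raise ValueError("Unsupported content type: {}".format(content_type))

    payload = json.loads(request_body)
    text = payload["text"]

    probabilities = [float(p) for p in model(text)]
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)

    response = {"label": predicted, "probabilities": probabilities}
    return json.dumps(response), accept_type


def stub_model(text):
    """Stand-in for the fine-tuned classifier; always returns fixed scores."""
    return [0.25, 0.75]


body, returned_type = transform_fn(
    stub_model,
    json.dumps({"text": "placeholder review"}),
    "application/json",
    "application/json",
)
print(body)
```

In a real script the model-loading function would rebuild the classifier from the files in model.tar.gz (for example with get_kobert_pretrained_model from mxnet_kobert.py), and the handler would tokenize the text before calling it.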