原文:https://tensorflow.google.cn/official_models/fine_tuning_bert
In this example, we will work through fine-tuning a BERT model using the tensorflow-models PIP package.
The pretrained BERT model this tutorial is based on is also available on TensorFlow Hub, to see how to use it refer to the Hub Appendix
tf-models-official
is the stable Model Garden package. Note that it may not include the latest changes in thetensorflow_models
github repo. To include latest changes, you may installtf-models-nightly
, which is the nightly Model Garden package created daily automatically.- pip will install all models and dependencies automatically.
pip install -q tf-models-official==2.3.0
WARNING: You are using pip version 20.2.3; however, version 20.2.4 is available.
You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()
from official.modeling import tf_utils
from official import nlp
from official.nlp import bert
# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks
This directory contains the configuration, vocabulary, and a pre-trained checkpoint used in this tutorial:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"
tf.io.gfile.listdir(gs_folder_bert)
['bert_config.json',
'bert_model.ckpt.data-00000-of-00001',
'bert_model.ckpt.index',
'vocab.txt']
You can get a pre-trained BERT encoder from TensorFlow Hub:
hub_url_bert = "https://hub.tensorflow.google.cn/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"
For this example we used the GLUE MRPC dataset from TFDS.
This dataset is not set up so that it can be directly fed into the BERT model, so this section also handles the necessary preprocessing.
The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.
- Number of labels: 2.
- Size of training dataset: 3668.
- Size of evaluation dataset: 408.
- Maximum sequence length of training and evaluation dataset: 128.
glue, info = tfds.load('glue/mrpc', with_info=True,
# It's small, load the whole dataset
batch_size=-1)
Downloading and preparing dataset glue/mrpc/1.0.0 (download: 1.43 MiB, generated: Unknown size, total: 1.43 MiB) to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0...
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-train.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-validation.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-test.tfrecord
Dataset glue downloaded and prepared to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0\. Subsequent calls will reuse this data.
list(glue.keys())
['test', 'train', 'validation']
The info
object describes the dataset and it's features:
info.features
FeaturesDict({
'idx': tf.int32,
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'sentence1': Text(shape=(), dtype=tf.string),
'sentence2': Text(shape=(), dtype=tf.string),
})
The two classes are:
info.features['label'].names
['not_equivalent', 'equivalent']
Here is one example from the training set:
glue_train = glue['train']
for key, value in glue_train.items():
print(f"{key:9s}: {value[0].numpy()}")
idx : 1680
label : 0
sentence1: b'The identical rovers will act as robotic geologists , searching for evidence of past water .'
sentence2: b'The rovers act as robotic geologists , moving on six wheels .'
To fine tune a pre-trained model you need to be sure that you're using exactly the same tokenization, vocabulary, and index mapping as you used during training.
The BERT tokenizer used in this tutorial is written in pure Python (It's not built out of TensorFlow ops). So you can't just plug it into your model as a keras.layer
like you can with preprocessing.TextVectorization
.
The following code rebuilds the tokenizer that was used by the base model:
# Set up tokenizer to generate Tensorflow dataset
tokenizer = bert.tokenization.FullTokenizer(
vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
do_lower_case=True)
print("Vocab size:", len(tokenizer.vocab))
Vocab size: 30522
Tokenize a sentence:
tokens = tokenizer.tokenize("Hello TensorFlow!")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
['hello', 'tensor', '##flow', '!']
[7592, 23435, 12314, 999]
The section manually preprocessed the dataset into the format expected by the model.
This dataset is small, so preprocessing can be done quickly and easily in memory. For larger datasets the tf_models
library includes some tools for preprocessing and re-serializing a dataset. See Appendix: Re-encoding a large dataset for details.
The model expects its two inputs sentences to be concatenated together. This input is expected to start with a [CLS]
"This is a classification problem" token, and each sentence should end with a [SEP]
"Separator" token:
tokenizer.convert_tokens_to_ids(['[CLS]', '[SEP]'])
[101, 102]
Start by encoding all the sentences while appending a [SEP]
token, and packing them into ragged-tensors:
def encode_sentence(s):
tokens = list(tokenizer.tokenize(s.numpy()))
tokens.append('[SEP]')
return tokenizer.convert_tokens_to_ids(tokens)
sentence1 = tf.ragged.constant([
encode_sentence(s) for s in glue_train["sentence1"]])
sentence2 = tf.ragged.constant([
encode_sentence(s) for s in glue_train["sentence2"]])
print("Sentence1 shape:", sentence1.shape.as_list())
print("Sentence2 shape:", sentence2.shape.as_list())
Sentence1 shape: [3668, None]
Sentence2 shape: [3668, None]
Now prepend a [CLS]
token, and concatenate the ragged tensors to form a single input_word_ids
tensor for each example. RaggedTensor.to_tensor()
zero pads to the longest sequence.
cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
_ = plt.pcolormesh(input_word_ids.to_tensor())
The model expects two additional inputs:
- The input mask
- The input type
The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the input_word_ids
, and contains a 1
anywhere the input_word_ids
is not padding.
input_mask = tf.ones_like(input_word_ids).to_tensor()
plt.pcolormesh(input_mask)
<matplotlib.collections.QuadMesh at 0x7fad1c07ed30>
The "input type" also has the same shape, but inside the non-padded region, contains a 0
or a 1
indicating which sentence the token is a part of.
type_cls = tf.zeros_like(cls)
type_s1 = tf.zeros_like(sentence1)
type_s2 = tf.ones_like(sentence2)
input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor()
plt.pcolormesh(input_type_ids)
<matplotlib.collections.QuadMesh at 0x7fad143c1710>
Collect the above text parsing code into a single function, and apply it to each split of the glue/mrpc
dataset.
def encode_sentence(s, tokenizer):
tokens = list(tokenizer.tokenize(s))
tokens.append('[SEP]')
return tokenizer.convert_tokens_to_ids(tokens)
def bert_encode(glue_dict, tokenizer):
num_examples = len(glue_dict["sentence1"])
sentence1 = tf.ragged.constant([
encode_sentence(s, tokenizer)
for s in np.array(glue_dict["sentence1"])])
sentence2 = tf.ragged.constant([
encode_sentence(s, tokenizer)
for s in np.array(glue_dict["sentence2"])])
cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
input_mask = tf.ones_like(input_word_ids).to_tensor()
type_cls = tf.zeros_like(cls)
type_s1 = tf.zeros_like(sentence1)
type_s2 = tf.ones_like(sentence2)
input_type_ids = tf.concat(
[type_cls, type_s1, type_s2], axis=-1).to_tensor()
inputs = {
'input_word_ids': input_word_ids.to_tensor(),
'input_mask': input_mask,
'input_type_ids': input_type_ids}
return inputs
glue_train = bert_encode(glue['train'], tokenizer)
glue_train_labels = glue['train']['label']
glue_validation = bert_encode(glue['validation'], tokenizer)
glue_validation_labels = glue['validation']['label']
glue_test = bert_encode(glue['test'], tokenizer)
glue_test_labels = glue['test']['label']
Each subset of the data has been converted to a dictionary of features, and a set of labels. Each feature in the input dictionary has the same shape, and the number of labels should match:
for key, value in glue_train.items():
print(f'{key:15s} shape: {value.shape}')
print(f'glue_train_labels shape: {glue_train_labels.shape}')
input_word_ids shape: (3668, 103)
input_mask shape: (3668, 103)
input_type_ids shape: (3668, 103)
glue_train_labels shape: (3668,)
The first step is to download the configuration for the pre-trained model.
import json
bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())
bert_config = bert.configs.BertConfig.from_dict(config_dict)
config_dict
{'attention_probs_dropout_prob': 0.1,
'hidden_act': 'gelu',
'hidden_dropout_prob': 0.1,
'hidden_size': 768,
'initializer_range': 0.02,
'intermediate_size': 3072,
'max_position_embeddings': 512,
'num_attention_heads': 12,
'num_hidden_layers': 12,
'type_vocab_size': 2,
'vocab_size': 30522}
The config
defines the core BERT Model, which is a Keras model to predict the outputs of num_classes
from the inputs with maximum sequence length max_seq_length
.
This function returns both the encoder and the classifier.
bert_classifier, bert_encoder = bert.bert_models.classifier_model(
bert_config, num_labels=2)
The classifier has three inputs and one output:
tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)
Run it on a test batch of data 10 examples from the training set. The output is the logits for the two classes:
glue_batch = {key: val[:10] for key, val in glue_train.items()}
bert_classifier(
glue_batch, training=True
).numpy()
array([[ 0.08382261, 0.34465584],
[ 0.02057236, 0.24053624],
[ 0.04930754, 0.1117427 ],
[ 0.17041089, 0.20810834],
[ 0.21667874, 0.2840511 ],
[ 0.02325345, 0.33799925],
[-0.06198866, 0.13532838],
[ 0.084592 , 0.20711854],
[-0.04323687, 0.17096342],
[ 0.23759182, 0.16801538]], dtype=float32)
The TransformerEncoder
in the center of the classifier above is the bert_encoder
.
Inspecting the encoder, we see its stack of Transformer
layers connected to those same three inputs:
tf.keras.utils.plot_model(bert_encoder, show_shapes=True, dpi=48)
When built the encoder is randomly initialized. Restore the encoder's weights from the checkpoint:
checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(
os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fad4580ffd0>
Note: The pretrained TransformerEncoder
is also available on TensorFlow Hub. See the Hub appendix for details.
BERT adopts the Adam optimizer with weight decay (aka "AdamW"). It also employs a learning rate schedule that firstly warms up from 0 and then decays to 0.
# Set up epochs and steps
epochs = 3
batch_size = 32
eval_batch_size = 32
train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)
# creates an optimizer with learning rate schedule
optimizer = nlp.optimization.create_optimizer(
2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
This returns an AdamWeightDecay
optimizer with the learning rate schedule set:
type(optimizer)
official.nlp.optimization.AdamWeightDecay
To see an example of how to customize the optimizer and it's schedule, see the Optimizer schedule appendix.
The metric is accuracy and we use sparse categorical cross-entropy as loss.
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
bert_classifier.compile(
optimizer=optimizer,
loss=loss,
metrics=metrics)
bert_classifier.fit(
glue_train, glue_train_labels,
validation_data=(glue_validation, glue_validation_labels),
batch_size=32,
epochs=epochs)
Epoch 1/3
115/115 [==============================] - 26s 222ms/step - loss: 0.6151 - accuracy: 0.6611 - val_loss: 0.5462 - val_accuracy: 0.7451
Epoch 2/3
115/115 [==============================] - 24s 212ms/step - loss: 0.4447 - accuracy: 0.8010 - val_loss: 0.4150 - val_accuracy: 0.8309
Epoch 3/3
115/115 [==============================] - 24s 213ms/step - loss: 0.2830 - accuracy: 0.8964 - val_loss: 0.3697 - val_accuracy: 0.8480
<tensorflow.python.keras.callbacks.History at 0x7fad000ebda0>
Now run the fine-tuned model on a custom example to see that it works.
Start by encoding some sentence pairs:
my_examples = bert_encode(
glue_dict = {
'sentence1':[
'The rain in Spain falls mainly on the plain.',
'Look I fine tuned BERT.'],
'sentence2':[
'It mostly rains on the flat lands of Spain.',
'Is it working? This does not match.']
},
tokenizer=tokenizer)
The model should report class 1
"match" for the first example and class 0
"no-match" for the second:
result = bert_classifier(my_examples, training=False)
result = tf.argmax(result).numpy()
result
array([1, 0])
np.array(info.features['label'].names)[result]
array(['equivalent', 'not_equivalent'], dtype='<U14')
Often the goal of training a model is to use it for something, so export the model and then restore it to be sure that it works.
export_dir='./saved_model'
tf.saved_model.save(bert_classifier, export_dir=export_dir)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: ./saved_model/assets
INFO:tensorflow:Assets written to: ./saved_model/assets
reloaded = tf.saved_model.load(export_dir)
reloaded_result = reloaded([my_examples['input_word_ids'],
my_examples['input_mask'],
my_examples['input_type_ids']], training=False)
original_result = bert_classifier(my_examples, training=False)
# The results are (nearly) identical:
print(original_result.numpy())
print()
print(reloaded_result.numpy())
[[-0.95450354 1.1227685 ]
[ 0.40344787 -0.58954155]]
[[-0.95450354 1.1227684 ]
[ 0.4034478 -0.5895414 ]]
This tutorial you re-encoded the dataset in memory, for clarity.
This was only possible because glue/mrpc
is a very small dataset. To deal with larger datasets tf_models
library includes some tools for processing and re-encoding a dataset for efficient training.
The first step is to describe which features of the dataset should be transformed:
processor = nlp.data.classifier_data_lib.TfdsProcessor(
tfds_params="dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2",
process_text_fn=bert.tokenization.convert_to_unicode)
Then apply the transformation to generate new TFRecord files.
# Set up output of training and evaluation Tensorflow dataset
train_data_output_path="./mrpc_train.tf_record"
eval_data_output_path="./mrpc_eval.tf_record"
max_seq_length = 128
batch_size = 32
eval_batch_size = 32
# Generate and save training data into a tf record file
input_meta_data = (
nlp.data.classifier_data_lib.generate_tf_record_from_data_file(
processor=processor,
data_dir=None, # It is `None` because data is from tfds, not local dir.
tokenizer=tokenizer,
train_data_output_path=train_data_output_path,
eval_data_output_path=eval_data_output_path,
max_seq_length=max_seq_length))
Finally create tf.data
input pipelines from those TFRecord files:
training_dataset = bert.run_classifier.get_dataset_fn(
train_data_output_path,
max_seq_length,
batch_size,
is_training=True)()
evaluation_dataset = bert.run_classifier.get_dataset_fn(
eval_data_output_path,
max_seq_length,
eval_batch_size,
is_training=False)()
The resulting tf.data.Datasets
return (features, labels)
pairs, as expected by keras.Model.fit
:
training_dataset.element_spec
({'input_word_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
'input_mask': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
'input_type_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None)},
TensorSpec(shape=(32,), dtype=tf.int32, name=None))
If you need to modify the data loading here is some code to get you started:
def create_classifier_dataset(file_path, seq_length, batch_size, is_training):
"""Creates input dataset from (tf)records files for train/eval."""
dataset = tf.data.TFRecordDataset(file_path)
if is_training:
dataset = dataset.shuffle(100)
dataset = dataset.repeat()
def decode_record(record):
name_to_features = {
'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),
'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
'label_ids': tf.io.FixedLenFeature([], tf.int64),
}
return tf.io.parse_single_example(record, name_to_features)
def _select_data_from_record(record):
x = {
'input_word_ids': record['input_ids'],
'input_mask': record['input_mask'],
'input_type_ids': record['segment_ids']
}
y = record['label_ids']
return (x, y)
dataset = dataset.map(decode_record,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.map(
_select_data_from_record,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(batch_size, drop_remainder=is_training)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
return dataset
# Set up batch sizes
batch_size = 32
eval_batch_size = 32
# Return Tensorflow dataset
training_dataset = create_classifier_dataset(
train_data_output_path,
input_meta_data['max_seq_length'],
batch_size,
is_training=True)
evaluation_dataset = create_classifier_dataset(
eval_data_output_path,
input_meta_data['max_seq_length'],
eval_batch_size,
is_training=False)
training_dataset.element_spec
({'input_word_ids': TensorSpec(shape=(32, 128), dtype=tf.int64, name=None),
'input_mask': TensorSpec(shape=(32, 128), dtype=tf.int64, name=None),
'input_type_ids': TensorSpec(shape=(32, 128), dtype=tf.int64, name=None)},
TensorSpec(shape=(32,), dtype=tf.int64, name=None))
You can get the BERT model off the shelf from TFHub. It would not be hard to add a classification head on top of this hub.KerasLayer
# Note: 350MB download.
import tensorflow_hub as hub
hub_model_name = "bert_en_uncased_L-12_H-768_A-12"
hub_encoder = hub.KerasLayer(f"https://hub.tensorflow.google.cn/tensorflow/{hub_model_name}/2",
trainable=True)
print(f"The Hub encoder has {len(hub_encoder.trainable_variables)} trainable variables")
The Hub encoder has 199 trainable variables
Test run it on a batch of data:
result = hub_encoder(
inputs=[glue_train['input_word_ids'][:10],
glue_train['input_mask'][:10],
glue_train['input_type_ids'][:10],],
training=False,
)
print("Pooled output shape:", result[0].shape)
print("Sequence output shape:", result[1].shape)
Pooled output shape: (10, 768)
Sequence output shape: (10, 103, 768)
At this point it would be simple to add a classification head yourself.
The bert_models.classifier_model
function can also build a classifier onto the encoder from TensorFlow Hub:
hub_classifier, hub_encoder = bert.bert_models.classifier_model(
# Caution: Most of `bert_config` is ignored if you pass a hub url.
bert_config=bert_config, hub_module_url=hub_url_bert, num_labels=2)
The one downside to loading this model from TFHub is that the structure of internal keras layers is not restored. So it's more difficult to inspect or modify the model. The TransformerEncoder
model is now a single layer:
tf.keras.utils.plot_model(hub_classifier, show_shapes=True, dpi=64)
try:
tf.keras.utils.plot_model(hub_encoder, show_shapes=True, dpi=64)
assert False
except Exception as e:
print(f"{type(e).__name__}: {e}")
AttributeError: 'KerasLayer' object has no attribute 'layers'
If you need a more control over the construction of the model it's worth noting that the classifier_model
function used earlier is really just a thin wrapper over the nlp.modeling.networks.TransformerEncoder
and nlp.modeling.models.BertClassifier
classes. Just remember that if you start modifying the architecture it may not be correct or possible to reload the pre-trained checkpoint so you'll need to retrain from scratch.
Build the encoder:
transformer_config = config_dict.copy()
# You need to rename a few fields to make this work:
transformer_config['attention_dropout_rate'] = transformer_config.pop('attention_probs_dropout_prob')
transformer_config['activation'] = tf_utils.get_activation(transformer_config.pop('hidden_act'))
transformer_config['dropout_rate'] = transformer_config.pop('hidden_dropout_prob')
transformer_config['initializer'] = tf.keras.initializers.TruncatedNormal(
stddev=transformer_config.pop('initializer_range'))
transformer_config['max_sequence_length'] = transformer_config.pop('max_position_embeddings')
transformer_config['num_layers'] = transformer_config.pop('num_hidden_layers')
transformer_config
{'hidden_size': 768,
'intermediate_size': 3072,
'num_attention_heads': 12,
'type_vocab_size': 2,
'vocab_size': 30522,
'attention_dropout_rate': 0.1,
'activation': <function official.modeling.activations.gelu.gelu(x)>,
'dropout_rate': 0.1,
'initializer': <tensorflow.python.keras.initializers.initializers_v2.TruncatedNormal at 0x7fac08046e10>,
'max_sequence_length': 512,
'num_layers': 12}
manual_encoder = nlp.modeling.networks.TransformerEncoder(**transformer_config)
Restore the weights:
checkpoint = tf.train.Checkpoint(model=manual_encoder)
checkpoint.restore(
os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fabefa596d8>
Test run it:
result = manual_encoder(my_examples, training=True)
print("Sequence output shape:", result[0].shape)
print("Pooled output shape:", result[1].shape)
Sequence output shape: (2, 23, 768)
Pooled output shape: (2, 768)
Wrap it in a classifier:
manual_classifier = nlp.modeling.models.BertClassifier(
bert_encoder,
num_classes=2,
dropout_rate=transformer_config['dropout_rate'],
initializer=tf.keras.initializers.TruncatedNormal(
stddev=bert_config.initializer_range))
manual_classifier(my_examples, training=True).numpy()
array([[ 0.07863025, -0.02940944],
[ 0.30274656, 0.27299827]], dtype=float32)
The optimizer used to train the model was created using the nlp.optimization.create_optimizer
function:
optimizer = nlp.optimization.create_optimizer(
2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
That high level wrapper sets up the learning rate schedules and the optimizer.
The base learning rate schedule used here is a linear decay to zero over the training run:
epochs = 3
batch_size = 32
eval_batch_size = 32
train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
decay_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
initial_learning_rate=2e-5,
decay_steps=num_train_steps,
end_learning_rate=0)
plt.plot([decay_schedule(n) for n in range(num_train_steps)])
[<matplotlib.lines.Line2D at 0x7fabef5e69e8>]
This, in turn is wrapped in a WarmUp
schedule that linearly increases the learning rate to the target value over the first 10% of training:
warmup_steps = num_train_steps * 0.1
warmup_schedule = nlp.optimization.WarmUp(
initial_learning_rate=2e-5,
decay_schedule_fn=decay_schedule,
warmup_steps=warmup_steps)
# The warmup overshoots, because it warms up to the `initial_learning_rate`
# following the original implementation. You can set
# `initial_learning_rate=decay_schedule(warmup_steps)` if you don't like the
# overshoot.
plt.plot([warmup_schedule(n) for n in range(num_train_steps)])
[<matplotlib.lines.Line2D at 0x7fabef559630>]
Then create the nlp.optimization.AdamWeightDecay
using that schedule, configured for the BERT model:
optimizer = nlp.optimization.AdamWeightDecay(
learning_rate=warmup_schedule,
weight_decay_rate=0.01,
epsilon=1e-6,
exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])