Author: Jesse Chan
Date created: 2022/07/25
Last modified: 2022/07/25
Description: Train a mini-GPT model for text generation using KerasHub.
In this example, we will use KerasHub to build a scaled-down Generative Pre-Trained (GPT) model. GPT is a Transformer-based model that allows you to generate sophisticated text from a prompt.
We will train the model on the simplebooks-92 corpus, a dataset made from several novels. It is a good dataset for this example since it has a small vocabulary and high word frequency, which is beneficial when training a model with few parameters.
This example combines concepts from text generation with a miniature GPT and KerasHub abstractions. We will demonstrate how KerasHub tokenization, layers and metrics simplify the training process, and then show how to generate output text using the KerasHub sampling utilities.
Note: If you are running this example on Colab, make sure to enable the GPU runtime for faster training.
This example requires KerasHub. You can install it via the following command: pip install keras-hub
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
import keras_hub
import keras
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512 # Strings shorter than this will be discarded
SEQ_LEN = 128 # Length of training sequences, in tokens
# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000 # Limits parameters in model.
# Training
EPOCHS = 5
# Inference
NUM_TOKENS_TO_GENERATE = 80
Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has one of the smallest vocabulary-size to word-level-tokens ratios. It has a vocabulary size of ~98k, a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")
# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)
# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)
Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
282386239/282386239 ━━━━━━━━━━━━━━━━━━━━ 7s 0us/step
We train the tokenizer from the training dataset for a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as we will see later on that it has a large effect on the number of model parameters. We also don't want to include too few vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:

- "[PAD]" for padding sequences to SEQ_LEN. This token has index 0 in both reserved_tokens and vocab, since WordPieceTokenizer (and other layers) consider 0/vocab[0] as the default padding.
- "[UNK]" for OOV sub-words, which should match the default oov_token="[UNK]" in WordPieceTokenizer.
- "[BOS]" stands for beginning of sentence, though here it is technically a token representing the beginning of each line of training data.

# Train tokenizer vocabulary
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)
We use the vocabulary data to initialize keras_hub.tokenizers.WordPieceTokenizer. WordPieceTokenizer is an efficient implementation of the WordPiece algorithm used by BERT and other models. It will strip, lower-case and perform other irreversible preprocessing operations.
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
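As a quick sanity check (not part of the original example), we can confirm that the reserved tokens occupy the first indices of the vocabulary, with "[PAD]" at index 0:
# Reserved tokens were passed first, so they sit at the start of the vocabulary.
print(tokenizer.token_to_id("[PAD]"))  # 0
print(tokenizer.token_to_id("[UNK]"))  # 1
print(tokenizer.token_to_id("[BOS]"))  # 2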
We preprocess the dataset by tokenizing it and splitting it into features and labels.
# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)
def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels
# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
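To see what the preprocessing produces, here is a quick peek at one batch (an illustrative check, not in the original example): features starts with the [BOS] token, while labels is the same tokenization without it, so at every position the model is trained to predict the next token.
# Inspect one preprocessed batch (illustrative only).
for features, labels in train_ds.take(1):
    print(features[0, :8])  # begins with the [BOS] id
    print(labels[0, :8])    # the same tokens, one position ahead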
We create our scaled-down GPT model with the following layers:

- A keras_hub.layers.TokenAndPositionEmbedding layer, which combines the embedding for the token and its position.
- keras_hub.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with a decoder sequence only.

inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
Let's take a look at our model summary - a large majority of the parameters are in the token_and_position_embedding and the output dense layer! This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) has comparatively less of an effect.
model.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ token_and_position_embedding    │ (None, None, 256)         │  1,312,768 │
│ (TokenAndPositionEmbedding)     │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder             │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder_1           │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, None, 5000)        │  1,285,000 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 3,255,938 (12.42 MB)
Trainable params: 3,255,938 (12.42 MB)
Non-trainable params: 0 (0.00 B)
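As a quick back-of-the-envelope check (not part of the original example), the two largest entries in the summary can be reproduced from the hyperparameters: the embedding layer holds a token table plus a position table, and the output layer holds a weight matrix plus biases.
# Token embedding + position embedding parameters.
print(VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM)  # 1,312,768
# Output dense layer: weight matrix + biases.
print(EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE)  # 1,285,000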
Now that we have our model, let's train it with the fit() method.
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Epoch 1/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 216s 66ms/step - loss: 5.0008 - perplexity: 180.0715 - val_loss: 4.2176 - val_perplexity: 68.0438
Epoch 2/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 127s 48ms/step - loss: 4.1699 - perplexity: 64.7740 - val_loss: 4.0553 - val_perplexity: 57.7996
Epoch 3/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 126s 47ms/step - loss: 4.0286 - perplexity: 56.2138 - val_loss: 4.0134 - val_perplexity: 55.4446
Epoch 4/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 134s 50ms/step - loss: 3.9576 - perplexity: 52.3643 - val_loss: 3.9900 - val_perplexity: 54.1153
Epoch 5/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 135s 51ms/step - loss: 3.9080 - perplexity: 49.8242 - val_loss: 3.9500 - val_perplexity: 52.0006
<keras.src.callbacks.history.History at 0x7f7de0365ba0>
With our trained model, we can test it out to gauge its performance. To do this we can seed the model with an input sequence starting with the "[BOS]" token, and progressively sample the model by making predictions for each subsequent token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens
<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>
We will use the keras_hub.samplers module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.

Note: there are two pieces of more advanced functionality available when defining your callback. The first is the ability to take in a cache of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by keras_hub.samplers.ContrastiveSampler, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.
def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache
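To make the role of index concrete, here is a minimal manual decoding loop (an illustrative sketch only; the samplers below do this for us, with extra handling for masking and stopping). It repeatedly reads the logits at position index - 1 through the next callback and writes the argmax token into the prompt at position index.
import numpy as np

def manual_greedy_decode(prompt, num_steps=10):
    # Sketch of what a sampler does with the `next` callback defined above.
    prompt = np.array(prompt)
    for index in range(1, 1 + num_steps):
        logits, _, _ = next(prompt, None, index)
        prompt[:, index] = np.argmax(logits, axis=-1)
    return prompt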
Creating the wrapper function is the most complex part of using these utilities. Now that it's done, let's test out the different samplers, starting with greedy search.

We greedily pick the most probable token at each timestep. In other words, we take the argmax of the model output.
sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")
Greedy search generated text:
[b'[BOS] " i \' m going to tell you , " said the boy , " i \' ll tell you , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good']
As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!
At a high level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all sequences. It is an improvement over greedy search since it stores more possibilities. However, it is less efficient than greedy search since it has to compute and store multiple potential sequences.

Note: beam search with num_beams=1 is identical to greedy search.
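For intuition, one step of the beam search bookkeeping can be sketched as follows (an illustration only, not KerasHub's implementation): every beam is extended by every candidate token, and only the num_beams best joint scores survive.
import numpy as np

def beam_step(beams, beam_log_probs, next_token_log_probs, num_beams):
    # beams: (num_beams, length) token ids; beam_log_probs: (num_beams,) running scores;
    # next_token_log_probs: (num_beams, vocab_size) log-probs from the model.
    vocab_size = next_token_log_probs.shape[-1]
    joint = (beam_log_probs[:, None] + next_token_log_probs).reshape(-1)
    best = np.argsort(joint)[-num_beams:]
    beam_ids, token_ids = best // vocab_size, best % vocab_size
    new_beams = np.concatenate([beams[beam_ids], token_ids[:, None]], axis=1)
    return new_beams, joint[best]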
sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")
Beam search generated text:
[b'[BOS] " i don \' t know anything about it , " she said . " i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \'']
Similar to greedy search, beam search quickly starts repeating itself, since it is still a deterministic method.

Random search is our first probabilistic method. At each time step, it samples the next token using the softmax probabilities provided by the model.
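At its core this is just a categorical draw from the model's softmax output at the current position, for example (an illustrative snippet, not part of the original example):
import numpy as np

# Draw one token id from the softmax distribution at the first position.
probs = np.asarray(keras.ops.softmax(model(prompt_tokens)[0, 0, :]))
probs = probs / probs.sum()  # guard against float32 rounding in np.random.choice
sampled_id = np.random.choice(len(probs), p=probs)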
sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")
Random search generated text:
[b'[BOS] eleanor . like ice , not children would have suspicious forehead . they will see him , no goods in her plums . i have made a stump one , on the occasion , - - it is sacred , and one is unholy - plaything - - the partial consequences , and one refuge in a style of a boy , who was his grandmother . it was a young gentleman who bore off upon the middle of the day , rush and as he maltreated the female society , were growing at once . in and out of the craid little plays , stopping']
Voilà, no repetitions! However, with random search we may see some nonsensical words appearing, since any word in the vocabulary has a chance of appearing with this sampling method. This is fixed by our next search utility, top-k search.

Similar to random search, we sample the next token from the probability distribution provided by the model. The only difference is that here we select the top k most probable tokens and distribute the probability mass over them before sampling. This way, we won't be sampling from low-probability tokens, and hence we will have fewer nonsensical words!
sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")
Top-K search generated text:
[b'[BOS] " the young man was not the one , and the boy went away to the green forest . they were a little girl \' s wife , and the child loved him as much as he did , and he had often heard of a little girl who lived near the house . they were too tired to go , and when they went down to the barns and get into the barn , and they got the first of the barns that they had been taught to do so , and the little people went to their homes . she did , she told them that she had been a very clever , and they had made the first . she knew they']
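Under the hood, the filtering step of top-k search is simple to sketch (illustrative NumPy, not KerasHub's implementation): keep the k largest probabilities, zero out the rest, and renormalize before sampling.
import numpy as np

def top_k_filter(probs, k=10):
    # Zero out everything outside the k most probable tokens, then renormalize.
    top_indices = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()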
Even with top-k search, there is something to improve upon. With top-k search, the number k is fixed, which means it selects the same number of tokens for any probability distribution. Consider two scenarios, one where the probability mass is concentrated over 2 words and another where the probability mass is evenly spread across 10. Should we choose k=2 or k=10? There is no one-size-fits-all k here.

This is where top-p search comes in! Instead of choosing a k, we choose a probability p that we want the probabilities of the top tokens to sum up to. This way, we can dynamically adjust k based on the probability distribution. By setting p=0.9, if 90% of the probability mass is concentrated on the top 2 tokens, we filter out those 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, it will similarly filter out the top 10 tokens to sample from.
sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")
Top-P search generated text:
[b'[BOS] the children were both born in the spring , and the youngest sister were very much like the other children , but they did not see them . they were very happy , and their mother was a beautiful one . the youngest was one of the youngest sister of the youngest , and the youngest baby was very fond of the children . when they came home , they would see a little girl in the house , and had the beautiful family , and the children of the children had to sit and look on their backs , and the eldest children were very long , and they were so bright and happy , as they were , they had never noticed their hair ,']
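The corresponding nucleus filtering step can be sketched the same way (again an illustration, not KerasHub's implementation): sort by probability, keep the smallest set of tokens whose cumulative probability reaches p, and renormalize over that set.
import numpy as np

def top_p_filter(probs, p=0.5):
    # Keep the smallest prefix of the sorted distribution whose mass reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()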
We can also wrap the utilities in a callback, which allows you to print out a prediction sequence for every epoch of the model! Here is an example of a callback for top-k search:
class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        self.sampler = keras_hub.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")
text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])
Epoch 1/2
Top-K search generated text:
[b"[BOS] the young man was in the middle of a month , and he was able to take the crotch , but a long time , for he felt very well for himself in the sepoys ' s hands were chalks . he was the only boy , and he had a few years before been married , and the man said he was a tall one . he was a very handsome , and he was a very handsome young fellow , and a handsome , noble young man , but a boy , and man . he was a very handsome man , and was tall and handsome , and he looked like a gentleman . he was an"]
1/1 - 16s - 16s/step - loss: 3.9454 - perplexity: 51.6987
Epoch 2/2
Top-K search generated text:
[b'[BOS] " well , it is true . it is true that i should go to the house of a collector , in the matter of prussia that there is no other way there . there is no chance of being in the habit of being in the way of an invasion . i know not what i have done , but i have seen the man in the middle of a day . the next morning i shall take him to my father , for i am not the very day of the town , which would have been a little more than the one \' s daughter , i think it over and the whole affair will be']
1/1 - 17s - 17s/step - loss: 3.7860 - perplexity: 44.0932
<keras.src.callbacks.history.History at 0x7f7de0325600>
To recap, in this example we used KerasHub layers to train a sub-word vocabulary, tokenize training data, create a miniature GPT model, and perform inference with the text generation library.
If you would like to understand how Transformers work, or learn more about training the full GPT model, here are some further readings: