Author: Jesse Chan
Date created: 2022/07/25
Last modified: 2022/07/25
Description: Train a mini-GPT model for text generation using KerasHub.
In this example, we will use KerasHub to build a scaled-down Generative Pre-Trained (GPT) model. GPT is a Transformer-based model that allows you to generate sophisticated text from a prompt.
We will train the model on the simplebooks-92 corpus, a dataset made from several novels. It is a good dataset for this example since it has a small vocabulary and high word frequency, which is beneficial when training a model with few parameters.
This example combines concepts from text generation with a miniature GPT and KerasHub abstractions. We will demonstrate how KerasHub tokenization, layers and metrics simplify the training process, and then show how to generate output text using the KerasHub sampling utilities.
Note: If you are running this example on Colab, make sure to enable the GPU runtime for faster training.
This example requires KerasHub. You can install it via the following command: pip install keras-hub
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
import keras_hub
import keras
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512 # Strings shorter than this will be discarded
SEQ_LEN = 128 # Length of training sequences, in tokens
# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000 # Limits parameters in model.
# Training
EPOCHS = 5
# Inference
NUM_TOKENS_TO_GENERATE = 80
Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has one of the smallest vocabulary-size to word-level-tokens ratios. It has a vocabulary size of ~98k, a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")
# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)
# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)
Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
282386239/282386239 ━━━━━━━━━━━━━━━━━━━━ 7s 0us/step
We train the tokenizer from the training dataset for a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as we will see later on that it has a large effect on the number of model parameters. We also don't want to include too few vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:

- "[PAD]" for padding sequences to SEQ_LEN. This token has index 0 in both reserved_tokens and vocab, since WordPieceTokenizer (and other layers) consider 0/vocab[0] as the default padding.
- "[UNK]" for OOV sub-words, which should match the default oov_token="[UNK]" in WordPieceTokenizer.
- "[BOS]" stands for beginning of sentence, though here it is technically a token representing the beginning of each line of training data.

# Train tokenizer vocabulary
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)
We use the vocabulary data to initialize keras_hub.tokenizers.WordPieceTokenizer. WordPieceTokenizer is an efficient implementation of the WordPiece algorithm used by BERT and other models. It will strip, lower-case and perform other irreversible preprocessing operations.
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
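As a quick sanity check (not part of the original example), we can confirm that the reserved tokens occupy the first indices of the vocabulary, with "[PAD]" at index 0:
# Reserved tokens were passed first, so they sit at the start of the vocabulary.
print(tokenizer.token_to_id("[PAD]"))  # 0
print(tokenizer.token_to_id("[UNK]"))  # 1
print(tokenizer.token_to_id("[BOS]"))  # 2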
We preprocess the dataset by tokenizing it and splitting it into features and labels.
# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)
def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels
# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
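To see what the preprocessing produces, here is a quick peek at one batch (an illustrative check, not in the original example): features starts with the [BOS] token, while labels is the same tokenization without it, so at every position the model is trained to predict the next token.
# Inspect one preprocessed batch (illustrative only).
for features, labels in train_ds.take(1):
    print(features[0, :8])  # begins with the [BOS] id
    print(labels[0, :8])    # the same tokens, one position ahead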
We create our scaled-down GPT model with the following layers:

- A keras_hub.layers.TokenAndPositionEmbedding layer, which combines the embedding for the token and its position.
- keras_hub.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with a decoder sequence only.

inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
Let's take a look at our model summary - a large majority of the parameters are in the token_and_position_embedding and the output dense layer! This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) has comparatively less of an effect.
model.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ token_and_position_embedding    │ (None, None, 256)         │  1,312,768 │
│ (TokenAndPositionEmbedding)     │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder             │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder_1           │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, None, 5000)        │  1,285,000 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 3,255,938 (12.42 MB)
Trainable params: 3,255,938 (12.42 MB)
Non-trainable params: 0 (0.00 B)
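As a quick back-of-the-envelope check (not part of the original example), the two largest entries in the summary can be reproduced from the hyperparameters: the embedding layer holds a token table plus a position table, and the output layer holds a weight matrix plus biases.
# Token embedding + position embedding parameters.
print(VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM)  # 1,312,768
# Output dense layer: weight matrix + biases.
print(EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE)  # 1,285,000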
Now that we have our model, let's train it with the fit() method.
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Epoch 1/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 216s 66ms/step - loss: 5.0008 - perplexity: 180.0715 - val_loss: 4.2176 - val_perplexity: 68.0438
Epoch 2/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 127s 48ms/step - loss: 4.1699 - perplexity: 64.7740 - val_loss: 4.0553 - val_perplexity: 57.7996
Epoch 3/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 126s 47ms/step - loss: 4.0286 - perplexity: 56.2138 - val_loss: 4.0134 - val_perplexity: 55.4446
Epoch 4/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 134s 50ms/step - loss: 3.9576 - perplexity: 52.3643 - val_loss: 3.9900 - val_perplexity: 54.1153
Epoch 5/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 135s 51ms/step - loss: 3.9080 - perplexity: 49.8242 - val_loss: 3.9500 - val_perplexity: 52.0006
<keras.src.callbacks.history.History at 0x7f7de0365ba0>
With our trained model, we can test it out to gauge its performance. To do this we can seed the model with an input sequence starting with the "[BOS]" token, and progressively sample the model by making predictions for each subsequent token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens
<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>
We will use the keras_hub.samplers module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.

Note: there are two pieces of more advanced functionality available when defining your callback. The first is the ability to take in a cache of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by keras_hub.samplers.ContrastiveSampler, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.
def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache
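To make the role of index concrete, here is a minimal manual decoding loop (an illustrative sketch only; the samplers below do this for us, with extra handling for masking and stopping). It repeatedly reads the logits at position index - 1 through the next callback and writes the argmax token into the prompt at position index.
import numpy as np

def manual_greedy_decode(prompt, num_steps=10):
    # Sketch of what a sampler does with the `next` callback defined above.
    prompt = np.array(prompt)
    for index in range(1, 1 + num_steps):
        logits, _, _ = next(prompt, None, index)
        prompt[:, index] = np.argmax(logits, axis=-1)
    return prompt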
Creating the wrapper function is the most complex part of using these utilities. Now that it's done, let's test out the different samplers, starting with greedy search.

We greedily pick the most probable token at each timestep. In other words, we take the argmax of the model output.
sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")
Greedy search generated text:
[b'[BOS] " i \' m going to tell you , " said the boy , " i \' ll tell you , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good']
As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!
At a high level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all sequences. It is an improvement over greedy search since it stores more possibilities. However, it is less efficient than greedy search since it has to compute and store multiple potential sequences.

Note: beam search with num_beams=1 is identical to greedy search.
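For intuition, one step of the beam search bookkeeping can be sketched as follows (an illustration only, not KerasHub's implementation): every beam is extended by every candidate token, and only the num_beams best joint scores survive.
import numpy as np

def beam_step(beams, beam_log_probs, next_token_log_probs, num_beams):
    # beams: (num_beams, length) token ids; beam_log_probs: (num_beams,) running scores;
    # next_token_log_probs: (num_beams, vocab_size) log-probs from the model.
    vocab_size = next_token_log_probs.shape[-1]
    joint = (beam_log_probs[:, None] + next_token_log_probs).reshape(-1)
    best = np.argsort(joint)[-num_beams:]
    beam_ids, token_ids = best // vocab_size, best % vocab_size
    new_beams = np.concatenate([beams[beam_ids], token_ids[:, None]], axis=1)
    return new_beams, joint[best]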
sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")
Beam search generated text:
[b'[BOS] " i don \' t know anything about it , " she said . " i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \'']
Similar to greedy search, beam search quickly starts repeating itself, since it is still a deterministic method.

Random search is our first probabilistic method. At each time step, it samples the next token using the softmax probabilities provided by the model.
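At its core this is just a categorical draw from the model's softmax output at the current position, for example (an illustrative snippet, not part of the original example):
import numpy as np

# Draw one token id from the softmax distribution at the first position.
probs = np.asarray(keras.ops.softmax(model(prompt_tokens)[0, 0, :]))
probs = probs / probs.sum()  # guard against float32 rounding in np.random.choice
sampled_id = np.random.choice(len(probs), p=probs)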
sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")
Random search generated text:
[b'[BOS] eleanor . like ice , not children would have suspicious forehead . they will see him , no goods in her plums . i have made a stump one , on the occasion , - - it is sacred , and one is unholy - plaything - - the partial consequences , and one refuge in a style of a boy , who was his grandmother . it was a young gentleman who bore off upon the middle of the day , rush and as he maltreated the female society , were growing at once . in and out of the craid little plays , stopping']
Voilà, no repetitions! However, with random search we may see some nonsensical words appearing, since any word in the vocabulary has a chance of appearing with this sampling method. This is fixed by our next search utility, top-k search.

Similar to random search, we sample the next token from the probability distribution provided by the model. The only difference is that here we select the top k most probable tokens and distribute the probability mass over them before sampling. This way, we won't be sampling from low-probability tokens, and hence we will have fewer nonsensical words!
sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")
Top-K search generated text:
[b'[BOS] " the young man was not the one , and the boy went away to the green forest . they were a little girl \' s wife , and the child loved him as much as he did , and he had often heard of a little girl who lived near the house . they were too tired to go , and when they went down to the barns and get into the barn , and they got the first of the barns that they had been taught to do so , and the little people went to their homes . she did , she told them that she had been a very clever , and they had made the first . she knew they']
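Under the hood, the filtering step of top-k search is simple to sketch (illustrative NumPy, not KerasHub's implementation): keep the k largest probabilities, zero out the rest, and renormalize before sampling.
import numpy as np

def top_k_filter(probs, k=10):
    # Zero out everything outside the k most probable tokens, then renormalize.
    top_indices = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()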
Even with top-k search, there is something to improve upon. With top-k search, the number k is fixed, which means it selects the same number of tokens for any probability distribution. Consider two scenarios, one where the probability mass is concentrated over 2 words and another where the probability mass is evenly spread across 10. Should we choose k=2 or k=10? There is no one-size-fits-all k here.

This is where top-p search comes in! Instead of choosing a k, we choose a probability p that we want the probabilities of the top tokens to sum up to. This way, we can dynamically adjust k based on the probability distribution. By setting p=0.9, if 90% of the probability mass is concentrated on the top 2 tokens, we filter out those 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, it will similarly filter out the top 10 tokens to sample from.
sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")
Top-P search generated text:
[b'[BOS] the children were both born in the spring , and the youngest sister were very much like the other children , but they did not see them . they were very happy , and their mother was a beautiful one . the youngest was one of the youngest sister of the youngest , and the youngest baby was very fond of the children . when they came home , they would see a little girl in the house , and had the beautiful family , and the children of the children had to sit and look on their backs , and the eldest children were very long , and they were so bright and happy , as they were , they had never noticed their hair ,']
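The corresponding nucleus filtering step can be sketched the same way (again an illustration, not KerasHub's implementation): sort by probability, keep the smallest set of tokens whose cumulative probability reaches p, and renormalize over that set.
import numpy as np

def top_p_filter(probs, p=0.5):
    # Keep the smallest prefix of the sorted distribution whose mass reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()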
We can also wrap the utilities in a callback, which allows you to print out a prediction sequence for every epoch of the model! Here is an example of a callback for top-k search:
class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        self.sampler = keras_hub.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")
text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])
Epoch 1/2
Top-K search generated text:
[b"[BOS] the young man was in the middle of a month , and he was able to take the crotch , but a long time , for he felt very well for himself in the sepoys ' s hands were chalks . he was the only boy , and he had a few years before been married , and the man said he was a tall one . he was a very handsome , and he was a very handsome young fellow , and a handsome , noble young man , but a boy , and man . he was a very handsome man , and was tall and handsome , and he looked like a gentleman . he was an"]
1/1 - 16s - 16s/step - loss: 3.9454 - perplexity: 51.6987
Epoch 2/2
Top-K search generated text:
[b'[BOS] " well , it is true . it is true that i should go to the house of a collector , in the matter of prussia that there is no other way there . there is no chance of being in the habit of being in the way of an invasion . i know not what i have done , but i have seen the man in the middle of a day . the next morning i shall take him to my father , for i am not the very day of the town , which would have been a little more than the one \' s daughter , i think it over and the whole affair will be']
1/1 - 17s - 17s/step - loss: 3.7860 - perplexity: 44.0932
<keras.src.callbacks.history.History at 0x7f7de0325600>
To recap, in this example we used KerasHub layers to train a sub-word vocabulary, tokenize training data, create a miniature GPT model, and perform inference with the text generation library.
If you would like to understand how Transformers work, or learn more about training the full GPT model, here are some further readings: