
GPT2 Text Generation with KerasNLP

Author: Chen Qian
Date created: 2023/04/17
Last modified: 2024/04/12
Description: Use the KerasNLP GPT2 model and samplers to do text generation.

ⓘ This example uses Keras 3

View in Colab • GitHub source

In this tutorial, you will learn to use KerasNLP to load a pre-trained Large Language Model (LLM) - the GPT-2 model (originally invented by OpenAI), finetune it to a specific text style, and generate text based on users' input (also known as a prompt). You will also learn how GPT2 adapts quickly to non-English languages, such as Chinese.


Before we begin

Colab offers different kinds of runtimes. Make sure to go to Runtime -> Change runtime type and choose the GPU Hardware Accelerator runtime (which should have >12G host RAM and ~15G GPU RAM) since you will finetune the GPT-2 model. Running this tutorial on a CPU runtime will take hours.


Install KerasNLP, Choose Backend and Import Dependencies

This example uses Keras 3 to work in any of "tensorflow", "jax" or "torch". Support for Keras 3 is baked into KerasNLP; simply change the "KERAS_BACKEND" environment variable to select the backend of your choice. We select the JAX backend below.

!pip install git+https://github.com/keras-team/keras-nlp.git -q
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras
import tensorflow as tf
import time

keras.mixed_precision.set_global_policy("mixed_float16")


Introduction to Generative Large Language Models (LLMs)

Large language models (LLMs) are a type of machine learning model that are trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks, such as text generation, question answering, and machine translation.

Generative LLMs are typically based on deep learning neural networks, such as the Transformer architecture invented by Google researchers in 2017, and are trained on massive amounts of text data, often involving billions of words. These models, such as Google LaMDA and PaLM, are trained with large datasets from a wide variety of data sources, which allows them to generate output for many tasks. The core of a generative LLM is predicting the next word in a sentence, often referred to as Causal LM Pretraining. In this way, LLMs can generate coherent text based on user prompts. For a more pedagogical discussion on language models, you can refer to the Stanford CS324 LLM class.
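
As a toy illustration of that objective (the token strings below are purely illustrative), each position in a training sequence is simply asked to predict the token that comes right after it:

# A toy sketch of causal language modeling: for every position, the target
# is the next token in the sequence.
tokens = ["My", "trip", "to", "Yosemite", "was", "great"]
inputs = tokens[:-1]   # ["My", "trip", "to", "Yosemite", "was"]
targets = tokens[1:]   # ["trip", "to", "Yosemite", "was", "great"]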


Introduction to KerasNLP

Large language models are complex to build and expensive to train from scratch. Luckily, there are pretrained LLMs available for use right away. KerasNLP provides a large number of pre-trained checkpoints that allow you to experiment with SOTA models without needing to train them yourself.

KerasNLP is a natural language processing library that supports users through their entire development cycle. KerasNLP offers both pretrained models and modularized building blocks, so developers can easily reuse pretrained models or stack their own LLM.

In a nutshell, for generative LLMs, KerasNLP offers:

  • Pretrained models with a generate() method, e.g., keras_nlp.models.GPT2CausalLM.
  • Sampler classes that implement generation algorithms such as Top-K, Beam, and contrastive search.


Load a pre-trained GPT-2 model and generate some text

KerasNLP provides a number of pre-trained models, such as Google Bert and GPT-2. You can see the list of models available in the KerasNLP repository.

It's very easy to load the GPT-2 model as shown below:

# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Once you have the model, you can use it to generate some text right away. Run the cells below to give it a try. It's as simple as calling a single function, generate():

start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
My trip to Yosemite was pretty awesome. The first time I went I didn't know how to go and it was pretty hard to get around. It was a bit like going on an adventure with a friend. The only things I could do were hike and climb the mountain. It's really cool to know you're not alone in this world. It's a lot of fun. I'm a little worried that I might not get to the top of the mountain in time to see the sunrise and sunset of the day. I think the weather is going to get a little warmer in the coming years.
This post is a little more in-depth on how to go on the trail. It covers how to hike on the Sierra Nevada, how to hike with the Sierra Nevada, how to hike in the Sierra Nevada, how to get to the top of the mountain, and how to get to the top with your own gear.
The Sierra Nevada is a very popular trail in Yosemite
TOTAL TIME ELAPSED: 25.36s

Try another one:

start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
That Italian restaurant is known for its delicious food, and the best part is that it has a full bar, with seating for a whole host of guests. And that's only because it's located at the heart of the neighborhood.
The menu at the Italian restaurant is pretty straightforward:
The menu consists of three main dishes:
Italian sausage
Bolognese
Sausage
Bolognese with cheese
Sauce with cream
Italian sausage with cheese
Bolognese with cheese
And the main menu consists of a few other things.
There are two tables: the one that serves a menu of sausage and bolognese with cheese (the one that serves the menu of sausage and bolognese with cheese) and the one that serves the menu of sausage and bolognese with cheese. The two tables are also open 24 hours a day, 7 days a week.
TOTAL TIME ELAPSED: 1.55s

Notice how much faster the second call is. This is because the computational graph is XLA-compiled in the first run and re-used behind the scenes in the second run.

The quality of the generated text looks OK, but we can improve it via fine-tuning.


More on the GPT-2 model from KerasNLP

Next up, we will actually fine-tune the model to update its parameters, but before we do, let's take a look at the full set of tools we have for working with GPT2.

The code of GPT2 can be found here. Conceptually, the GPT2CausalLM can be hierarchically broken down into several modules in KerasNLP, all of which have a from_preset() function that loads a pretrained model.
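
As a minimal sketch of that hierarchy (assuming the standard KerasNLP class names), each module can be loaded on its own from the same preset:

# A minimal sketch, assuming the standard KerasNLP GPT-2 class names; each
# module can be loaded independently from the same preset.
tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
backbone = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")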


Finetune on the Reddit dataset

Now that you have knowledge of the GPT-2 model from KerasNLP, you can take one step further to finetune the model so that it generates text in a specific style: short or long, strict or casual. In this tutorial, we will use the Reddit dataset as an example.

import tensorflow_datasets as tfds

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

Let's take a look inside sample data from the Reddit TensorFlow Dataset. There are two features:

  • document: text of the post.
  • title: the title.
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break
b"me and a friend decided to go to the beach last sunday. we loaded up and headed out. we were about half way there when i decided that i was not leaving till i had seafood. \n\nnow i'm not talking about red lobster. no friends i'm talking about a low country boil. i found the restaurant and got directions. i don't know if any of you have heard about the crab shack on tybee island but let me tell you it's worth it. \n\nwe arrived and was seated quickly. we decided to get a seafood sampler for two and split it. the waitress bought it out on separate platters for us. the amount of food was staggering. two types of crab, shrimp, mussels, crawfish, andouille sausage, red potatoes, and corn on the cob. i managed to finish it and some of my friends crawfish and mussels. it was a day to be a fat ass. we finished paid for our food and headed to the beach. \n\nfunny thing about seafood. it runs through me faster than a kenyan \n\nwe arrived and walked around a bit. it was about 45min since we arrived at the beach when i felt a rumble from the depths of my stomach. i ignored it i didn't want my stomach to ruin our fun. i pushed down the feeling and continued. about 15min later the feeling was back and stronger than before. again i ignored it and continued. 5min later it felt like a nuclear reactor had just exploded in my stomach. i started running. i yelled to my friend to hurry the fuck up. \n\nrunning in sand is extremely hard if you did not know this. we got in his car and i yelled at him to floor it. my stomach was screaming and if he didn't hurry i was gonna have this baby in his car and it wasn't gonna be pretty. after a few red lights and me screaming like a woman in labor we made it to the store. \n\ni practically tore his car door open and ran inside. i ran to the bathroom opened the door and barely got my pants down before the dam burst and a flood of shit poured from my ass. \n\ni finished up when i felt something wet on my ass. i rubbed it thinking it was back splash. no, mass was covered in the after math of me abusing the toilet. i grabbed all the paper towels i could and gave my self a whores bath right there. \n\ni sprayed the bathroom down with the air freshener and left. an elderly lady walked in quickly and closed the door. i was just about to walk away when i heard gag. instead of walking i ran. i got to the car and told him to get the hell out of there."
b'liking seafood'

In our case, we are performing next-word prediction in a language model, so we only need the 'document' feature.

train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

Now you can finetune the model using the familiar fit() function. Note that the preprocessor will be automatically called inside fit() because GPT2CausalLM is a keras_nlp.models.Task instance.

This step takes quite a bit of GPU memory and a long time if we were to train it all the way to a fully trained state. Here we just use part of the dataset for demo purposes.

train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
 500/500 ━━━━━━━━━━━━━━━━━━━━ 75s 120ms/step - accuracy: 0.3189 - loss: 3.3653

<keras.src.callbacks.history.History at 0x7f2af3fda410>

After fine-tuning is finished, you can again generate text using the same generate() function. This time, the text will be closer to Reddit writing style, and the generated length will be close to our preset length in the training set.

start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
I like basketball. it has the greatest shot of all time and the best shot of all time. i have to play a little bit more and get some practice time.
today i got the opportunity to play in a tournament in a city that is very close to my school so i was excited to see how it would go. i had just been playing with a few other guys, so i thought i would go and play a couple games with them. 
after a few games i was pretty confident and confident in myself. i had just gotten the opportunity and had to get some practice time. 
so i go to the
TOTAL TIME ELAPSED: 21.13s

Into the sampling method

In KerasNLP, we offer a few sampling methods, e.g., contrastive search, Top-K and beam sampling. By default, our GPT2CausalLM uses Top-K search, but you can choose your own sampling method.

Much like optimizers and activations, there are two ways to specify your custom sampler:

  • Use a string identifier, such as "greedy"; this way you use the default configuration.
  • Pass a keras_nlp.samplers.Sampler instance; this way you can use a custom configuration.
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself,
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
GPT-2 output:
I like basketball, and this is a pretty good one. 
first off, my wife is pretty good, she is a very good basketball player and she is really, really good at playing basketball. 
she has an amazing game called basketball, it is a pretty fun game. 
i play it on the couch.  i'm sitting there, watching the game on the couch.  my wife is playing with her phone.  she's playing on the phone with a bunch of people. 
my wife is sitting there and watching basketball.  she's sitting there watching
GPT-2 output:
I like basketball, but i don't like to play it. 
so i was playing basketball at my local high school, and i was playing with my friends. 
i was playing with my friends, and i was playing with my brother, who was playing basketball with his brother. 
so i was playing with my brother, and he was playing with his brother's brother. 
so i was playing with my brother, and he was playing with his brother's brother. 
so i was playing with my brother, and he was playing with his brother's brother. 
so i was playing with my brother, and he was playing with his brother's brother. 
so i was playing with my brother, and he was playing with his brother's brother. 
so i was playing with my brother, and he was playing with his brother

For more details on the KerasNLP Sampler class, you can check the code here.
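
A sampler can also be constructed with its own parameters before being passed to compile(); a minimal sketch, assuming TopKSampler accepts a k argument (check the Sampler code linked above for the exact signature):

# A minimal sketch, assuming `TopKSampler` takes a `k` argument.
top_k_sampler = keras_nlp.samplers.TopKSampler(k=10)
gpt2_lm.compile(sampler=top_k_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)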


Finetune on a Chinese poem dataset

We can also finetune GPT2 on non-English datasets. For readers who know Chinese, this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach our model to become a poet!

Because GPT2 uses a byte-pair encoder, and the original pretraining dataset contains some Chinese characters, we can use the original vocab to finetune on the Chinese dataset.
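
As a quick sanity check, the byte-pair tokenizer loaded earlier can already encode Chinese text; a minimal sketch, assuming the preprocessor exposes its tokenizer as preprocessor.tokenizer:

# A minimal sketch: byte-pair encoding falls back to byte-level tokens for
# Chinese characters, so one character may map to several token ids.
print(preprocessor.tokenizer("昨夜雨疏风骤"))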

!# Load chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git
Cloning into 'chinese-poetry'...

Load text from the json files. We only use the《全唐詩》(Complete Tang Poems) collection for demo purposes.

import os
import json

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        content = json.load(f)
        poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]

Let's take a look at sample data.

print(paragraphs[0])
毋謂支山險,此山能幾何。崎嶔十年夢,知歷幾蹉跎。

Similar to the Reddit example, we convert the data to a TF dataset, and only use part of the data to train.

train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes long, only take `500` and run 1
# epochs for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
 500/500 ━━━━━━━━━━━━━━━━━━━━ 49s 71ms/step - accuracy: 0.2357 - loss: 2.8196

<keras.src.callbacks.history.History at 0x7f2b2c192bc0>

Let's check the result!

output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)
昨夜雨疏风骤,爲臨江山院短靜。石淡山陵長爲羣,臨石山非處臨羣。美陪河埃聲爲羣,漏漏漏邊陵塘

Not bad 😀