Author: Chen Qian
Date created: 2023/04/17
Last modified: 2024/04/12
Description: Use KerasHub GPT2 model and samplers to do text generation.
In this tutorial, you will learn to use KerasHub to load a pre-trained Large Language Model (LLM) - the GPT-2 model (originally invented by OpenAI), finetune it to a specific text style, and generate text based on users' input (also known as a prompt). You will also learn how GPT2 adapts quickly to non-English languages, such as Chinese.
Colab offers different kinds of runtimes. Make sure to go to Runtime -> Change runtime type and choose the GPU Hardware Accelerator runtime (which should have >12G host RAM and ~15G GPU RAM), since you will finetune the GPT-2 model. Running this tutorial on a CPU runtime will take hours.
This example uses Keras 3 to work in any of the "tensorflow", "jax" or "torch" backends. Support for Keras 3 is built into KerasHub: simply change the "KERAS_BACKEND" environment variable to select the backend of your choice. We select the JAX backend below.
!pip install git+https://github.com/keras-team/keras-hub.git -q
import os
os.environ["KERAS_BACKEND"] = "jax" # or "tensorflow" or "torch"
import keras_hub
import keras
import tensorflow as tf
import time
# Run compute in float16 (variables stay float32) to speed up training and generation.
keras.mixed_precision.set_global_policy("mixed_float16")
A large language model (LLM) is a type of machine learning model trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks, such as text generation, question answering, and machine translation.
Generative LLMs are typically based on deep learning neural networks, such as the Transformer architecture invented by Google researchers in 2017, and are trained on massive amounts of text data, often involving billions of words. These models, such as Google LaMDA and PaLM, are trained with large datasets from a wide variety of data sources, which allows them to generate output for many tasks. The core of a generative LLM is predicting the next word in a sentence, often referred to as Causal LM Pretraining. In this way LLMs can generate coherent text based on user prompts. For a more pedagogical discussion on language models, you can refer to the Stanford CS324 LLM class.
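For intuition, here is a toy, hand-rolled sketch (plain Python, not KerasHub code) of the training pairs that causal LM pretraining constructs: the label at each position is simply the next token of the same sequence.
# A toy illustration of causal LM training pairs: each position's label is
# the next token in the sequence.
tokens = ["My", "trip", "to", "Yosemite", "was", "great"]
inputs, labels = tokens[:-1], tokens[1:]
for context, target in zip(inputs, labels):
    print(f"given ... {context!r} -> predict {target!r}")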
Large language models are complicated to build and expensive to train from scratch. Luckily, there are pretrained LLMs available for use right away. KerasHub provides a large number of pretrained checkpoints that allow you to experiment with SOTA models without training them yourself.
KerasHub is a natural language processing library that supports users through their entire development cycle. KerasHub offers both pretrained models and modularized building blocks, so developers can easily reuse pretrained models or stack their own LLM.
In a nutshell, for generative LLMs, KerasHub offers pretrained models with a generate() method, e.g., keras_hub.models.GPT2CausalLM and keras_hub.models.OPTCausalLM. KerasHub provides many pretrained models, such as Google Bert and GPT-2. You can see the list of available models in the KerasHub repository.
As you can see below, it is very easy to load the GPT-2 model:
# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
"gpt2_base_en",
sequence_length=128,
)
gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset(
"gpt2_base_en", preprocessor=preprocessor
)
Once the model is loaded, you can use it to generate some text right away. Run the cells below to give it a try. It's as simple as calling a single function, generate():
start = time.time()
output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
My trip to Yosemite was pretty awesome. The first time I went I didn't know how to go and it was pretty hard to get around. It was a bit like going on an adventure with a friend. The only things I could do were hike and climb the mountain. It's really cool to know you're not alone in this world. It's a lot of fun. I'm a little worried that I might not get to the top of the mountain in time to see the sunrise and sunset of the day. I think the weather is going to get a little warmer in the coming years.
This post is a little more in-depth on how to go on the trail. It covers how to hike on the Sierra Nevada, how to hike with the Sierra Nevada, how to hike in the Sierra Nevada, how to get to the top of the mountain, and how to get to the top with your own gear.
The Sierra Nevada is a very popular trail in Yosemite
TOTAL TIME ELAPSED: 25.36s
Try another one:
start = time.time()
output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
That Italian restaurant is known for its delicious food, and the best part is that it has a full bar, with seating for a whole host of guests. And that's only because it's located at the heart of the neighborhood.
The menu at the Italian restaurant is pretty straightforward:
The menu consists of three main dishes:
Italian sausage
Bolognese
Sausage
Bolognese with cheese
Sauce with cream
Italian sausage with cheese
Bolognese with cheese
And the main menu consists of a few other things.
There are two tables: the one that serves a menu of sausage and bolognese with cheese (the one that serves the menu of sausage and bolognese with cheese) and the one that serves the menu of sausage and bolognese with cheese. The two tables are also open 24 hours a day, 7 days a week.
TOTAL TIME ELAPSED: 1.55s
Notice how much faster the second call is. This is because the computational graph is XLA-compiled in the first run and re-used in the second behind the scenes.
The quality of the generated text looks OK, but we can improve it via fine-tuning.
Next, we will actually fine-tune the model to update its parameters, but before we do, let's take a look at the full set of tools we have for working with GPT2.
The code of GPT2 can be found here. Conceptually, GPT2CausalLM can be hierarchically broken down into several modules in KerasHub, all of which have a from_preset() function that loads a pretrained model (a short sketch follows this list):
- keras_hub.models.GPT2Tokenizer: The tokenizer used by the GPT2 model, which is a byte-pair encoder.
- keras_hub.models.GPT2CausalLMPreprocessor: The preprocessor used by GPT2 causal LM training. It does the tokenization along with other preprocessing work, such as creating the labels and appending the end token.
- keras_hub.models.GPT2Backbone: The GPT2 model, which is a stack of keras_hub.layers.TransformerDecoder. This is usually just referred to as GPT2.
- keras_hub.models.GPT2CausalLM: wraps GPT2Backbone, multiplying the output of GPT2Backbone by the embedding matrix to generate logits over the vocab tokens.
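As a quick, hedged sketch of this decomposition, the snippet below inspects the tokenizer and backbone already attached to gpt2_lm; it assumes the task exposes them as preprocessor.tokenizer and backbone, and each class also has its own from_preset() constructor.
# A minimal sketch: inspect the modules already attached to `gpt2_lm`.
tokenizer = gpt2_lm.preprocessor.tokenizer  # GPT2Tokenizer, a byte-pair encoder
print(tokenizer("My trip to Yosemite was"))  # byte-pair encoded token ids
backbone = gpt2_lm.backbone  # GPT2Backbone, a stack of TransformerDecoder layers
print(backbone.count_params())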
Now that you know the GPT-2 model from KerasHub, you can take one step further and finetune the model so that it generates text in a specific style, short or long, strict or casual. In this tutorial, we use the reddit dataset as an example.
import tensorflow_datasets as tfds
reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)
Let's take a look inside sample data from the reddit TensorFlow Dataset. There are two features:
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break
b"me and a friend decided to go to the beach last sunday. we loaded up and headed out. we were about half way there when i decided that i was not leaving till i had seafood. \n\nnow i'm not talking about red lobster. no friends i'm talking about a low country boil. i found the restaurant and got directions. i don't know if any of you have heard about the crab shack on tybee island but let me tell you it's worth it. \n\nwe arrived and was seated quickly. we decided to get a seafood sampler for two and split it. the waitress bought it out on separate platters for us. the amount of food was staggering. two types of crab, shrimp, mussels, crawfish, andouille sausage, red potatoes, and corn on the cob. i managed to finish it and some of my friends crawfish and mussels. it was a day to be a fat ass. we finished paid for our food and headed to the beach. \n\nfunny thing about seafood. it runs through me faster than a kenyan \n\nwe arrived and walked around a bit. it was about 45min since we arrived at the beach when i felt a rumble from the depths of my stomach. i ignored it i didn't want my stomach to ruin our fun. i pushed down the feeling and continued. about 15min later the feeling was back and stronger than before. again i ignored it and continued. 5min later it felt like a nuclear reactor had just exploded in my stomach. i started running. i yelled to my friend to hurry the fuck up. \n\nrunning in sand is extremely hard if you did not know this. we got in his car and i yelled at him to floor it. my stomach was screaming and if he didn't hurry i was gonna have this baby in his car and it wasn't gonna be pretty. after a few red lights and me screaming like a woman in labor we made it to the store. \n\ni practically tore his car door open and ran inside. i ran to the bathroom opened the door and barely got my pants down before the dam burst and a flood of shit poured from my ass. \n\ni finished up when i felt something wet on my ass. i rubbed it thinking it was back splash. no, mass was covered in the after math of me abusing the toilet. i grabbed all the paper towels i could and gave my self a whores bath right there. \n\ni sprayed the bathroom down with the air freshener and left. an elderly lady walked in quickly and closed the door. i was just about to walk away when i heard gag. instead of walking i ran. i got to the car and told him to get the hell out of there."
b'liking seafood'
In our case, we are performing next-word prediction in a language model, so we only need the "document" feature.
train_ds = (
reddit_ds.map(lambda document, _: document)
.batch(32)
.cache()
.prefetch(tf.data.AUTOTUNE)
)
Now you can finetune the model using the familiar fit() function. Note that the preprocessor will be automatically called inside the fit method since GPT2CausalLM is a keras_hub.models.Task instance.
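Because the preprocessor runs inside fit(), you can also call it directly to see what the model will be trained on. A hedged sketch follows; it assumes the preprocessor returns a (features, labels, sample_weights) triple with "token_ids" and "padding_mask" as feature keys.
# A hedged sketch: inspect what the preprocessor feeds into fit().
x, y, sample_weight = preprocessor(["an example reddit post"])
print(x["token_ids"].shape, y.shape, sample_weight.shape)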
This step would take quite a bit of GPU memory and a long time if we were to train it all the way to a fully trained state. Here we just use part of the dataset for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1
# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
5e-5,
decay_steps=train_ds.cardinality() * num_epochs,
end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
optimizer=keras.optimizers.Adam(learning_rate),
loss=loss,
weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
500/500 ━━━━━━━━━━━━━━━━━━━━ 75s 120ms/step - accuracy: 0.3189 - loss: 3.3653
<keras.src.callbacks.history.History at 0x7f2af3fda410>
After fine-tuning is done, you can generate text using the same generate() function again. This time, the text will be closer to the Reddit writing style, and the generated length will be close to our preset length in the training set.
start = time.time()
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
GPT-2 output:
I like basketball. it has the greatest shot of all time and the best shot of all time. i have to play a little bit more and get some practice time.
today i got the opportunity to play in a tournament in a city that is very close to my school so i was excited to see how it would go. i had just been playing with a few other guys, so i thought i would go and play a couple games with them.
after a few games i was pretty confident and confident in myself. i had just gotten the opportunity and had to get some practice time.
so i go to the
TOTAL TIME ELAPSED: 21.13s
In KerasHub, we offer a few sampling methods, e.g., contrastive search, Top-K and beam sampling. By default, our GPT2CausalLM uses Top-k search, but you can choose your own sampling method.
Much like optimizers and activations, there are two ways to specify your custom sampler:
- Use a string identifier, such as "greedy"; you use the default configuration this way.
- Pass a keras_hub.samplers.Sampler instance; you can use a custom configuration this way.
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_hub.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
GPT-2 output:
I like basketball, and this is a pretty good one.
first off, my wife is pretty good, she is a very good basketball player and she is really, really good at playing basketball.
she has an amazing game called basketball, it is a pretty fun game.
i play it on the couch. i'm sitting there, watching the game on the couch. my wife is playing with her phone. she's playing on the phone with a bunch of people.
my wife is sitting there and watching basketball. she's sitting there watching
GPT-2 output:
I like basketball, but i don't like to play it.
so i was playing basketball at my local high school, and i was playing with my friends.
i was playing with my friends, and i was playing with my brother, who was playing basketball with his brother.
so i was playing with my brother, and he was playing with his brother's brother.
so i was playing with my brother, and he was playing with his brother's brother.
so i was playing with my brother, and he was playing with his brother's brother.
so i was playing with my brother, and he was playing with his brother's brother.
so i was playing with my brother, and he was playing with his brother's brother.
so i was playing with my brother, and he was playing with his brother
For more details on the KerasHub Sampler class, you can check the code here.
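As one more hedged sketch, a Sampler instance with a custom configuration could look like the following; k and temperature are assumed here to be accepted constructor arguments of TopKSampler.
# A hedged sketch of a custom-configured sampler instance; `k` and
# `temperature` are assumed constructor arguments of `TopKSampler`.
top_k_sampler = keras_hub.samplers.TopKSampler(k=10, temperature=0.7)
gpt2_lm.compile(sampler=top_k_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print(output)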
We can also finetune GPT2 on non-English datasets. For readers who know Chinese, this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach our model to become a poet!
Because GPT2 uses a byte-pair encoder, and the original pretraining dataset contains some Chinese characters, we can use the original vocabulary to finetune on a Chinese dataset.
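As a small hedged check of this claim, you can run the attached byte-pair tokenizer on a Chinese string; characters outside the learned merges fall back to multiple byte-level pieces, so no new vocabulary is needed.
# A hedged sketch: the byte-pair tokenizer encodes Chinese text via byte-level pieces.
print(gpt2_lm.preprocessor.tokenizer("昨夜雨疏风骤"))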
!# Load chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git
Cloning into 'chinese-poetry'...
Load the text from the json files. We only use 《全唐诗》 (Complete Tang Poems) for demo purposes.
import os
import json
poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        content = json.load(f)
        poem_collection.extend(content)
paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]
Let's take a look at the sample data.
print(paragraphs[0])
毋謂支山險,此山能幾何。崎嶔十年夢,知歷幾蹉跎。
Similar to the Reddit example, we convert to a TF dataset and only use partial data for training.
train_ds = (
tf.data.Dataset.from_tensor_slices(paragraphs)
.batch(16)
.cache()
.prefetch(tf.data.AUTOTUNE)
)
# Running through the whole dataset takes long, only take `500` and run 1
# epochs for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1
learning_rate = keras.optimizers.schedules.PolynomialDecay(
5e-4,
decay_steps=train_ds.cardinality() * num_epochs,
end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
optimizer=keras.optimizers.Adam(learning_rate),
loss=loss,
weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
500/500 ━━━━━━━━━━━━━━━━━━━━ 49s 71ms/step - accuracy: 0.2357 - loss: 2.8196
<keras.src.callbacks.history.History at 0x7f2b2c192bc0>
Let's check the result!
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)
昨夜雨疏风骤,爲臨江山院短靜。石淡山陵長爲羣,臨石山非處臨羣。美陪河埃聲爲羣,漏漏漏邊陵塘
Not bad 😀