Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 with Hugging Face Transformers for abstractive summarization.
Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization is also known as Abstractive Summarization and has seen growing interest among researchers in various disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. The Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 has shown impressive results on a variety of sequence-to-sequence tasks (sequence in this notebook refers to text), such as summarization, translation, etc.

In this notebook, we will fine-tune the pretrained T5 on the abstractive summarization task using Hugging Face Transformers on the XSum dataset loaded from Hugging Face Datasets.
!pip install transformers==4.20.0
!pip install keras_hub==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score
import os
import logging
import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras
# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1
MAX_INPUT_LENGTH = 1024 # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5 # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128 # Maximum length of the output by the model
BATCH_SIZE = 8 # Batch-size for training our model
LEARNING_RATE = 2e-5 # Learning-rate for training our model
MAX_EPOCHS = 1 # Maximum number of epochs we will train the model for
# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"
We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary), usually written by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate our sequence-to-sequence abstractive summarization approach.
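To get an intuition for what ROUGE-L measures, here is a minimal, illustrative example (not part of the original tutorial) using the rouge-score package installed above; the reference and candidate sentences are made up:

from rouge_score import rouge_scorer

# Toy example: ROUGE-L scores the longest-common-subsequence overlap
# between a reference summary and a candidate summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Clean-up operations are continuing after flooding caused by Storm Frank."
candidate = "Clean-up operations continue after Storm Frank caused flooding."
print(scorer.score(reference, candidate)["rougeL"])  # precision, recall, fmeasure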
We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can be easily done with the load_dataset function.
from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
The dataset has the following fields:
print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})
We will now see what the data looks like:
print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}
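As an optional sanity check (not part of the original tutorial), we can look at rough word counts on a small sample to see why the MAX_INPUT_LENGTH and MAX_TARGET_LENGTH values defined above are reasonable:

# Optional: rough word-length statistics on a 1,000-example sample.
sample = raw_datasets.select(range(1000))
doc_lens = [len(doc.split()) for doc in sample["document"]]
sum_lens = [len(summ.split()) for summ in sample["summary"]]
print("Median document length (words):", int(np.median(doc_lens)))
print("Median summary length (words):", int(np.median(sum_lens)))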
For the sake of demonstrating the workflow, in this notebook we will only take small stratified, balanced splits (10%) of the train set as our train and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column to stratify on.
raw_datasets = raw_datasets.train_test_split(
train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in a format the model expects, and generate the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT we declared earlier, and we will just pass that.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
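To get a feel for what the tokenizer returns, here is a small illustrative check (not part of the original tutorial) on a made-up input string:

# Illustrative only: tokenize a short made-up input and inspect the output.
sample_encoding = tokenizer("summarize: The weather in Scotland was severe this week.")
print(sample_encoding.keys())        # e.g. dict_keys(['input_ids', 'attention_mask'])
print(sample_encoding["input_ids"])  # token ids from the pretrained T5 vocabulary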
If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
We will write a simple function that helps us with the pre-processing and is compatible with Hugging Face Datasets. To summarize, our pre-processing function should tokenize the text dataset (inputs and targets) into its corresponding token IDs, add the task prefix to the inputs, and create the additional inputs the model needs, such as token_type_ids, attention_mask, etc.

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
To apply this function to all pairs of sentences in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in dataset, so our training and testing data will be pre-processed in one single command.
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
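As a quick check (not part of the original tutorial), we can inspect one pre-processed example to confirm the new columns that map added:

# Illustrative only: look at one pre-processed training example.
sample = tokenized_datasets["train"][0]
print(sample.keys())  # original columns plus input_ids, attention_mask and labels
print(len(sample["input_ids"]), len(sample["labels"]))  # capped at MAX_INPUT_LENGTH / MAX_TARGET_LENGTH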
Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading: 0%| | 0.00/231M [00:00<?, ?B/s]
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For training a sequence-to-sequence model, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. The return_tensors="tf" argument ensures that we get tf.Tensor objects back.
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
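To see what the collator does, here is an illustrative check (not part of the original tutorial) that collates two tokenized examples by hand; note that the labels are padded with -100, which the model's loss ignores:

# Illustrative only: collate two examples by hand and inspect the padded batch.
features = [
    {k: tokenized_datasets["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print({k: v.shape for k, v in batch.items()})  # everything is padded to the longest example in the batch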
Next we define our train and test sets with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which will help us integrate our dataset with the collator defined above. The method expects certain parameters: the columns that will form our final dataset, the batch_size, whether to shuffle the dataset, and our collate_fn, as seen in the code below.

Additionally, we also define a relatively smaller generation_dataset to calculate ROUGE scores on the fly while training.
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)
We will now define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
To evaluate our model on the fly while training, we will define metric_fn, which will calculate the ROUGE score between the ground-truth and the predictions.
import keras_hub
rouge_l = keras_hub.metrics.RougeL()
def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}
    return result
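As a small smoke test (not part of the original tutorial), we can call metric_fn on a tiny made-up prediction/label pair to see the dictionary it returns:

# Illustrative only: run metric_fn on a tiny made-up batch.
fake_predictions = tokenizer(["storm frank caused flooding"], return_tensors="np")["input_ids"]
fake_labels = tokenizer(["flooding caused by storm frank"], return_tensors="np")["input_ids"]
print(metric_fn((fake_predictions, fake_labels)))  # e.g. {'RougeL': <f1 score>}
rouge_l.reset_state()  # clear the metric's accumulated state before training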
Now we can finally start training our model!
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)
callbacks = [metric_callback]
# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.
2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065
<keras.callbacks.History at 0x7f1d002f9810>
For best results, we recommend training the model for at least 5 epochs on the entire training dataset!
We will now try to infer the model we trained on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides us with a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes in the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.
from transformers import pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")
summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]
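For comparison (not part of the original tutorial), we can print the human-written reference summary that the dataset provides for the same article:

# Illustrative only: the reference summary for the article we just summarized.
print(raw_datasets["test"][0]["summary"])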
Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:
model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")
This is how you can load the model in the future, once you have pushed it!
from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")