
Abstractive Summarization with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 using Hugging Face Transformers for abstractive summarization.

ⓘ This example uses Keras 2



Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges related to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization is also known as abstractive summarization and has attracted growing interest among researchers across disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. The Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture and pretrained on a multi-task mixture of unsupervised and supervised tasks, where each task is converted into a text-to-text format. T5 shows impressive results on a variety of sequence-to-sequence tasks (sequence in this notebook refers to text), such as summarization, translation, and more.

In this notebook, we will fine-tune a pretrained T5 model for the abstractive summarization task using Hugging Face Transformers, on the XSum dataset loaded from Hugging Face Datasets.


Setup

Installing the requirements

!pip install transformers==4.20.0
!pip install keras_hub==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Importing the necessary libraries

import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Define certain variables

# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

Load the dataset

We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary), usually written by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate our sequence-to-sequence abstractive summarization approach.
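
As a quick, illustrative sanity check of what ROUGE-L measures, the short snippet below scores a made-up candidate summary against a made-up reference using the RougeL metric from keras_hub (installed above, and used again later during training); the sentences and scores are purely for illustration.

import keras_hub

# Toy illustration: ROUGE-L scores a candidate summary against a reference
# based on their longest common subsequence of tokens.
rouge_l = keras_hub.metrics.RougeL()
reference = ["the cat sat on the mat"]
candidate = ["the cat lay on the mat"]
scores = rouge_l(reference, candidate)
print(scores["f1_score"])  # the metric also returns precision and recall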

We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can easily be done with the load_dataset function.

from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

  • document: the original BBC article to be summarized
  • summary: the single-sentence summary of the BBC article
  • id: ID of the document-summary pair

print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

Let's now take a look at what the data looks like.

print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}

To demonstrate the workflow, in this notebook we will only take small stratified, balanced splits (10%) of the train set as our training and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column to stratify on.

raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
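
The result is a DatasetDict with a train and a test split; a quick print is a handy sanity check (the exact num_rows depend on the TRAIN_TEST_SPLIT value chosen above).

# Sanity check: after the split we have separate "train" and "test" splits,
# each roughly 10% of the original training set.
print(raw_datasets)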

Data Pre-processing

Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in a format the model expects, and generates the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT we declared earlier, so we simply pass it.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
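
Before moving on, it can help to see what the tokenizer actually returns; the snippet below tokenizes a short made-up sentence (the exact token IDs depend on the checkpoint's vocabulary).

# Illustrative check on a made-up sentence: the tokenizer returns the token ids
# and the attention mask that the model expects.
sample = tokenizer("The quick brown fox jumps over the lazy dog.")
print(sample["input_ids"])
print(sample["attention_mask"])
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))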

If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple function that helps us with the pre-processing and is compatible with Hugging Face Datasets. In a nutshell, our pre-processing function should:

  • Tokenize the text dataset (inputs and targets) into their corresponding token IDs, which will be used for embedding look-up in T5
  • Add the prefix to the tokens
  • Create additional inputs for the model, such as token_type_ids, attention_mask, etc.

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the document-summary pairs in our dataset, we just use the map method of the dataset object we created earlier. This applies the function to all the elements of all the splits in the dataset, so our training and test data will be pre-processed in one single command.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
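
A quick look at one pre-processed example confirms that map added the tokenized input_ids, attention_mask, and labels columns alongside the original text fields (the exact lengths vary per article).

# Inspect one pre-processed training example: the original text columns are kept,
# and the tokenized columns used for training have been added by map().
example = tokenized_datasets["train"][0]
print(example.keys())
print(len(example["input_ids"]), len(example["labels"]))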

Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and the output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.

from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

For training a sequence-to-sequence model, we need a special kind of data collator, one that will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. Setting return_tensors='tf' ensures that we get tf.Tensor objects back.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
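
To see the collator in action, the optional snippet below pads two tokenized examples into a single batch; we only pass the tokenized columns, since the collator cannot pad the raw text fields, and the padded label positions are filled with -100 so that they are ignored by the loss.

# Optional: collate two tokenized examples into a single padded batch.
# Only the tokenized columns are passed, because the collator cannot pad raw strings.
features = [
    {key: tokenized_datasets["train"][i][key] for key in ["input_ids", "attention_mask", "labels"]}
    for i in range(2)
]
batch = data_collator(features)
print({key: value.shape for key, value in batch.items()})
print(batch["labels"])  # padded label positions are -100 and are ignored by the loss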

Next, we define our training and testing sets with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which helps us integrate our dataset with the collator defined above. The method expects certain parameters:

  • columns: the columns that will serve as our independent variables
  • batch_size: the batch size for training
  • shuffle: whether we want to shuffle the dataset
  • collate_fn: our collator function

Additionally, we also define a relatively smaller generation_dataset to calculate ROUGE scores on the fly during training.

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)
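
As a final check of the input pipeline, you can pull a single batch out of train_dataset; since we did not pass label_cols, each element should be a single dictionary of padded tensors whose first dimension is BATCH_SIZE (the exact structure may vary slightly across datasets versions).

# Peek at one training batch: a dictionary of padded tensors that is fed
# directly to model.fit(). The first dimension of each tensor is BATCH_SIZE.
for batch in train_dataset.take(1):
    print({key: value.shape for key, value in batch.items()})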

Building and Compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Training and Evaluating the model

To evaluate our model on the fly while training, we will define metric_fn, which calculates the ROUGE score between the ground-truth and the predictions.

import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

And now we can finally start training our model!

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.

2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065

<keras.callbacks.History at 0x7f1d002f9810>

For best results, we recommend training the model for at least 5 epochs on the entire training dataset!
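
If you would like to keep the fine-tuned weights around for later experiments without pushing them to the Hub just yet, one common pattern (shown here as a sketch; the directory name is arbitrary) is to save both the model and the tokenizer with save_pretrained:

# Optional: persist the fine-tuned model and tokenizer to a local directory.
# The directory name is arbitrary; both can be reloaded later with from_pretrained().
model.save_pretrained("t5-small-finetuned-xsum")
tokenizer.save_pretrained("t5-small-finetuned-xsum")

# Later, reload them from the same directory:
# model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small-finetuned-xsum")
# tokenizer = AutoTokenizer.from_pretrained("t5-small-finetuned-xsum")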


Inference

Now we will try out the model we trained on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides us with a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes in the trained model and the tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)

[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]
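
The same pipeline works on any text you pass in; below is an illustrative call on a short, made-up paragraph (the output will depend on your fine-tuned weights).

# Illustrative: summarize an arbitrary, made-up article with the same pipeline.
custom_article = (
    "The local council announced on Tuesday that the refurbished library will "
    "reopen next month after two years of renovation work funded by a community "
    "grant. Opening hours will be extended to seven days a week."
)
summarizer(
    custom_article,
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)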

Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:

model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")

After you push your model, this is how you can load it in the future!

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")