
Large-scale multi-label text classification

Authors: Sayak Paul, Soumik Rakshit
Date created: 2020/09/25
Last modified: 2020/12/23
Description: Implementing a large-scale multi-label text classification model.

ⓘ This example uses Keras 2



Introduction

In this example, we will build a multi-label text classifier to predict the subject areas of arXiv papers from their abstract bodies. This type of classifier can be useful for conference submission portals like OpenReview. Given a paper abstract, the portal could provide suggestions for which areas the paper would best belong to.

The dataset was collected using the arXiv Python library, which provides a wrapper around the original arXiv API. To learn more about the data collection process, please refer to this notebook. Additionally, you can also find the dataset on Kaggle.


Imports

from tensorflow.keras import layers
from tensorflow import keras
import tensorflow as tf

from sklearn.model_selection import train_test_split
from ast import literal_eval

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Perform exploratory data analysis

In this section, we first load the dataset into a pandas dataframe and then perform some basic exploratory data analysis (EDA).

arxiv_data = pd.read_csv(
    "https://github.com/soumik12345/multi-label-text-classification/releases/download/v0.2/arxiv_data.csv"
)
arxiv_data.head()
    titles                                              summaries                                           terms
0   Survey on Semantic Stereo Matching / Semantic ...  Stereo matching is one of the widely used tech...  ['cs.CV', 'cs.LG']
1   FUTURE-AI: Guiding Principles and Consensus Re...  The recent advancements in artificial intellig...  ['cs.CV', 'cs.AI', 'cs.LG']
2   Enforcing Mutual Consistency of Hard Regions f...  In this paper, we proposed a novel mutual cons...  ['cs.CV', 'cs.AI']
3   Parameter Decoupling Strategy for Semi-supervi...  Consistency training has proved to be an advan...  ['cs.CV']
4   Background-Foreground Segmentation for Interio...  To ensure safety in automated driving, the cor...  ['cs.CV', 'cs.LG']

Our text features are present in the summaries column and their corresponding labels are in terms. As you can notice, there are multiple categories associated with a particular entry.

print(f"There are {len(arxiv_data)} rows in the dataset.")
There are 51774 rows in the dataset.

Real-world data is noisy. One of the most commonly observed sources of noise is data duplication. Here we notice that our initial dataset has about 13k duplicate entries.

total_duplicate_titles = sum(arxiv_data["titles"].duplicated())
print(f"There are {total_duplicate_titles} duplicate titles.")
There are 12802 duplicate titles.

Before proceeding further, we drop these entries.

arxiv_data = arxiv_data[~arxiv_data["titles"].duplicated()]
print(f"There are {len(arxiv_data)} rows in the deduplicated dataset.")

# There are some terms with occurrence as low as 1.
print(sum(arxiv_data["terms"].value_counts() == 1))

# How many unique terms?
print(arxiv_data["terms"].nunique())
There are 38972 rows in the deduplicated dataset.
2321
3157

As observed above, out of 3,157 unique combinations of terms, 2,321 occur only once. To prepare our train, validation, and test sets with stratified sampling, we need to drop these rare terms.

# Filtering the rare terms.
arxiv_data_filtered = arxiv_data.groupby("terms").filter(lambda x: len(x) > 1)
arxiv_data_filtered.shape
(36651, 3)

Convert the string labels to lists of strings

The initial labels are represented as raw strings. Here we make them List[str] for a more compact representation.

arxiv_data_filtered["terms"] = arxiv_data_filtered["terms"].apply(
    lambda x: literal_eval(x)
)
arxiv_data_filtered["terms"].values[:5]
array([list(['cs.CV', 'cs.LG']), list(['cs.CV', 'cs.AI', 'cs.LG']),
       list(['cs.CV', 'cs.AI']), list(['cs.CV']),
       list(['cs.CV', 'cs.LG'])], dtype=object)

Use stratified splits because of class imbalance

The dataset has a class imbalance problem. So, to have a fair evaluation result, we need to ensure the datasets are sampled with stratification. To know more about different strategies for dealing with class imbalance, you can follow this tutorial. For an end-to-end demonstration of classification with imbalanced data, refer to Imbalanced classification: credit card fraud detection.

test_split = 0.1

# Initial train and test split.
train_df, test_df = train_test_split(
    arxiv_data_filtered,
    test_size=test_split,
    stratify=arxiv_data_filtered["terms"].values,
)

# Splitting the test set further into validation
# and new test sets.
val_df = test_df.sample(frac=0.5)
test_df.drop(val_df.index, inplace=True)

print(f"Number of rows in training set: {len(train_df)}")
print(f"Number of rows in validation set: {len(val_df)}")
print(f"Number of rows in test set: {len(test_df)}")
Number of rows in training set: 32985
Number of rows in validation set: 1833
Number of rows in test set: 1833

Multi-label binarization

Now we preprocess our labels using the StringLookup layer.

terms = tf.ragged.constant(train_df["terms"].values)
lookup = tf.keras.layers.StringLookup(output_mode="multi_hot")
lookup.adapt(terms)
vocab = lookup.get_vocabulary()


def invert_multi_hot(encoded_labels):
    """Reverse a single multi-hot encoded label to a tuple of vocab terms."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)


print("Vocabulary:\n")
print(vocab)
Vocabulary:
['[UNK]', 'cs.CV', 'cs.LG', 'stat.ML', 'cs.AI', 'eess.IV', 'cs.RO', 'cs.CL', 'cs.NE', 'cs.CR', 'math.OC', 'eess.SP', 'cs.GR', 'cs.SI', 'cs.MM', 'cs.SY', 'cs.IR', 'cs.MA', 'eess.SY', 'cs.HC', 'math.IT', 'cs.IT', 'cs.DC', 'cs.CY', 'stat.AP', 'stat.TH', 'math.ST', 'stat.ME', 'eess.AS', 'cs.SD', 'q-bio.QM', 'q-bio.NC', 'cs.DS', 'cs.GT', 'cs.CG', 'cs.SE', 'cs.NI', 'I.2.6', 'stat.CO', 'math.NA', 'cs.NA', 'physics.chem-ph', 'cs.DB', 'q-bio.BM', 'cs.PL', 'cs.LO', 'cond-mat.dis-nn', '68T45', 'math.PR', 'physics.comp-ph', 'I.2.10', 'cs.CE', 'cs.AR', 'q-fin.ST', 'cond-mat.stat-mech', '68T05', 'quant-ph', 'math.DS', 'physics.data-an', 'cs.CC', 'I.4.6', 'physics.soc-ph', 'physics.ao-ph', 'cs.DM', 'econ.EM', 'q-bio.GN', 'physics.med-ph', 'astro-ph.IM', 'I.4.8', 'math.AT', 'cs.PF', 'cs.FL', 'I.4', 'q-fin.TR', 'I.5.4', 'I.2', '68U10', 'hep-ex', 'cond-mat.mtrl-sci', '68T10', 'physics.optics', 'physics.geo-ph', 'physics.flu-dyn', 'math.CO', 'math.AP', 'I.4; I.5', 'I.4.9', 'I.2.6; I.2.8', '68T01', '65D19', 'q-fin.CP', 'nlin.CD', 'cs.MS', 'I.2.6; I.5.1', 'I.2.10; I.4; I.5', 'I.2.0; I.2.6', '68T07', 'q-fin.GN', 'cs.SC', 'cs.ET', 'K.3.2', 'I.2.8', '68U01', '68T30', 'q-fin.EC', 'q-bio.MN', 'econ.GN', 'I.4.9; I.5.4', 'I.4.5', 'I.2; I.5', 'I.2; I.4; I.5', 'I.2.6; I.2.7', 'I.2.10; I.4.8', '68T99', '68Q32', '68', '62H30', 'q-fin.RM', 'q-fin.PM', 'q-bio.TO', 'q-bio.OT', 'physics.bio-ph', 'nlin.AO', 'math.LO', 'math.FA', 'hep-ph', 'cond-mat.soft', 'I.4.6; I.4.8', 'I.4.4', 'I.4.3', 'I.4.0', 'I.2; J.2', 'I.2; I.2.6; I.2.7', 'I.2.7', 'I.2.6; I.5.4', 'I.2.6; I.2.9', 'I.2.6; I.2.7; H.3.1; H.3.3', 'I.2.6; I.2.10', 'I.2.6, I.5.4', 'I.2.1; J.3', 'I.2.10; I.5.1; I.4.8', 'I.2.10; I.4.8; I.5.4', 'I.2.10; I.2.6', 'I.2.1', 'H.3.1; I.2.6; I.2.7', 'H.3.1; H.3.3; I.2.6; I.2.7', 'G.3', 'F.2.2; I.2.7', 'E.5; E.4; E.2; H.1.1; F.1.1; F.1.3', '68Txx', '62H99', '62H35', '14J60 (Primary) 14F05, 14J26 (Secondary)']

Here we are separating the individual unique classes available from the label pool and then using this information to represent a given label set with 0's and 1's. Below is an example.

sample_label = train_df["terms"].iloc[0]
print(f"Original label: {sample_label}")

label_binarized = lookup([sample_label])
print(f"Label-binarized representation: {label_binarized}")
Original label: ['cs.LG', 'cs.CV', 'eess.IV']
Label-binarized representation: [[0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Data preprocessing and tf.data.Dataset objects

We first get percentile estimates of the sequence lengths. The purpose will be clear in a moment.

train_df["summaries"].apply(lambda x: len(x.split(" "))).describe()
count    32985.000000
mean       156.497105
std         41.528225
min          5.000000
25%        128.000000
50%        154.000000
75%        183.000000
max        462.000000
Name: summaries, dtype: float64

Notice that 50% of the abstracts have a length of 154 (you may get a different number based on the split). So, any number close to that value is a good enough approximation of the maximum sequence length.

Now, we implement utilities to prepare our datasets.

max_seqlen = 150
batch_size = 128
padding_token = "<pad>"
auto = tf.data.AUTOTUNE


def make_dataset(dataframe, is_train=True):
    labels = tf.ragged.constant(dataframe["terms"].values)
    label_binarized = lookup(labels).numpy()
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["summaries"].values, label_binarized)
    )
    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)

Now we can prepare the tf.data.Dataset objects.

train_dataset = make_dataset(train_df, is_train=True)
validation_dataset = make_dataset(val_df, is_train=False)
test_dataset = make_dataset(test_df, is_train=False)

Dataset preview

text_batch, label_batch = next(iter(train_dataset))

for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Abstract: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    print(" ")
Abstract: b"In this paper we show how using satellite images can improve the accuracy of\nhousing price estimation models. Using Los Angeles County's property assessment\ndataset, by transferring learning from an Inception-v3 model pretrained on\nImageNet, we could achieve an improvement of ~10% in R-squared score compared\nto two baseline models that only use non-image features of the house."
Label(s): ['cs.LG' 'stat.ML']

Abstract: b'Learning from data streams is an increasingly important topic in data mining,\nmachine learning, and artificial intelligence in general. A major focus in the\ndata stream literature is on designing methods that can deal with concept\ndrift, a challenge where the generating distribution changes over time. A\ngeneral assumption in most of this literature is that instances are\nindependently distributed in the stream. In this work we show that, in the\ncontext of concept drift, this assumption is contradictory, and that the\npresence of concept drift necessarily implies temporal dependence; and thus\nsome form of time series. This has important implications on model design and\ndeployment. We explore and highlight the these implications, and show that\nHoeffding-tree based ensembles, which are very popular for learning in streams,\nare not naturally suited to learning \\emph{within} drift; and can perform in\nthis scenario only at significant computational cost of destructive adaptation.\nOn the other hand, we develop and parameterize gradient-descent methods and\ndemonstrate how they can perform \\emph{continuous} adaptation with no explicit\ndrift-detection mechanism, offering major advantages in terms of accuracy and\nefficiency. As a consequence of our theoretical discussion and empirical\nobservations, we outline a number of recommendations for deploying methods in\nconcept-drifting streams.'
Label(s): ['cs.LG' 'stat.ML']

Abstract: b"As reinforcement learning (RL) achieves more success in solving complex\ntasks, more care is needed to ensure that RL research is reproducible and that\nalgorithms herein can be compared easily and fairly with minimal bias. RL\nresults are, however, notoriously hard to reproduce due to the algorithms'\nintrinsic variance, the environments' stochasticity, and numerous (potentially\nunreported) hyper-parameters. In this work we investigate the many issues\nleading to irreproducible research and how to manage those. We further show how\nto utilise a rigorous and standardised evaluation approach for easing the\nprocess of documentation, evaluation and fair comparison of different\nalgorithms, where we emphasise the importance of choosing the right measurement\nmetrics and conducting proper statistics on the results, for unbiased reporting\nof the results."
Label(s): ['cs.LG' 'stat.ML' 'cs.AI' 'cs.RO']

Abstract: b'Estimating dense correspondences between images is a long-standing image\nunder-standing task. Recent works introduce convolutional neural networks\n(CNNs) to extract high-level feature maps and find correspondences through\nfeature matching. However,high-level feature maps are in low spatial resolution\nand therefore insufficient to provide accurate and fine-grained features to\ndistinguish intra-class variations for correspondence matching. To address this\nproblem, we generate robust features by dynamically selecting features at\ndifferent scales. To resolve two critical issues in feature selection,i.e.,how\nmany and which scales of features to be selected, we frame the feature\nselection process as a sequential Markov decision-making process (MDP) and\nintroduce an optimal selection strategy using reinforcement learning (RL). We\ndefine an RL environment for image matching in which each individual action\neither requires new features or terminates the selection episode by referring a\nmatching score. Deep neural networks are incorporated into our method and\ntrained for decision making. Experimental results show that our method achieves\ncomparable/superior performance with state-of-the-art methods on three\nbenchmarks, demonstrating the effectiveness of our feature selection strategy.'
Label(s): ['cs.CV']

Abstract: b'Dense reconstructions often contain errors that prior work has so far\nminimised using high quality sensors and regularising the output. Nevertheless,\nerrors still persist. This paper proposes a machine learning technique to\nidentify errors in three dimensional (3D) meshes. Beyond simply identifying\nerrors, our method quantifies both the magnitude and the direction of depth\nestimate errors when viewing the scene. This enables us to improve the\nreconstruction accuracy.\n  We train a suitably deep network architecture with two 3D meshes: a\nhigh-quality laser reconstruction, and a lower quality stereo image\nreconstruction. The network predicts the amount of error in the lower quality\nreconstruction with respect to the high-quality one, having only view the\nformer through its input. We evaluate our approach by correcting\ntwo-dimensional (2D) inverse-depth images extracted from the 3D model, and show\nthat our method improves the quality of these depth reconstructions by up to a\nrelative 10% RMSE.'
Label(s): ['cs.CV' 'cs.RO']

Vectorization

Before we feed the data to our model, we need to vectorize it (represent it in a numerical form). For that purpose, we will use the TextVectorization layer. It can operate as a part of your main model so that the model is excluded from the core preprocessing logic. This greatly reduces the chances of training / serving skew during inference.

We first calculate the number of unique words present in the abstracts.

# Source: https://stackoverflow.com/a/18937309/7636462
vocabulary = set()
train_df["summaries"].str.lower().str.split().apply(vocabulary.update)
vocabulary_size = len(vocabulary)
print(vocabulary_size)
153338

Now we create our vectorization layer and map() it on the tf.data.Dataset objects created earlier.

text_vectorizer = layers.TextVectorization(
    max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf"
)

# `TextVectorization` layer needs to be adapted as per the vocabulary from our
# training set.
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))

train_dataset = train_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
validation_dataset = validation_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
test_dataset = test_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

A batch of raw text will first go through the TextVectorization layer and it will generate their integer representations. Internally, the TextVectorization layer will first create bi-grams out of the sequences and then represent them using TF-IDF. The output representations will then be passed to the shallow model responsible for text classification.

To learn more about other possible configurations with TextVectorizer, please consult the official documentation.

Note: Setting the max_tokens argument to a pre-calculated vocabulary size is not a requirement.
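
For instance, here is a minimal sketch of what the adapted layer yields for a single raw string (the input sentence is made up for illustration):

# Hypothetical input, just to inspect the layer's output.
sample_vector = text_vectorizer(["this paper proposes a deep learning method"])
print(sample_vector.shape)  # (1, vocabulary_size): one TF-IDF weighted vector
print(sample_vector.numpy()[0][:10])  # first few TF-IDF weights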


Create a text classification model

We will keep our model simple: it will be a small stack of fully-connected layers with ReLU as the non-linearity.

def make_model():
    shallow_mlp_model = keras.Sequential(
        [
            layers.Dense(512, activation="relu"),
            layers.Dense(256, activation="relu"),
            layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
        ]  # More on why "sigmoid" has been used here in a moment.
    )
    return shallow_mlp_model

Train the model

We will train our model using the binary crossentropy loss. This is because the labels are not disjoint: for a given abstract, we may have multiple categories. So, we will divide the prediction task into a series of multiple binary classification problems. This is also why we kept the activation function of the classification layer in our model as sigmoid. Researchers have used other combinations of loss function and activation function as well. For example, in Exploring the Limits of Weakly Supervised Pretraining, Mahajan et al. used the softmax activation function and cross-entropy loss to train their models.
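
To make the distinction concrete, below is a small sketch with made-up logits for three labels, contrasting the two activation choices:

# Hypothetical logits for three labels of a single sample.
logits = tf.constant([[2.0, -1.0, 0.5]])

# Sigmoid scores each label independently; the probabilities need not
# sum to 1, so several labels can be "on" at once.
print(tf.sigmoid(logits).numpy())  # [[0.88 0.27 0.62]]

# Softmax produces one mutually exclusive distribution that sums to 1.
print(tf.nn.softmax(logits).numpy())  # [[0.79 0.04 0.18]]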

There are several options of metrics that can be used in multi-label classification. To keep this code example narrow, we decided to use the binary accuracy metric. For an explanation of why this metric is used, we refer to this pull request. There are also other suitable metrics for multi-label classification, like F1 Score or Hamming loss.
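
For intuition, here is a quick sketch (with hypothetical targets and predictions for one sample) of what binary accuracy measures: the fraction of individual label slots that are correct after thresholding the predicted probabilities at 0.5.

# Hypothetical multi-hot target and predicted probabilities.
y_true = np.array([[0.0, 1.0, 1.0, 0.0]])
y_pred = np.array([[0.1, 0.9, 0.4, 0.2]])
print(np.mean((y_pred >= 0.5).astype("float64") == y_true))  # 0.75: 3 of 4 slots match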

epochs = 20

shallow_mlp_model = make_model()
shallow_mlp_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"]
)

history = shallow_mlp_model.fit(
    train_dataset, validation_data=validation_dataset, epochs=epochs
)


def plot_result(item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()


plot_result("loss")
plot_result("binary_accuracy")
Epoch 1/20
258/258 [==============================] - 87s 332ms/step - loss: 0.0326 - binary_accuracy: 0.9893 - val_loss: 0.0189 - val_binary_accuracy: 0.9943
Epoch 2/20
258/258 [==============================] - 100s 387ms/step - loss: 0.0033 - binary_accuracy: 0.9990 - val_loss: 0.0271 - val_binary_accuracy: 0.9940
Epoch 3/20
258/258 [==============================] - 99s 384ms/step - loss: 7.8393e-04 - binary_accuracy: 0.9999 - val_loss: 0.0328 - val_binary_accuracy: 0.9939
Epoch 4/20
258/258 [==============================] - 109s 421ms/step - loss: 3.0132e-04 - binary_accuracy: 1.0000 - val_loss: 0.0366 - val_binary_accuracy: 0.9939
Epoch 5/20
258/258 [==============================] - 105s 405ms/step - loss: 1.6006e-04 - binary_accuracy: 1.0000 - val_loss: 0.0399 - val_binary_accuracy: 0.9939
Epoch 6/20
258/258 [==============================] - 107s 414ms/step - loss: 1.2400e-04 - binary_accuracy: 1.0000 - val_loss: 0.0412 - val_binary_accuracy: 0.9939
Epoch 7/20
258/258 [==============================] - 110s 425ms/step - loss: 7.7131e-05 - binary_accuracy: 1.0000 - val_loss: 0.0439 - val_binary_accuracy: 0.9940
Epoch 8/20
258/258 [==============================] - 105s 405ms/step - loss: 5.5611e-05 - binary_accuracy: 1.0000 - val_loss: 0.0446 - val_binary_accuracy: 0.9940
Epoch 9/20
258/258 [==============================] - 103s 397ms/step - loss: 4.5994e-05 - binary_accuracy: 1.0000 - val_loss: 0.0454 - val_binary_accuracy: 0.9940
Epoch 10/20
258/258 [==============================] - 105s 405ms/step - loss: 3.5126e-05 - binary_accuracy: 1.0000 - val_loss: 0.0472 - val_binary_accuracy: 0.9939
Epoch 11/20
258/258 [==============================] - 109s 422ms/step - loss: 2.9927e-05 - binary_accuracy: 1.0000 - val_loss: 0.0466 - val_binary_accuracy: 0.9940
Epoch 12/20
258/258 [==============================] - 133s 516ms/step - loss: 2.5748e-05 - binary_accuracy: 1.0000 - val_loss: 0.0484 - val_binary_accuracy: 0.9940
Epoch 13/20
258/258 [==============================] - 129s 497ms/step - loss: 4.3529e-05 - binary_accuracy: 1.0000 - val_loss: 0.0500 - val_binary_accuracy: 0.9940
Epoch 14/20
258/258 [==============================] - 158s 611ms/step - loss: 8.1068e-04 - binary_accuracy: 0.9998 - val_loss: 0.0377 - val_binary_accuracy: 0.9936
Epoch 15/20
258/258 [==============================] - 144s 558ms/step - loss: 0.0016 - binary_accuracy: 0.9995 - val_loss: 0.0418 - val_binary_accuracy: 0.9935
Epoch 16/20
258/258 [==============================] - 131s 506ms/step - loss: 0.0018 - binary_accuracy: 0.9995 - val_loss: 0.0479 - val_binary_accuracy: 0.9931
Epoch 17/20
258/258 [==============================] - 127s 491ms/step - loss: 0.0012 - binary_accuracy: 0.9997 - val_loss: 0.0521 - val_binary_accuracy: 0.9931
Epoch 18/20
258/258 [==============================] - 153s 594ms/step - loss: 6.3144e-04 - binary_accuracy: 0.9998 - val_loss: 0.0549 - val_binary_accuracy: 0.9934
Epoch 19/20
258/258 [==============================] - 142s 550ms/step - loss: 3.1753e-04 - binary_accuracy: 0.9999 - val_loss: 0.0589 - val_binary_accuracy: 0.9934
Epoch 20/20
258/258 [==============================] - 153s 594ms/step - loss: 2.0258e-04 - binary_accuracy: 1.0000 - val_loss: 0.0585 - val_binary_accuracy: 0.9933

[Plot: training and validation loss over epochs]

[Plot: training and validation binary_accuracy over epochs]

While training, we notice an initial sharp fall in the loss followed by a gradual decay.

Evaluate the model

_, binary_acc = shallow_mlp_model.evaluate(test_dataset)
print(f"Categorical accuracy on the test set: {round(binary_acc * 100, 2)}%.")
15/15 [==============================] - 3s 196ms/step - loss: 0.0580 - binary_accuracy: 0.9933
Binary accuracy on the test set: 99.33%.

The trained model gives us an evaluation accuracy of ~99%.


Inference

An important feature of the preprocessing layers provided by Keras is that they can be included inside a tf.keras.Model. We will export an inference model by including the text_vectorizer layer on top of shallow_mlp_model. This will allow our inference model to directly operate on raw strings.

Note that during training, it is always preferable to use these preprocessing layers as a part of the data input pipeline rather than the model, to avoid surfacing bottlenecks for the hardware accelerators. This also allows for asynchronous data processing.

# Create a model for inference.
model_for_inference = keras.Sequential([text_vectorizer, shallow_mlp_model])

# Create a small dataset just for demoing inference.
inference_dataset = make_dataset(test_df.sample(100), is_train=False)
text_batch, label_batch = next(iter(inference_dataset))
predicted_probabilities = model_for_inference.predict(text_batch)

# Perform inference.
for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Abstract: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    top_3_labels = [
        x
        for _, x in sorted(
            zip(predicted_probabilities[i], lookup.get_vocabulary()),
            key=lambda pair: pair[0],
            reverse=True,
        )
    ][:3]
    print(f"Predicted Label(s): ({', '.join([label for label in top_3_labels])})")
    print(" ")
4/4 [==============================] - 0s 62ms/step
Abstract: b'We investigate the training of sparse layers that use different parameters\nfor different inputs based on hashing in large Transformer models.\nSpecifically, we modify the feedforward layer to hash to different sets of\nweights depending on the current token, over all tokens in the sequence. We\nshow that this procedure either outperforms or is competitive with\nlearning-to-route mixture-of-expert methods such as Switch Transformers and\nBASE Layers, while requiring no routing parameters or extra terms in the\nobjective function such as a load balancing loss, and no sophisticated\nassignment algorithm. We study the performance of different hashing techniques,\nhash sizes and input features, and show that balanced and random hashes focused\non the most local features work best, compared to either learning clusters or\nusing longer-range context. We show our approach works well both on large\nlanguage modeling and dialogue tasks, and on downstream fine-tuning tasks.'
Label(s): ['cs.LG' 'cs.CL']
Predicted Label(s): (cs.LG, cs.CL, stat.ML)

Abstract: b'We present the first method capable of photorealistically reconstructing\ndeformable scenes using photos/videos captured casually from mobile phones. Our\napproach augments neural radiance fields (NeRF) by optimizing an additional\ncontinuous volumetric deformation field that warps each observed point into a\ncanonical 5D NeRF. We observe that these NeRF-like deformation fields are prone\nto local minima, and propose a coarse-to-fine optimization method for\ncoordinate-based models that allows for more robust optimization. By adapting\nprinciples from geometry processing and physical simulation to NeRF-like\nmodels, we propose an elastic regularization of the deformation field that\nfurther improves robustness. We show that our method can turn casually captured\nselfie photos/videos into deformable NeRF models that allow for photorealistic\nrenderings of the subject from arbitrary viewpoints, which we dub "nerfies." We\nevaluate our method by collecting time-synchronized data using a rig with two\nmobile phones, yielding train/validation images of the same pose at different\nviewpoints. We show that our method faithfully reconstructs non-rigidly\ndeforming scenes and reproduces unseen views with high fidelity.'
Label(s): ['cs.CV' 'cs.GR']
Predicted Label(s): (cs.CV, cs.GR, cs.RO)

Abstract: b'We propose to jointly learn multi-view geometry and warping between views of\nthe same object instances for robust cross-view object detection. What makes\nmulti-view object instance detection difficult are strong changes in viewpoint,\nlighting conditions, high similarity of neighbouring objects, and strong\nvariability in scale. By turning object detection and instance\nre-identification in different views into a joint learning task, we are able to\nincorporate both image appearance and geometric soft constraints into a single,\nmulti-view detection process that is learnable end-to-end. We validate our\nmethod on a new, large data set of street-level panoramas of urban objects and\nshow superior performance compared to various baselines. Our contribution is\nthreefold: a large-scale, publicly available data set for multi-view instance\ndetection and re-identification; an annotation tool custom-tailored for\nmulti-view instance detection; and a novel, holistic multi-view instance\ndetection and re-identification method that jointly models geometry and\nappearance across views.'
Label(s): ['cs.CV' 'cs.LG' 'stat.ML']
Predicted Label(s): (cs.CV, cs.RO, cs.MM)

Abstract: b'Learning graph convolutional networks (GCNs) is an emerging field which aims\nat generalizing deep learning to arbitrary non-regular domains. Most of the\nexisting GCNs follow a neighborhood aggregation scheme, where the\nrepresentation of a node is recursively obtained by aggregating its neighboring\nnode representations using averaging or sorting operations. However, these\noperations are either ill-posed or weak to be discriminant or increase the\nnumber of training parameters and thereby the computational complexity and the\nrisk of overfitting. In this paper, we introduce a novel GCN framework that\nachieves spatial graph convolution in a reproducing kernel Hilbert space\n(RKHS). The latter makes it possible to design, via implicit kernel\nrepresentations, convolutional graph filters in a high dimensional and more\ndiscriminating space without increasing the number of training parameters. The\nparticularity of our GCN model also resides in its ability to achieve\nconvolutions without explicitly realigning nodes in the receptive fields of the\nlearned graph filters with those of the input graphs, thereby making\nconvolutions permutation agnostic and well defined. Experiments conducted on\nthe challenging task of skeleton-based action recognition show the superiority\nof the proposed method against different baselines as well as the related work.'
Label(s): ['cs.CV']
Predicted Label(s): (cs.LG, cs.CV, cs.NE)

Abstract: b'Recurrent meta reinforcement learning (meta-RL) agents are agents that employ\na recurrent neural network (RNN) for the purpose of "learning a learning\nalgorithm". After being trained on a pre-specified task distribution, the\nlearned weights of the agent\'s RNN are said to implement an efficient learning\nalgorithm through their activity dynamics, which allows the agent to quickly\nsolve new tasks sampled from the same distribution. However, due to the\nblack-box nature of these agents, the way in which they work is not yet fully\nunderstood. In this study, we shed light on the internal working mechanisms of\nthese agents by reformulating the meta-RL problem using the Partially\nObservable Markov Decision Process (POMDP) framework. We hypothesize that the\nlearned activity dynamics is acting as belief states for such agents. Several\nillustrative experiments suggest that this hypothesis is true, and that\nrecurrent meta-RL agents can be viewed as agents that learn to act optimally in\npartially observable environments consisting of multiple related tasks. This\nview helps in understanding their failure cases and some interesting\nmodel-based results reported in the literature.'
Label(s): ['cs.LG' 'cs.AI']
Predicted Label(s): (stat.ML, cs.LG, cs.AI)

The prediction results are not that great, but not below par for a simple model like ours. We can improve this performance with models that consider word order, like LSTMs, or even with models that use Transformers (Vaswani et al.).
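
As a hedged sketch of that direction (not trained here; it assumes a hypothetical integer-output vectorizer in place of the TF-IDF one used above, and the helper name make_sequence_model is ours), an order-aware variant could look like this:

# Assumes a vectorizer such as:
# layers.TextVectorization(max_tokens=vocabulary_size, output_mode="int",
#                          output_sequence_length=max_seqlen)
# so that inputs are padded sequences of integer token ids.
def make_sequence_model():
    return keras.Sequential(
        [
            layers.Embedding(vocabulary_size, 128),  # token ids -> dense vectors
            layers.Bidirectional(layers.LSTM(64)),   # order-aware encoder
            layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
        ]
    )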


Acknowledgements

We would like to thank Matt Watson for helping us tackle the multi-label binarization part and the inverse-transforming of the processed labels to their original form.

Thanks to Cingis Kratochvil for suggesting and extending this code example with the binary accuracy metric.