Structured data classification with FeatureSpace

Author: fchollet
Date created: 2022/11/09
Last modified: 2022/11/09
Description: Classify tabular data in a few lines of code.

ⓘ This example uses Keras 3

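Introduction

This example demonstrates how to do structured data classification (also known as tabular data classification), starting from a raw CSV file. Our data includes numerical features, integer categorical features, and string categorical features. We will use the keras.utils.FeatureSpace utility to index, preprocess, and encode our features.

The code is adapted from the example "Structured data classification from scratch". While that example managed its own low-level feature preprocessing and encoding with Keras preprocessing layers, here we delegate everything to FeatureSpace, which makes the workflow quick and simple.

The dataset

Our dataset is provided by the Cleveland Clinic Foundation for Heart Disease. It is a CSV file with 303 rows. Each row contains information about one patient (a sample), and each column describes an attribute of the patient (a feature). We use the features to predict whether a patient has heart disease (binary classification).

Here is the description of each feature: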

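Column	Description	Feature Type
Age	Age in years	Numerical
Sex	(1 = male; 0 = female)	Categorical
CP	Chest pain type (0, 1, 2, 3, 4)	Categorical
Trestbps	Resting blood pressure (in mm Hg)	Numerical
Chol	Serum cholesterol in mg/dl	Numerical
FBS	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)	Categorical
RestECG	Resting electrocardiogram results (0, 1, 2)	Categorical
Thalach	Maximum heart rate achieved	Numerical
Exang	Exercise-induced angina (1 = yes; 0 = no)	Categorical
Oldpeak	ST depression induced by exercise relative to rest	Numerical
Slope	Slope of the peak exercise ST segment	Numerical
CA	Number of major vessels (0-3) colored by fluoroscopy	Both numerical and categorical
Thal	3 = normal; 6 = fixed defect; 7 = reversible defect	Categorical
Target	Diagnosis of heart disease (1 = true; 0 = false)	Target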

Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import pandas as pd
import keras
from keras.utils import FeatureSpace

Preparing the data

Let's download the data and load it into a Pandas dataframe:

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

The dataset includes 303 samples with 14 columns per sample (13 features, plus the target label):

print(dataframe.shape)

(303, 14)

Here is a preview of a few samples:

dataframe.head()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	63	1	1	145	233	1	2	150	0	2.3	3	0	fixed	0
1	67	1	4	160	286	0	2	108	1	1.5	2	3	normal	1
2	67	1	4	120	229	0	2	129	1	2.6	2	2	reversible	0
3	37	1	3	130	250	0	0	187	0	3.5	3	0	normal	0
4	41	0	2	130	204	0	2	172	0	1.4	1	0	normal	0

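The last column, "target", indicates whether the patient has heart disease (1) or not (0).

Let's split the data into a training set and a validation set: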

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation

Let's generate tf.data.Dataset objects for each dataframe:

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

Each Dataset yields a tuple (input, target), where input is a dictionary of features and target is the value 0 or 1:

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'age': <tf.Tensor: shape=(), dtype=int64, numpy=65>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=138>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=282>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=174>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.4>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'normal'>}
Target: tf.Tensor(0, shape=(), dtype=int64)

Let's batch the datasets:

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
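
As a quick sanity check on the batching step, the arithmetic below works out how many batches each dataset yields. It is only a sketch: it assumes the 242/61 split printed earlier and that the final partial batch is kept, which is the default behavior of batch(). The variable names are illustrative.

```python
import math

batch_size = 32
num_train_samples = 242  # from the split printed above
num_val_samples = 61

# With the last partial batch kept, the batch count is a ceiling division.
train_batches = math.ceil(num_train_samples / batch_size)
val_batches = math.ceil(num_val_samples / batch_size)

print("train batches:", train_batches)
print("validation batches:", val_batches)
```

The training batch count is what appears as the "8/8" step counter in the training log later in this example.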

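Configuring a `FeatureSpace`

To configure how each feature should be preprocessed, we instantiate a keras.utils.FeatureSpace and pass it a dictionary that maps each feature name to a string describing the feature type.

We have a few "integer categorical" features such as "fbs", one "string categorical" feature ("thal"), and a few numerical features that we would like to normalize. The exception is "age", which we would like to discretize into a number of bins.

We also use the crosses argument to capture feature interactions for some categorical features, that is, to create additional features that represent value co-occurrences for these categorical features. You can compute feature crosses like this for arbitrary sets of categorical features, not just tuples of two features. Because the resulting co-occurrences are hashed into a fixed-size vector, you don't need to worry about whether the co-occurrence space is too large.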

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": "integer_categorical",
        "cp": "integer_categorical",
        "fbs": "integer_categorical",
        "restecg": "integer_categorical",
        "exang": "integer_categorical",
        "ca": "integer_categorical",
        # Categorical feature encoded as string
        "thal": "string_categorical",
        # Numerical features to discretize
        "age": "float_discretized",
        # Numerical features to normalize
        "trestbps": "float_normalized",
        "chol": "float_normalized",
        "thalach": "float_normalized",
        "oldpeak": "float_normalized",
        "slope": "float_normalized",
    },
    # We create additional features by hashing
    # value co-occurrences for the
    # following groups of categorical features.
    crosses=[("sex", "age"), ("thal", "ca")],
    # The hashing space for these co-occurrences
    # will be 32-dimensional.
    crossing_dim=32,
    # Our utility will one-hot encode all categorical
    # features and concat all features into a single
    # vector (one vector per sample).
    output_mode="concat",
)
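
To make the idea of hashing co-occurrences into a fixed-size vector more concrete, here is a minimal pure-Python sketch. It is not how FeatureSpace is implemented: the helper name hashed_cross_one_hot, the "_X_" separator, and the use of MD5 are all assumptions chosen for illustration. The point is only that any tuple of values, however many combinations exist, lands in one of num_bins slots.

```python
import hashlib


def hashed_cross_one_hot(values, num_bins):
    """Hash a tuple of categorical values into a one-hot vector of length num_bins.

    Hypothetical helper for illustration; the hash function is an assumption.
    """
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    # Use a stable hash: Python's built-in hash() is salted per process for strings.
    bucket = int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_bins
    vector = [0.0] * num_bins
    vector[bucket] = 1.0
    return bucket, vector


# The same co-occurrence always maps to the same slot.
bucket_a, vec_a = hashed_cross_one_hot(("fixed", 0), num_bins=32)
bucket_b, vec_b = hashed_cross_one_hot(("fixed", 0), num_bins=32)

# Crossing three values works the same way; the output width does not grow.
bucket_c, vec_c = hashed_cross_one_hot(("normal", 3, 1), num_bins=32)

print(bucket_a == bucket_b, len(vec_a), len(vec_c))
```

The trade-off is that distinct co-occurrences can collide in the same slot; a larger hashing dimension makes collisions less likely at the cost of a wider feature vector.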

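Further customizing a `FeatureSpace`

Specifying the feature type via a string name is quick and easy, but sometimes you may want to further configure the preprocessing of each feature. For instance, in our case, our categorical features don't have a large set of possible values: each has only a handful (e.g. 1 and 0 for the feature "fbs"), and all possible values are represented in the training set. As a result, we don't need to reserve an index to represent "out of vocabulary" values for these features, which would have been the default behavior. Below, we simply specify num_oov_indices=0 in each of these features to tell the feature preprocessor to skip "out of vocabulary" indexing.

Other customizations available to you include specifying the number of bins for features of type "float_discretized", or the dimensionality of the hashing space for feature crosses.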

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": FeatureSpace.integer_categorical(num_oov_indices=0),
        "cp": FeatureSpace.integer_categorical(num_oov_indices=0),
        "fbs": FeatureSpace.integer_categorical(num_oov_indices=0),
        "restecg": FeatureSpace.integer_categorical(num_oov_indices=0),
        "exang": FeatureSpace.integer_categorical(num_oov_indices=0),
        "ca": FeatureSpace.integer_categorical(num_oov_indices=0),
        # Categorical feature encoded as string
        "thal": FeatureSpace.string_categorical(num_oov_indices=0),
        # Numerical features to discretize
        "age": FeatureSpace.float_discretized(num_bins=30),
        # Numerical features to normalize
        "trestbps": FeatureSpace.float_normalized(),
        "chol": FeatureSpace.float_normalized(),
        "thalach": FeatureSpace.float_normalized(),
        "oldpeak": FeatureSpace.float_normalized(),
        "slope": FeatureSpace.float_normalized(),
    },
    # Specify feature cross with a custom crossing dim.
    crosses=[
        FeatureSpace.cross(feature_names=("sex", "age"), crossing_dim=64),
        FeatureSpace.cross(
            feature_names=("thal", "ca"),
            crossing_dim=16,
        ),
    ],
    output_mode="concat",
)
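
The effect of num_oov_indices on the encoded width can be sketched in plain Python. This is an illustration under stated assumptions, not the lookup logic Keras uses: build_index and one_hot are hypothetical helpers, and the ordering of vocabulary entries here (sorted) is arbitrary.

```python
def build_index(vocabulary, num_oov_indices):
    """Map each known value to a slot, leaving the first slots for unseen values."""
    return {
        value: position + num_oov_indices
        for position, value in enumerate(sorted(vocabulary))
    }


def one_hot(value, index, num_oov_indices):
    """One-hot encode a value; unseen values go to slot 0 if an OOV slot exists."""
    width = len(index) + num_oov_indices
    vector = [0.0] * width
    if value in index:
        vector[index[value]] = 1.0
    elif num_oov_indices > 0:
        vector[0] = 1.0
    else:
        raise KeyError(f"{value!r} is not in the vocabulary and no OOV slot exists")
    return vector


fbs_values = {0, 1}

index_no_oov = build_index(fbs_values, num_oov_indices=0)
index_with_oov = build_index(fbs_values, num_oov_indices=1)

encoded_no_oov = one_hot(1, index_no_oov, num_oov_indices=0)
encoded_with_oov = one_hot(1, index_with_oov, num_oov_indices=1)
encoded_unseen = one_hot(7, index_with_oov, num_oov_indices=1)

print(encoded_no_oov, encoded_with_oov, encoded_unseen)
```

Dropping the OOV slot saves one dimension per categorical feature, which is only safe when every value seen at inference time also appeared in the training data.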

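Adapt the `FeatureSpace` to the training data

Before we start using the FeatureSpace to build a model, we have to adapt it to the training data. During adapt(), the FeatureSpace will:

Index the set of possible values for categorical features.
Compute the mean and variance of numerical features to normalize.
Compute the value boundaries of the bins for numerical features to discretize.

Note that adapt() should be called on a tf.data.Dataset that yields dicts of feature values, with no labels.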

train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)
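
The three kinds of statistics gathered during adaptation can be sketched with the standard library, using the five preview rows shown earlier as toy data. This is only an approximation of the idea: the variable names are illustrative, and the use of quantile cut points as bin boundaries is an assumption rather than a description of the exact algorithm FeatureSpace runs.

```python
import statistics

# Toy columns taken from the five preview rows shown earlier.
thal = ["fixed", "normal", "reversible", "normal", "normal"]
chol = [233, 286, 229, 250, 204]
ages = [63, 67, 67, 37, 41]

# 1. Categorical feature: index the set of observed values.
thal_index = {value: i for i, value in enumerate(sorted(set(thal)))}

# 2. Normalized feature: learn mean and variance, then standardize.
chol_mean = statistics.fmean(chol)
chol_variance = statistics.pvariance(chol)
chol_normalized = [(c - chol_mean) / chol_variance**0.5 for c in chol]

# 3. Discretized feature: learn bin boundaries (quantile cut points here).
num_bins = 3
age_boundaries = statistics.quantiles(ages, n=num_bins)  # num_bins - 1 cut points


def bucketize(value):
    """Return the bin index: the number of boundaries the value reaches."""
    return sum(value >= boundary for boundary in age_boundaries)


age_bins = [bucketize(age) for age in ages]

print(thal_index)
print(chol_normalized)
print(age_boundaries, age_bins)
```

All of these statistics come from the training data only, which is why the validation set is not passed to adapt().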

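At this point, the FeatureSpace can be called on a dict of raw feature values. It returns a single concatenated vector for each sample, combining the encoded features and the feature crosses.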

for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print("preprocessed_x.shape:", preprocessed_x.shape)
    print("preprocessed_x.dtype:", preprocessed_x.dtype)

preprocessed_x.shape: (32, 138)
preprocessed_x.dtype: <dtype: 'float32'>
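
The width of 138 can be partly accounted for from the configuration above. The arithmetic below is a sketch that assumes the discretized feature contributes one slot per bin, each normalized feature contributes a single scalar, and each cross contributes crossing_dim slots. Whatever remains belongs to the seven categorical features; their individual widths depend on how many distinct values each one has in the training data, so they are not broken down here.

```python
output_width = 138  # second dimension of preprocessed_x.shape above

age_slots = 30             # float_discretized(num_bins=30)
normalized_slots = 5       # trestbps, chol, thalach, oldpeak, slope
cross_slots = 64 + 16      # crossing_dim of the two feature crosses

accounted_for = age_slots + normalized_slots + cross_slots

# Remaining slots are shared by sex, cp, fbs, restecg, exang, ca and thal.
categorical_slots = output_width - accounted_for

print("accounted for:", accounted_for)
print("left for categorical features:", categorical_slots)
```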

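Two ways to manage preprocessing: as part of the `tf.data` pipeline, or in the model itself

There are two ways in which you can leverage your FeatureSpace:

Asynchronous preprocessing in `tf.data`

You can make it part of your data pipeline, before the model. This enables asynchronous parallel preprocessing of the data on CPU before it hits the model. Do this if you are training on GPU or TPU, or if you want to speed up preprocessing. Usually, this is the right thing to do during training.

Synchronous preprocessing in the model

You can make it part of your model. This means that the model will expect dicts of raw feature values, and the preprocessing of each batch will be done synchronously (in a blocking manner) before the rest of the forward pass. Do this if you want an end-to-end model that can process raw feature values. Keep in mind, however, that your model will only be able to run on CPU, since most types of feature preprocessing (e.g. string preprocessing) are not compatible with GPU or TPU.

Do not do this on GPU/TPU or in performance-sensitive settings. In general, you want to do in-model preprocessing when you run inference on CPU.

In our case, we will apply the FeatureSpace in the tf.data pipeline during training, but we will do inference with an end-to-end model that includes the FeatureSpace.

Let's create training and validation datasets of preprocessed batches: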

preprocessed_train_ds = train_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_train_ds = preprocessed_train_ds.prefetch(tf.data.AUTOTUNE)

preprocessed_val_ds = val_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_val_ds = preprocessed_val_ds.prefetch(tf.data.AUTOTUNE)

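Build a model

Time to build a model, or rather two models:

A training model that expects preprocessed features (one sample = one vector)
An inference model that expects raw features (one sample = dict of raw feature values)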

dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

x = keras.layers.Dense(32, activation="relu")(encoded_features)
x = keras.layers.Dropout(0.5)(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

training_model = keras.Model(inputs=encoded_features, outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

inference_model = keras.Model(inputs=dict_inputs, outputs=predictions)

Train the model

Let's train our model for 20 epochs. Note that feature preprocessing happens as part of the tf.data pipeline, not as part of the model.

training_model.fit(
    preprocessed_train_ds,
    epochs=20,
    validation_data=preprocessed_val_ds,
    verbose=2,
)

Epoch 1/20
8/8 - 3s - 352ms/step - accuracy: 0.5200 - loss: 0.7407 - val_accuracy: 0.6196 - val_loss: 0.6663
Epoch 2/20
8/8 - 0s - 20ms/step - accuracy: 0.5881 - loss: 0.6874 - val_accuracy: 0.7732 - val_loss: 0.6015
Epoch 3/20
8/8 - 0s - 19ms/step - accuracy: 0.6580 - loss: 0.6192 - val_accuracy: 0.7839 - val_loss: 0.5577
Epoch 4/20
8/8 - 0s - 19ms/step - accuracy: 0.7096 - loss: 0.5721 - val_accuracy: 0.7856 - val_loss: 0.5200
Epoch 5/20
8/8 - 0s - 18ms/step - accuracy: 0.7292 - loss: 0.5553 - val_accuracy: 0.7764 - val_loss: 0.4853
Epoch 6/20
8/8 - 0s - 19ms/step - accuracy: 0.7561 - loss: 0.5103 - val_accuracy: 0.7732 - val_loss: 0.4627
Epoch 7/20
8/8 - 0s - 19ms/step - accuracy: 0.7231 - loss: 0.5374 - val_accuracy: 0.7764 - val_loss: 0.4413
Epoch 8/20
8/8 - 0s - 19ms/step - accuracy: 0.7769 - loss: 0.4564 - val_accuracy: 0.7683 - val_loss: 0.4320
Epoch 9/20
8/8 - 0s - 18ms/step - accuracy: 0.7769 - loss: 0.4324 - val_accuracy: 0.7856 - val_loss: 0.4191
Epoch 10/20
8/8 - 0s - 19ms/step - accuracy: 0.7778 - loss: 0.4340 - val_accuracy: 0.7888 - val_loss: 0.4084
Epoch 11/20
8/8 - 0s - 19ms/step - accuracy: 0.7760 - loss: 0.4124 - val_accuracy: 0.7716 - val_loss: 0.3977
Epoch 12/20
8/8 - 0s - 19ms/step - accuracy: 0.7964 - loss: 0.4125 - val_accuracy: 0.7667 - val_loss: 0.3959
Epoch 13/20
8/8 - 0s - 18ms/step - accuracy: 0.8051 - loss: 0.3979 - val_accuracy: 0.7856 - val_loss: 0.3891
Epoch 14/20
8/8 - 0s - 19ms/step - accuracy: 0.8043 - loss: 0.3891 - val_accuracy: 0.7856 - val_loss: 0.3840
Epoch 15/20
8/8 - 0s - 18ms/step - accuracy: 0.8633 - loss: 0.3571 - val_accuracy: 0.7872 - val_loss: 0.3764
Epoch 16/20
8/8 - 0s - 19ms/step - accuracy: 0.8728 - loss: 0.3548 - val_accuracy: 0.7888 - val_loss: 0.3699
Epoch 17/20
8/8 - 0s - 19ms/step - accuracy: 0.8698 - loss: 0.3171 - val_accuracy: 0.7872 - val_loss: 0.3727
Epoch 18/20
8/8 - 0s - 18ms/step - accuracy: 0.8529 - loss: 0.3454 - val_accuracy: 0.7904 - val_loss: 0.3669
Epoch 19/20
8/8 - 0s - 17ms/step - accuracy: 0.8589 - loss: 0.3359 - val_accuracy: 0.7980 - val_loss: 0.3770
Epoch 20/20
8/8 - 0s - 17ms/step - accuracy: 0.8455 - loss: 0.3113 - val_accuracy: 0.8044 - val_loss: 0.3684

<keras.src.callbacks.history.History at 0x7f139bb4ed10>

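We quickly get to 80% validation accuracy.

Inference on new data with the end-to-end model

Now, we can use our inference model (which includes the FeatureSpace) to make predictions based on dicts of raw feature values, as follows: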

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)

print(
    f"This particular patient had a {100 * predictions[0][0]:.2f}% probability "
    "of having a heart disease, as evaluated by our model."
)

 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 273ms/step
This particular patient had a 43.13% probability of having a heart disease, as evaluated by our model.
