
Structured data preprocessing utilities


FeatureSpace class

keras.utils.FeatureSpace(
    features,
    output_mode="concat",
    crosses=None,
    crossing_dim=32,
    hashing_dim=32,
    num_discretization_bins=32,
    name=None,
)

One-stop utility for preprocessing and encoding structured data.

Arguments

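  • features: Dict mapping the names of your features to their type specification, e.g. {"my_feature": "integer_categorical"} or {"my_feature": FeatureSpace.integer_categorical()}. For a complete list of all supported types, see the "Available feature types" section below.
  • output_mode: One of "concat" or "dict". In concat mode, all features are concatenated into a single vector. In dict mode, the FeatureSpace returns a dict of individually encoded features (with the same keys as the input dict).
  • crosses: List of features to be crossed together, e.g. crosses=[("feature_1", "feature_2")]. The features are "crossed" by hashing their combined value into a fixed-length vector.
  • crossing_dim: Default vector size for hashing crossed features. Defaults to 32.
  • hashing_dim: Default vector size for hashing features of type "integer_hashed" and "string_hashed". Defaults to 32.
  • num_discretization_bins: Default number of bins used to discretize features of type "float_discretized". Defaults to 32.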
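
The crossing described for crosses and crossing_dim can be illustrated with a small pure-Python sketch. It is only an illustration of the idea (combine the values, hash them, one-hot encode the bucket); the helper name hashed_cross, the "_X_" join, and the MD5 hash are assumptions made for this example and are not how Keras computes its hashes.

```python
import hashlib


def hashed_cross(values, crossing_dim=32):
    """Hash a tuple of feature values into one of `crossing_dim` buckets.

    Hypothetical helper for illustration only; Keras uses its own
    hashing scheme, so bucket numbers will differ from the real layer.
    """
    # Combine the individual values into a single key.
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    # Map the key to a stable integer, then to a bucket index.
    digest = hashlib.md5(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % crossing_dim
    # One-hot encode the bucket as a fixed-length vector.
    one_hot = [0.0] * crossing_dim
    one_hot[bucket] = 1.0
    return bucket, one_hot


bucket, vector = hashed_cross(("zero", 0), crossing_dim=32)
print(bucket, len(vector))
```

Whatever the input values are, the resulting vector always has crossing_dim entries, which is why distinct combinations can collide in the same bucket when crossing_dim is small.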

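Available feature types

Note that all features can be referred to by their string name, e.g. "integer_categorical". When using the string name, the default argument values are used.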

# Plain float values.
FeatureSpace.float(name=None)

# Float values to be preprocessed via featurewise standardization
# (i.e. via a `keras.layers.Normalization` layer).
FeatureSpace.float_normalized(name=None)

# Float values to be preprocessed via linear rescaling
# (i.e. via a `keras.layers.Rescaling` layer).
FeatureSpace.float_rescaled(scale=1., offset=0., name=None)

# Float values to be discretized. By default, the discrete
# representation will then be one-hot encoded.
FeatureSpace.float_discretized(
    num_bins, bin_boundaries=None, output_mode="one_hot", name=None)

# Integer values to be indexed. By default, the discrete
# representation will then be one-hot encoded.
FeatureSpace.integer_categorical(
    max_tokens=None, num_oov_indices=1, output_mode="one_hot", name=None)

# String values to be indexed. By default, the discrete
# representation will then be one-hot encoded.
FeatureSpace.string_categorical(
    max_tokens=None, num_oov_indices=1, output_mode="one_hot", name=None)

# Integer values to be hashed into a fixed number of bins.
# By default, the discrete representation will then be one-hot encoded.
FeatureSpace.integer_hashed(num_bins, output_mode="one_hot", name=None)

# String values to be hashed into a fixed number of bins.
# By default, the discrete representation will then be one-hot encoded.
FeatureSpace.string_hashed(num_bins, output_mode="one_hot", name=None)
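
To make the three float transformations above more concrete, here is a rough pure-Python sketch of the arithmetic behind standardization, linear rescaling, and discretization followed by one-hot encoding. The function names are hypothetical, and details such as the epsilon guard and the treatment of values that fall exactly on a bin boundary are assumptions of this sketch, not a statement about the Keras layers.

```python
import bisect
import statistics


def normalize(values, epsilon=1e-7):
    """Featurewise standardization: subtract the mean, divide by the std."""
    mean = statistics.fmean(values)
    # Population standard deviation; epsilon avoids division by zero
    # for constant features (assumed guard, for illustration).
    std = max(statistics.pstdev(values), epsilon)
    return [(v - mean) / std for v in values]


def rescale(value, scale=1.0, offset=0.0):
    """Linear rescaling: value * scale + offset."""
    return value * scale + offset


def discretize_one_hot(value, bin_boundaries):
    """Find the bin of `value`, then one-hot encode the bin index.

    N sorted boundaries define N + 1 bins. In this sketch a value equal
    to a boundary is placed in the upper bin.
    """
    index = bisect.bisect_right(bin_boundaries, value)
    one_hot = [0.0] * (len(bin_boundaries) + 1)
    one_hot[index] = 1.0
    return index, one_hot


print(normalize([0.0, 0.1, 0.2, 0.3]))
print(rescale(128, scale=1.0 / 255))
print(discretize_one_hot(0.5, [0.0, 1.0, 2.0]))
```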

Examples

Basic usage with a dict of input data

import keras
import tensorflow as tf
from keras import layers
from keras.utils import FeatureSpace

raw_data = {
    "float_values": [0.0, 0.1, 0.2, 0.3],
    "string_values": ["zero", "one", "two", "three"],
    "int_values": [0, 1, 2, 3],
}
dataset = tf.data.Dataset.from_tensor_slices(raw_data)

feature_space = FeatureSpace(
    features={
        "float_values": "float_normalized",
        "string_values": "string_categorical",
        "int_values": "integer_categorical",
    },
    crosses=[("string_values", "int_values")],
    output_mode="concat",
)
# Before you start using the FeatureSpace,
# you must `adapt()` it on some data.
feature_space.adapt(dataset)

# You can call the FeatureSpace on a dict of data (batched or unbatched).
output_vector = feature_space(raw_data)

Basic usage with tf.data

# Unlabeled data
preprocessed_ds = unlabeled_dataset.map(feature_space)

# Labeled data
preprocessed_ds = labeled_dataset.map(lambda x, y: (feature_space(x), y))

Basic usage with the Keras Functional API

# Retrieve a dict of Keras Input objects
inputs = feature_space.get_inputs()
# Retrieve the corresponding encoded Keras tensors
encoded_features = feature_space.get_encoded_features()
# Build a Functional model
outputs = keras.layers.Dense(1, activation="sigmoid")(encoded_features)
model = keras.Model(inputs, outputs)

Customizing each feature or feature cross

feature_space = FeatureSpace(
    features={
        "float_values": FeatureSpace.float_normalized(),
        "string_values": FeatureSpace.string_categorical(max_tokens=10),
        "int_values": FeatureSpace.integer_categorical(max_tokens=10),
    },
    crosses=[
        FeatureSpace.cross(("string_values", "int_values"), crossing_dim=32)
    ],
    output_mode="concat",
)
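
As a rough illustration of what max_tokens means for the categorical features above, the following pure-Python sketch builds a frequency-ordered vocabulary and maps unseen values to an out-of-vocabulary (OOV) index. The helper names are hypothetical, and the sketch assumes a single OOV slot at index 0 that counts toward max_tokens; check the lookup layer documentation for the exact Keras behavior.

```python
from collections import Counter


def build_vocabulary(values, max_tokens=None):
    """Map the most frequent values to indices 1, 2, ...

    Index 0 is reserved for out-of-vocabulary values and, in this
    sketch, counts toward `max_tokens` (an assumption for illustration).
    """
    tokens = [token for token, _ in Counter(values).most_common()]
    if max_tokens is not None:
        # Keep room for the single OOV slot.
        tokens = tokens[: max(0, max_tokens - 1)]
    return {token: index + 1 for index, token in enumerate(tokens)}


def lookup(vocabulary, value):
    """Return the index of `value`, or 0 if it is out of vocabulary."""
    return vocabulary.get(value, 0)


vocabulary = build_vocabulary(["a", "a", "a", "b", "b", "c"], max_tokens=3)
print(vocabulary)
print(lookup(vocabulary, "a"), lookup(vocabulary, "c"))
```

With max_tokens=3, only the two most frequent values keep their own index; the rarer value "c" shares the OOV index with anything never seen during adaptation.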

Returning a dict of integer-encoded features

feature_space = FeatureSpace(
    features={
        "string_values": FeatureSpace.string_categorical(output_mode="int"),
        "int_values": FeatureSpace.integer_categorical(output_mode="int"),
    },
    crosses=[
        FeatureSpace.cross(
            feature_names=("string_values", "int_values"),
            crossing_dim=32,
            output_mode="int",
        )
    ],
    output_mode="dict",
)
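
The difference between the "concat" and "dict" output modes can be pictured with a small pure-Python sketch. The assemble helper and the sample encoded values are hypothetical, and the concatenation order (dict insertion order) is an assumption of this sketch rather than a documented guarantee.

```python
def assemble(encoded, output_mode="concat"):
    """Combine per-feature encodings the way the two output modes describe.

    `encoded` maps feature (or cross) names to already-encoded vectors.
    """
    if output_mode == "dict":
        # One entry per feature, keyed by name.
        return dict(encoded)
    if output_mode == "concat":
        # A single flat vector holding every feature's values.
        vector = []
        for feature_vector in encoded.values():
            vector.extend(feature_vector)
        return vector
    raise ValueError(f"Unknown output_mode: {output_mode!r}")


# Made-up encodings for one sample, for illustration only.
encoded = {
    "float_values": [0.5],
    "string_values": [0.0, 1.0, 0.0],
    "string_values_X_int_values": [0.0, 0.0, 1.0, 0.0],
}

print(assemble(encoded, "concat"))
print(sorted(assemble(encoded, "dict")))
```

Dict mode is convenient when each feature feeds a different branch of a model, for example an embedding per integer-encoded feature, while concat mode yields a single vector that can go straight into a dense layer.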

Specifying your own Keras preprocessing layer

# Let's say that one of the features is a short text paragraph that
# we want to encode as a vector (one vector per paragraph) via TF-IDF.
data = {
    "text": ["1st string", "2nd string", "3rd string"],
}

# There's a Keras layer for this: TextVectorization.
custom_layer = layers.TextVectorization(output_mode="tf_idf")

# We can use FeatureSpace.feature to create a custom feature
# that will use our preprocessing layer.
feature_space = FeatureSpace(
    features={
        "text": FeatureSpace.feature(
            preprocessor=custom_layer, dtype="string", output_mode="float"
        ),
    },
    output_mode="concat",
)
feature_space.adapt(tf.data.Dataset.from_tensor_slices(data))
output_vector = feature_space(data)

Retrieving the underlying Keras preprocessing layers

# The preprocessing layer of each feature is available in `.preprocessors`.
preprocessing_layer = feature_space.preprocessors["feature1"]

# The crossing layer of each feature cross is available in `.crossers`.
# It's an instance of keras.layers.HashedCrossing.
crossing_layer = feature_space.crossers["feature1_X_feature2"]

Saving and reloading a FeatureSpace

feature_space.save("featurespace.keras")
reloaded_feature_space = keras.models.load_model("featurespace.keras")