
TextVectorization layer

`TextVectorization` class

keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    encoding="utf-8",
    name=None,
    **kwargs
)

A preprocessing layer that maps text features to integer sequences.

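This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one example = one string) into either a list of token indices (one example = 1D tensor of integer token indices) or a dense representation (one example = 1D tensor of float values representing data about the example's tokens). This layer is meant to handle natural language inputs. To handle simple string inputs (categorical strings or pre-tokenized strings), see keras.layers.StringLookup.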

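The vocabulary for the layer must be either supplied on construction or learned via adapt(). When this layer is adapted, it analyzes the dataset, determines the frequency of individual string values, and creates a vocabulary from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms are used to create the vocabulary.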
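
The frequency-based vocabulary building described above can be sketched in plain Python. This is only an illustration of the idea, not the layer's implementation: the helper name build_vocabulary is hypothetical, the whitespace tokenization is simplified, and the alphabetical tie-breaking between equally frequent tokens is an assumption (the real layer may order ties differently). The reserved slots follow the effective-token formula given for max_tokens, and the '' and '[UNK]' entries follow the get_vocabulary() output shown in the examples.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens=None, output_mode="int"):
    """Hypothetical sketch of what adapt() conceptually does."""
    # Count how often each lowercased, whitespace-separated token occurs.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Most frequent first; ties are broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    if max_tokens is not None:
        # One slot is reserved for the OOV token, plus one for the
        # mask token when output_mode == "int".
        reserved = 1 + (1 if output_mode == "int" else 0)
        ranked = ranked[: max(max_tokens - reserved, 0)]
    special = ["", "[UNK]"] if output_mode == "int" else ["[UNK]"]
    return special + ranked


corpus = ["foo bar", "bar baz", "baz bada boom"]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)  # ['', '[UNK]', 'bar', 'baz']
```

With max_tokens=4 in "int" mode, only two slots remain for real tokens, so the three tokens that occur once are dropped and would map to the OOV index at lookup time.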

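The processing of each example contains the following steps:

1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually n-grams).
4. Index tokens (associate a unique integer value with each token).
5. Transform each example using this index, either into a vector of integers or a dense float vector.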
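
As a rough illustration of these steps, the following plain-Python sketch standardizes, splits, builds n-grams, and looks up indices. All function names are hypothetical, and the ordering of n-grams (all unigrams, then all bigrams, and so on) and the space used to join them are assumptions; the layer itself operates on tensors rather than Python lists.

```python
import string


def standardize(text):
    # Step 1: lowercase and strip ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def split(text):
    # Step 2: split on whitespace.
    return text.split()


def make_ngrams(words, ngrams):
    # Step 3: an integer n yields sizes 1..n, a tuple yields exactly
    # the listed sizes, and None leaves the words unchanged.
    if ngrams is None:
        return list(words)
    sizes = range(1, ngrams + 1) if isinstance(ngrams, int) else ngrams
    out = []
    for n in sizes:
        out.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return out


def vectorize(text, vocab, ngrams=None):
    # Steps 4 and 5: map each token to its index, unknown tokens to [UNK].
    index = {tok: i for i, tok in enumerate(vocab)}
    oov = index["[UNK]"]
    tokens = make_ngrams(split(standardize(text)), ngrams)
    return [index.get(tok, oov) for tok in tokens]


vocab = ["", "[UNK]", "earth", "wind", "and", "fire"]
print(vectorize("Earth, Wind & Fire!", vocab))  # [2, 3, 5]
print(vectorize("water and fire", vocab))       # [1, 4, 5]
```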

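Some notes on passing callables to customize splitting and standardization for this layer:

- Any callable can be passed to this layer, but if you want to serialize this object you should only pass functions that are registered as Keras serializable (see keras.saving.register_keras_serializable for more details).
- When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
- When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out: instead of [["string to split"], ["another string to split"]], the callable will see ["string to split", "another string to split"]. The callable should return a tf.Tensor of dtype string with the first dimension containing the split tokens; in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]].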
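
The shape contract for custom callables can be illustrated with nested Python lists standing in for string tensors. This is a sketch only: the helper names are hypothetical, and a real callable passed to the layer would receive and return tensors rather than lists.

```python
def lowercase_standardize(batch):
    # A custom standardize callable sees the data exactly as passed to
    # the layer and must return the same shape: here (batch, 1).
    return [[s.lower() for s in row] for row in batch]


def squeeze_inner_axis(batch):
    # Mimics the squeeze applied before a custom split callable is called:
    # shape (batch, 1) becomes shape (batch,).
    return [row[0] for row in batch]


def comma_split(strings):
    # A custom split callable receives the squeezed strings and returns
    # one list of tokens per input string.
    return [s.split(",") for s in strings]


batch = [["String,To,Split"], ["Another,String,To,Split"]]
standardized = lowercase_standardize(batch)
tokens = comma_split(squeeze_inner_axis(standardized))
print(tokens)
# [['string', 'to', 'split'], ['another', 'string', 'to', 'split']]
```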

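Note: This layer uses TensorFlow internally. It cannot be used as part of the compiled computation graph of a model with any backend other than TensorFlow. It can, however, be used with any backend when running eagerly. It can also always be used as part of an input preprocessing pipeline with any backend (outside the model itself), which is how we recommend using this layer.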

Note: This layer is safe to use inside a tf.data pipeline (independently of which backend you're using).

Arguments

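max_tokens: Maximum size of the vocabulary for this layer. This should only be specified when adapting a vocabulary or when setting pad_to_max_tokens=True. Note that this vocabulary contains 1 OOV token, so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0)).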
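standardize: Optional specification for the standardization to apply to the input text. Values can be:
- None: No standardization.
- "lower_and_strip_punctuation": Text will be lowercased and all punctuation removed.
- "lower": Text will be lowercased.
- "strip_punctuation": All punctuation will be removed.
- Callable: Inputs will be passed to the callable, which should standardize them and return the result.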
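split: Optional specification for splitting the input text. Values can be:
- None: No splitting.
- "whitespace": Split on whitespace.
- "character": Split on each Unicode character.
- Callable: Standardized inputs will be passed to the callable, which should split them and return the result.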
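ngrams: Optional specification for n-grams to create from the possibly split input text. Values can be None, an integer, or a tuple of integers; passing an integer creates n-grams up to that integer, and passing a tuple of integers creates n-grams for the specified values in the tuple. Passing None means that no n-grams will be created.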
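output_mode: Optional specification for the output of the layer. Values can be "int", "multi_hot", "count" or "tf_idf", configuring the layer as follows:
- "int": Outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocabulary size to max_tokens - 2 instead of max_tokens - 1.
- "multi_hot": Outputs a single int array per batch item, of size vocab_size or max_tokens, containing 1s at every index whose token appears at least once in that batch item.
- "count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.
- "tf_idf": Like "multi_hot", but the TF-IDF algorithm is applied to find the value in each token slot.
For "int" output, any shape of input and output is supported. For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.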
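output_sequence_length: Only valid in "int" mode. If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless of how many tokens resulted from the splitting step. Defaults to None. If ragged is True, then output_sequence_length may still truncate the output.
pad_to_max_tokens: Only valid in "multi_hot", "count", and "tf_idf" modes. If True, the output will have its feature axis padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens, resulting in a tensor of shape (batch_size, max_tokens) regardless of vocabulary size. Defaults to False.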
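vocabulary: Optional. Either an array of strings or a string path to a text file. If passing an array, you can pass a tuple, list, 1D NumPy array, or 1D tensor containing the string vocabulary terms. If passing a file path, the file should contain one line per term in the vocabulary. If this argument is set, there is no need to adapt() the layer.
idf_weights: Only valid when output_mode is "tf_idf". A tuple, list, 1D NumPy array, or 1D tensor of the same length as the vocabulary, containing the floating point inverse document frequency weights, which will be multiplied by per-sample term counts to obtain the final TF-IDF weight. If the vocabulary argument is set and output_mode is "tf_idf", this argument must be supplied.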
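ragged: Boolean. Only applicable to "int" output mode. Only supported with the TensorFlow backend. If True, returns a RaggedTensor instead of a dense Tensor, where each sequence may have a different length after string splitting. Defaults to False.
sparse: Boolean. Only applicable to "multi_hot", "count", and "tf_idf" output modes. Only supported with the TensorFlow backend. If True, returns a SparseTensor instead of a dense Tensor. Defaults to False.
encoding: Optional. The text encoding to use to interpret the input strings. Defaults to "utf-8".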
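
To make the differences between the output modes concrete, here is a plain-Python sketch that turns one example's token indices into each kind of output. The function name encode is hypothetical and works on lists instead of tensors; the "tf_idf" branch simply multiplies per-sample counts by the supplied idf_weights, as described for that argument, and does not show how the weights are learned during adapt().

```python
from collections import Counter


def encode(token_ids, vocab_size, output_mode="int",
           output_sequence_length=None, idf_weights=None):
    """Hypothetical sketch of the four output modes for one example."""
    if output_mode == "int":
        ids = list(token_ids)
        if output_sequence_length is not None:
            # Truncate, then pad with 0 (the index reserved for masking).
            ids = ids[:output_sequence_length]
            ids += [0] * (output_sequence_length - len(ids))
        return ids
    # The remaining modes produce one slot per vocabulary index.
    counts = Counter(token_ids)
    count_vec = [counts.get(i, 0) for i in range(vocab_size)]
    if output_mode == "count":
        return count_vec
    if output_mode == "multi_hot":
        return [1 if c > 0 else 0 for c in count_vec]
    if output_mode == "tf_idf":
        return [c * w for c, w in zip(count_vec, idf_weights)]
    raise ValueError(f"Unknown output_mode: {output_mode!r}")


print(encode([4, 1, 3], 7, output_sequence_length=4))        # [4, 1, 3, 0]
print(encode([1, 2, 5, 6, 3], 7, output_sequence_length=4))  # [1, 2, 5, 6]
print(encode([1, 3, 3], 4, output_mode="count"))             # [0, 1, 0, 2]
print(encode([1, 3, 3], 4, output_mode="multi_hot"))         # [0, 1, 0, 1]
print(encode([1, 3, 3], 4, output_mode="tf_idf",
             idf_weights=[1.0, 0.5, 2.0, 1.5]))              # [0.0, 0.5, 0.0, 3.0]
```

In this sketch, passing max_tokens as vocab_size plays the role of pad_to_max_tokens=True: the feature axis gets a fixed width even when fewer tokens are actually in the vocabulary.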

Examples

This example instantiates a TextVectorization layer that lowercases text, splits on whitespace, strips punctuation, and outputs integer vocabulary indices.

>>> import keras
>>> max_tokens = 5000  # Maximum vocab size.
>>> max_len = 4  # Sequence length to pad the outputs to.
>>> # Create the layer.
>>> vectorize_layer = keras.layers.TextVectorization(
...     max_tokens=max_tokens,
...     output_mode='int',
...     output_sequence_length=max_len)

>>> # Now that the vocab layer has been created, call `adapt` on the
>>> # list of strings to create the vocabulary.
>>> vectorize_layer.adapt(["foo bar", "bar baz", "baz bada boom"])

>>> # Now, the layer can map strings to integers -- you can use an
>>> # embedding layer to map these integers to learned embeddings.
>>> input_data = [["foo qux bar"], ["qux baz"]]
>>> vectorize_layer(input_data)
array([[4, 1, 3, 0],
       [1, 2, 0, 0]])

This example instantiates a TextVectorization layer by passing a list of vocabulary terms to the layer's __init__() method.

>>> vocab_data = ["earth", "wind", "and", "fire"]
>>> max_len = 4  # Sequence length to pad the outputs to.
>>> # Create the layer, passing the vocab directly. You can also pass the
>>> # vocabulary arg a path to a file containing one vocabulary word per
>>> # line.
>>> vectorize_layer = keras.layers.TextVectorization(
...     max_tokens=max_tokens,
...     output_mode='int',
...     output_sequence_length=max_len,
...     vocabulary=vocab_data)

>>> # Because we've passed the vocabulary directly, we don't need to adapt
>>> # the layer - the vocabulary is already set. The vocabulary contains the
>>> # padding token ('') and OOV token ('[UNK]')
>>> # as well as the passed tokens.
>>> vectorize_layer.get_vocabulary()
['', '[UNK]', 'earth', 'wind', 'and', 'fire']
