IMDB movie review sentiment classification dataset

`load_data` function

keras.datasets.imdb.load_data(
    path="imdb.npz",
    num_words=None,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3,
    **kwargs
)

Loads the IMDB dataset.

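This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by their overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".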
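
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The corpus, the alphabetical tie-breaking rule, and the names `word_rank` and `kept` are assumptions made for illustration; this is not how the real dataset was built.

```python
from collections import Counter

# Toy corpus standing in for tokenized reviews (hypothetical data).
reviews = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "bad"],
    ["the", "plot", "was", "great"],
]

# Count every word across the whole corpus.
counts = Counter(word for review in reviews for word in review)

# Rank words by descending frequency. Ties are broken alphabetically
# (an assumption) so the result is deterministic. Rank 1 = most frequent.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of frequency ranks.
encoded = [[word_rank[w] for w in review] for review in reviews]

# "Only consider the top 4 words, but eliminate the top 1" becomes a
# simple range check on the rank.
num_words, skip_top = 4, 1
kept = [w for w, r in word_rank.items() if skip_top < r <= num_words]

print(encoded[0])  # [1, 4, 2, 3]
print(kept)        # ['was', 'great', 'movie']
```

Because an index is also a frequency rank, vocabulary filtering needs no lookup table, only an integer comparison.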

As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.
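
Since 0 is reserved for padding, variable-length reviews can be brought to a common length without colliding with any word index. The helper below is a hypothetical pure-Python sketch that pads and trims on the left; in practice `keras.utils.pad_sequences` offers similar behavior.

```python
def pad_with_zeros(sequences, length):
    """Left-pad each sequence with 0 up to `length` (hypothetical helper).

    Sequences longer than `length` keep only their last `length` items.
    `length` is assumed to be a positive integer.
    """
    padded = []
    for seq in sequences:
        tail = list(seq)[-length:]  # trim from the front if too long
        padded.append([0] * (length - len(tail)) + tail)
    return padded


batch = pad_with_zeros([[1, 14, 22], [1, 5], [1, 7, 8, 9, 10]], length=4)
print(batch)  # [[0, 1, 14, 22], [0, 0, 1, 5], [7, 8, 9, 10]]
```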

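Arguments

path: where to cache the data (relative to ~/.keras/datasets).
num_words: integer or None. Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. If None, all words are kept. Defaults to None.
skip_top: skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. When 0, no words are skipped. Defaults to 0.
maxlen: int or None. Maximum sequence length. Any longer sequence will be truncated. None means no truncation. Defaults to None.
seed: int. Seed for reproducible data shuffling.
start_char: int. The start of a sequence will be marked with this character. 0 is usually the padding character. Defaults to 1.
oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
index_from: int. Index actual words with this index and higher.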
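
The interaction between index_from, start_char, num_words, skip_top, and oov_char can be sketched on a single toy sequence. The function `apply_load_options` and its input of raw 1-based frequency ranks are hypothetical. The sketch assumes the thresholds are compared against the already shifted indices, which matches the "num_words - 1" upper bound stated in this documentation, but it is not the library's actual code.

```python
def apply_load_options(ranks, num_words=None, skip_top=0,
                       start_char=1, oov_char=2, index_from=3):
    """Mimic on one toy sequence how the documented options shape the output.

    `ranks` holds raw 1-based frequency ranks (hypothetical input).
    """
    # Shift real words up so the low indices stay free for special tokens.
    seq = [r + index_from for r in ranks]
    # Mark the beginning of the sequence.
    if start_char is not None:
        seq = [start_char] + seq
    # Assumption: indices outside [skip_top, num_words) become oov_char.
    upper = float("inf") if num_words is None else num_words
    return [i if skip_top <= i < upper else oov_char for i in seq]


print(apply_load_options([1, 5, 20, 3]))                # [1, 4, 8, 23, 6]
print(apply_load_options([1, 5, 20, 3], num_words=10))  # [1, 4, 8, 2, 6]
```

With num_words=10, the word of rank 20 (shifted index 23) falls outside the kept vocabulary and is replaced by oov_char, while every remaining index stays below 10.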

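Returns

Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test).

x_train, x_test: lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1. If the maxlen argument was specified, the largest possible sequence length is maxlen.

y_train, y_test: lists of integer labels (1 or 0).

Note: The "out of vocabulary" character is only used for words that were present in the training set but are not included because they do not make the num_words cut here. Words that were not seen in the training set but appear in the test set have simply been skipped.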

`get_word_index` function

keras.datasets.imdb.get_word_index(path="imdb_word_index.json")

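Retrieves a dict mapping words to their index in the IMDB dataset.

Arguments

path: where to cache the data (relative to ~/.keras/datasets).

Returns

The word index dictionary. Keys are word strings, values are their index.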
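
The word index can also be used in the other direction, to encode new text so that it lines up with the sequences returned by `load_data`. The sketch below is hypothetical: `encode_review` is not part of Keras, the whitespace tokenization is a simplification, and `toy_word_index` is a small stand-in for the real dictionary with made-up values. Mapping unknown words to oov_char is also an assumption.

```python
def encode_review(text, word_index, num_words=None,
                  start_char=1, oov_char=2, index_from=3):
    """Turn raw text into an index sequence (hypothetical helper)."""
    seq = [start_char]
    for word in text.lower().split():
        rank = word_index.get(word)
        if rank is None:
            # Assumption: words missing from the index map to oov_char.
            seq.append(oov_char)
            continue
        # Apply the same offset that load_data applies to real words.
        index = rank + index_from
        if num_words is not None and index >= num_words:
            index = oov_char
        seq.append(index)
    return seq


# Toy stand-in for the dict returned by get_word_index() (made-up values).
toy_word_index = {"the": 1, "movie": 17, "was": 13, "great": 84}

encoded = encode_review("The movie was great", toy_word_index, num_words=50)
print(encoded)  # [1, 4, 20, 16, 2]
```

Here "great" has shifted index 87, which exceeds the num_words=50 cut, so it is encoded as oov_char.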

Example

import keras

# Use the default parameters to keras.datasets.imdb.load_data
start_char = 1
oov_char = 2
index_from = 3
# Retrieve the training sequences.
(x_train, _), _ = keras.datasets.imdb.load_data(
    start_char=start_char, oov_char=oov_char, index_from=index_from
)
# Retrieve the word index file mapping words to indices
word_index = keras.datasets.imdb.get_word_index()
# Reverse the word index to obtain a dict mapping indices to words
# And add `index_from` to indices to sync with `x_train`
inverted_word_index = dict(
    (i + index_from, word) for (word, i) in word_index.items()
)
# Update `inverted_word_index` to include `start_char` and `oov_char`
inverted_word_index[start_char] = "[START]"
inverted_word_index[oov_char] = "[OOV]"
# Decode the first sequence in the dataset
decoded_sequence = " ".join(inverted_word_index[i] for i in x_train[0])
