Author: Khalid Salama
Date created: 2020/12/30
Last modified: 2020/12/30
Description: Rating prediction using the Behavior Sequence Transformer (BST) model on the Movielens dataset.
This example demonstrates the Behavior Sequence Transformer (BST) model, by Qiwei Chen et al., using the Movielens dataset. The BST model leverages the sequential behaviour of users in watching and rating movies, as well as user profile and movie features, to predict the rating of the user for a target movie.
More precisely, the BST model aims to predict the rating of a target movie by accepting the following inputs:

1. A fixed-length sequence of movie_ids watched by a user.
2. A fixed-length sequence of the ratings for the movies watched by a user.
3. A set of user features, including user_id, sex, occupation, and age_group.
4. A set of genres for each movie in the input sequence and the target movie.
5. A target_movie_id for which to predict the rating.

This example modifies the original BST model in the following ways:

1. We incorporate the movie features (genres) into the processing of the embedding of each movie of the input sequence and the target movie, rather than treating them as "other features" outside the transformer layer.
2. We utilize the ratings of movies in the input sequence, along with their positions in the sequence, to update the movie embeddings before feeding them into the self-attention layer.

A sketch of a single model input is shown below.
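To make these inputs concrete, a single (unbatched) training instance could look like the hypothetical sketch below; the feature names match the model inputs defined later in this example, while the values are made up.

# Hypothetical example of a single training instance (values are illustrative).
example_instance = {
    "user_id": "user_42",
    "sequence_movie_ids": ["movie_10", "movie_25", "movie_7"],  # sequence_length - 1 movies
    "sequence_ratings": [4.0, 3.0, 5.0],  # ratings aligned with the movies above
    "target_movie_id": "movie_99",  # the movie whose rating we want to predict
    "sex": "M",
    "age_group": "group_25",
    "occupation": "occupation_4",
}
# The training label would be the user's rating of the target movie, e.g. 4.0.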
Note that this example should be run with TensorFlow 2.4 or higher.
We use the 1M version of the Movielens dataset. The dataset includes around 1 million ratings from 6,000 users on 4,000 movies, along with some user features and movie genres. In addition, the timestamp of each user-movie rating is provided, which allows creating sequences of movie ratings for each user, as expected by the BST model.
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import math
from zipfile import ZipFile
from urllib.request import urlretrieve
import keras
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import layers
from keras.layers import StringLookup
First, let's download the movielens data.

The downloaded folder will contain three data files: users.dat, movies.dat, and ratings.dat.
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")
ZipFile("movielens.zip", "r").extractall()
Then, we load the data into pandas DataFrames with their proper column names.
users = pd.read_csv(
"ml-1m/users.dat",
sep="::",
names=["user_id", "sex", "age_group", "occupation", "zip_code"],
encoding="ISO-8859-1",
engine="python",
)
ratings = pd.read_csv(
"ml-1m/ratings.dat",
sep="::",
names=["user_id", "movie_id", "rating", "unix_timestamp"],
encoding="ISO-8859-1",
engine="python",
)
movies = pd.read_csv(
"ml-1m/movies.dat",
sep="::",
names=["movie_id", "title", "genres"],
encoding="ISO-8859-1",
engine="python",
)
Here, we do some simple data processing to fix the data types of the columns.
users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}")
users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}")
users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}")
movies["movie_id"] = movies["movie_id"].apply(lambda x: f"movie_{x}")
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}")
ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}")
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))
Each movie has multiple genres. We split them into separate columns in the movies DataFrame.
genres = ["Action", "Adventure", "Animation", "Children's", "Comedy", "Crime"]
genres += ["Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical"]
genres += ["Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
for genre in genres:
movies[genre] = movies["genres"].apply(
lambda values: int(genre in values.split("|"))
)
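As a quick sanity check, each new genre column should hold a 0/1 indicator for its genre:

# Each genre column is a 0/1 indicator for that movie.
print(movies[["title", "Action", "Comedy", "Drama"]].head(3))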
First, let's sort the ratings data using the unix_timestamp, and then group the movie_id values and the rating values by user_id.

The output DataFrame will have a record for each user_id, with two ordered lists (sorted by rating datetime): the movies they have rated, and their ratings of these movies.
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")
ratings_data = pd.DataFrame(
data={
"user_id": list(ratings_group.groups.keys()),
"movie_ids": list(ratings_group.movie_id.apply(list)),
"ratings": list(ratings_group.rating.apply(list)),
"timestamps": list(ratings_group.unix_timestamp.apply(list)),
}
)
Now, let's split the movie_ids list into a set of sequences of a fixed length. We do the same for the ratings. Set the sequence_length variable to change the length of the input sequence to the model. You can also change the step_size to control the number of sequences to generate for each user.
sequence_length = 4
step_size = 2
def create_sequences(values, window_size, step_size):
sequences = []
start_index = 0
while True:
end_index = start_index + window_size
seq = values[start_index:end_index]
if len(seq) < window_size:
seq = values[-window_size:]
if len(seq) == window_size:
sequences.append(seq)
break
sequences.append(seq)
start_index += step_size
return sequences
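Here is a small, hypothetical demonstration of create_sequences. Note that when the remaining values at the end are shorter than the window, the function falls back to the last window_size values, so every generated sequence has full length.

# Illustrative only: windows of length 4 with a step of 2 over 7 values.
print(create_sequences([1, 2, 3, 4, 5, 6, 7], window_size=4, step_size=2))
# [[1, 2, 3, 4], [3, 4, 5, 6], [4, 5, 6, 7]]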
ratings_data.movie_ids = ratings_data.movie_ids.apply(
lambda ids: create_sequences(ids, sequence_length, step_size)
)
ratings_data.ratings = ratings_data.ratings.apply(
lambda ids: create_sequences(ids, sequence_length, step_size)
)
del ratings_data["timestamps"]
After that, we process the output to have each sequence in a separate record in the DataFrame. In addition, we join the user features with the ratings data.
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode(
"movie_ids", ignore_index=True
)
ratings_data_rating = ratings_data[["ratings"]].explode("ratings", ignore_index=True)
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating], axis=1)
ratings_data_transformed = ratings_data_transformed.join(
users.set_index("user_id"), on="user_id"
)
ratings_data_transformed.movie_ids = ratings_data_transformed.movie_ids.apply(
lambda x: ",".join(x)
)
ratings_data_transformed.ratings = ratings_data_transformed.ratings.apply(
lambda x: ",".join([str(v) for v in x])
)
del ratings_data_transformed["zip_code"]
ratings_data_transformed.rename(
columns={"movie_ids": "sequence_movie_ids", "ratings": "sequence_ratings"},
inplace=True,
)
With sequence_length of 4 and step_size of 2, we end up with 498,623 sequences.
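As a quick check, you can print the number of generated sequences:

print(len(ratings_data_transformed.index))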
Finally, we split the data into training and testing splits, with 85% and 15% of the instances, respectively, and store them to CSV files.
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85
train_data = ratings_data_transformed[random_selection]
test_data = ratings_data_transformed[~random_selection]
train_data.to_csv("train_data.csv", index=False, sep="|", header=False)
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)
CSV_HEADER = list(ratings_data_transformed.columns)
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
"user_id": list(users.user_id.unique()),
"movie_id": list(movies.movie_id.unique()),
"sex": list(users.sex.unique()),
"age_group": list(users.age_group.unique()),
"occupation": list(users.occupation.unique()),
}
USER_FEATURES = ["sex", "age_group", "occupation"]
MOVIE_FEATURES = ["genres"]
Next, we define a function that creates a tf.data.Dataset for training and evaluation from the CSV files.
def get_dataset_from_csv(csv_file_path, shuffle=False, batch_size=128):
def process(features):
movie_ids_string = features["sequence_movie_ids"]
sequence_movie_ids = tf.strings.split(movie_ids_string, ",").to_tensor()
# The last movie id in the sequence is the target movie.
features["target_movie_id"] = sequence_movie_ids[:, -1]
features["sequence_movie_ids"] = sequence_movie_ids[:, :-1]
ratings_string = features["sequence_ratings"]
sequence_ratings = tf.strings.to_number(
tf.strings.split(ratings_string, ","), tf.dtypes.float32
).to_tensor()
# The last rating in the sequence is the target for the model to predict.
target = sequence_ratings[:, -1]
features["sequence_ratings"] = sequence_ratings[:, :-1]
return features, target
dataset = tf.data.experimental.make_csv_dataset(
csv_file_path,
batch_size=batch_size,
column_names=CSV_HEADER,
num_epochs=1,
header=False,
field_delim="|",
shuffle=shuffle,
).map(process)
return dataset
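Optionally, you can pull a single batch to spot-check the pipeline; this inspection snippet is not part of the original example, but it should work with the CSV files produced above.

# Optional: inspect one batch to confirm the feature shapes.
features, target = next(iter(get_dataset_from_csv("train_data.csv", batch_size=2)))
for name, value in features.items():
    print(name, value.shape)
print("target", target.shape)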
def create_model_inputs():
return {
"user_id": keras.Input(name="user_id", shape=(1,), dtype="string"),
"sequence_movie_ids": keras.Input(
name="sequence_movie_ids", shape=(sequence_length - 1,), dtype="string"
),
"target_movie_id": keras.Input(
name="target_movie_id", shape=(1,), dtype="string"
),
"sequence_ratings": keras.Input(
name="sequence_ratings", shape=(sequence_length - 1,), dtype=tf.float32
),
"sex": keras.Input(name="sex", shape=(1,), dtype="string"),
"age_group": keras.Input(name="age_group", shape=(1,), dtype="string"),
"occupation": keras.Input(name="occupation", shape=(1,), dtype="string"),
}
The encode_input_features method works as follows:

1. Each categorical user feature is encoded using layers.Embedding, with the embedding dimension equal to the square root of the vocabulary size of the feature. The embeddings of these features are concatenated to form a single input tensor.
2. Each movie in the movie sequence and the target movie is encoded using layers.Embedding, where the dimension size is the square root of the number of movies.
3. A multi-hot genres vector for each movie is concatenated with its embedding vector, and processed using a non-linear layers.Dense to output a vector of the same movie embedding dimensions.
4. A positional embedding is added to each movie embedding in the sequence, and then multiplied by its rating from the ratings sequence.
5. The target movie embedding is concatenated to the sequence movie embeddings, producing a tensor with the shape of [batch size, sequence length, embedding size], as expected by the attention layer for the transformer architecture.
6. The method returns a tuple of two elements: encoded_transformer_features and encoded_other_features.
def encode_input_features(
inputs,
include_user_id=True,
include_user_features=True,
include_movie_features=True,
):
encoded_transformer_features = []
encoded_other_features = []
other_feature_names = []
if include_user_id:
other_feature_names.append("user_id")
if include_user_features:
other_feature_names.extend(USER_FEATURES)
## Encode user features
for feature_name in other_feature_names:
# Convert the string input values into integer indices.
vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
idx = StringLookup(vocabulary=vocabulary, mask_token=None, num_oov_indices=0)(
inputs[feature_name]
)
# Compute embedding dimensions
embedding_dims = int(math.sqrt(len(vocabulary)))
# Create an embedding layer with the specified dimensions.
embedding_encoder = layers.Embedding(
input_dim=len(vocabulary),
output_dim=embedding_dims,
name=f"{feature_name}_embedding",
)
# Convert the index values to embedding representations.
encoded_other_features.append(embedding_encoder(idx))
## Create a single embedding vector for the user features
if len(encoded_other_features) > 1:
encoded_other_features = layers.concatenate(encoded_other_features)
elif len(encoded_other_features) == 1:
encoded_other_features = encoded_other_features[0]
else:
encoded_other_features = None
## Create a movie embedding encoder
movie_vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
movie_embedding_dims = int(math.sqrt(len(movie_vocabulary)))
# Create a lookup to convert string values to integer indices.
movie_index_lookup = StringLookup(
vocabulary=movie_vocabulary,
mask_token=None,
num_oov_indices=0,
name="movie_index_lookup",
)
# Create an embedding layer with the specified dimensions.
movie_embedding_encoder = layers.Embedding(
input_dim=len(movie_vocabulary),
output_dim=movie_embedding_dims,
name=f"movie_embedding",
)
# Create a vector lookup for movie genres.
genre_vectors = movies[genres].to_numpy()
movie_genres_lookup = layers.Embedding(
input_dim=genre_vectors.shape[0],
output_dim=genre_vectors.shape[1],
embeddings_initializer=keras.initializers.Constant(genre_vectors),
trainable=False,
name="genres_vector",
)
# Create a processing layer for genres.
movie_embedding_processor = layers.Dense(
units=movie_embedding_dims,
activation="relu",
name="process_movie_embedding_with_genres",
)
## Define a function to encode a given movie id.
def encode_movie(movie_id):
# Convert the string input values into integer indices.
movie_idx = movie_index_lookup(movie_id)
movie_embedding = movie_embedding_encoder(movie_idx)
encoded_movie = movie_embedding
if include_movie_features:
movie_genres_vector = movie_genres_lookup(movie_idx)
encoded_movie = movie_embedding_processor(
layers.concatenate([movie_embedding, movie_genres_vector])
)
return encoded_movie
## Encoding target_movie_id
target_movie_id = inputs["target_movie_id"]
encoded_target_movie = encode_movie(target_movie_id)
## Encoding sequence movie_ids.
sequence_movies_ids = inputs["sequence_movie_ids"]
encoded_sequence_movies = encode_movie(sequence_movies_ids)
# Create positional embedding.
position_embedding_encoder = layers.Embedding(
input_dim=sequence_length,
output_dim=movie_embedding_dims,
name="position_embedding",
)
positions = tf.range(start=0, limit=sequence_length - 1, delta=1)
encoded_positions = position_embedding_encoder(positions)
# Retrieve sequence ratings to incorporate them into the encoding of the movie.
sequence_ratings = inputs["sequence_ratings"]
sequence_ratings = keras.ops.expand_dims(sequence_ratings, -1)
# Add the positional encoding to the movie encodings and multiply them by rating.
encoded_sequence_movies_with_position_and_rating = layers.Multiply()(
[(encoded_sequence_movies + encoded_positions), sequence_ratings]
)
# Construct the transformer inputs.
for i in range(sequence_length - 1):
feature = encoded_sequence_movies_with_position_and_rating[:, i, ...]
feature = keras.ops.expand_dims(feature, 1)
encoded_transformer_features.append(feature)
encoded_transformer_features.append(encoded_target_movie)
encoded_transformer_features = layers.concatenate(
encoded_transformer_features, axis=1
)
return encoded_transformer_features, encoded_other_features
include_user_id = False
include_user_features = False
include_movie_features = False
hidden_units = [256, 128]
dropout_rate = 0.1
num_heads = 3
def create_model():
inputs = create_model_inputs()
transformer_features, other_features = encode_input_features(
inputs, include_user_id, include_user_features, include_movie_features
)
# Create a multi-headed attention layer.
attention_output = layers.MultiHeadAttention(
num_heads=num_heads, key_dim=transformer_features.shape[2], dropout=dropout_rate
)(transformer_features, transformer_features)
# Transformer block.
attention_output = layers.Dropout(dropout_rate)(attention_output)
x1 = layers.Add()([transformer_features, attention_output])
x1 = layers.LayerNormalization()(x1)
x2 = layers.LeakyReLU()(x1)
x2 = layers.Dense(units=x2.shape[-1])(x2)
x2 = layers.Dropout(dropout_rate)(x2)
transformer_features = layers.Add()([x1, x2])
transformer_features = layers.LayerNormalization()(transformer_features)
features = layers.Flatten()(transformer_features)
# Included the other features.
if other_features is not None:
features = layers.concatenate(
[features, layers.Reshape([other_features.shape[-1]])(other_features)]
)
# Fully-connected layers.
for num_units in hidden_units:
features = layers.Dense(num_units)(features)
features = layers.BatchNormalization()(features)
features = layers.LeakyReLU()(features)
features = layers.Dropout(dropout_rate)(features)
outputs = layers.Dense(units=1)(features)
model = keras.Model(inputs=inputs, outputs=outputs)
return model
model = create_model()
# Compile the model.
model.compile(
optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
loss=keras.losses.MeanSquaredError(),
metrics=[keras.metrics.MeanAbsoluteError()],
)
# Read the training data.
train_dataset = get_dataset_from_csv("train_data.csv", shuffle=True, batch_size=265)
# Fit the model with the training data.
model.fit(train_dataset, epochs=5)
# Read the test data.
test_dataset = get_dataset_from_csv("test_data.csv", batch_size=265)
# Evaluate the model on the test data.
_, mae = model.evaluate(test_dataset, verbose=0)
print(f"Test MAE: {round(mae, 3)}")
Epoch 1/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 19s 11ms/step - loss: 1.5762 - mean_absolute_error: 0.9892
Epoch 2/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.1263 - mean_absolute_error: 0.8502
Epoch 3/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.0885 - mean_absolute_error: 0.8361
Epoch 4/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.0943 - mean_absolute_error: 0.8388
Epoch 5/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 10ms/step - loss: 1.0360 - mean_absolute_error: 0.8142
Test MAE: 0.782
You should achieve a Mean Absolute Error (MAE) at or around 0.7 on the test data.
The BST model uses the Transformer layer in its architecture to capture the sequential signals underlying users' behaviour sequences for recommendation.

You can try training this model with different configurations, for example, by increasing the input sequence length and training the model for a larger number of epochs. In addition, you can try including other features like movie release year and customer zip code, and including cross features like sex X genre; a sketch of one possible cross-feature encoding follows.
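As a minimal sketch of the cross-feature idea (sex X genre), one option in recent Keras versions is layers.HashedCrossing, which hashes the combination of two categorical inputs into a fixed number of bins. The snippet below is an assumption about how such a feature could be wired in, not part of the original model, and it assumes a hypothetical single-genre string input.

# A hedged sketch (not part of the original model): crossing "sex" with a
# hypothetical single-genre string feature via keras.layers.HashedCrossing.
sex_input = keras.Input(shape=(1,), dtype="string", name="sex_cross")
genre_input = keras.Input(shape=(1,), dtype="string", name="genre_cross")
crossed = layers.HashedCrossing(num_bins=64, output_mode="one_hot")(
    (sex_input, genre_input)
)
# The resulting one-hot vector could be appended to encoded_other_features.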