Author: achoum
Date created: 2024/02/01
Last modified: 2024/02/01
Description: Detecting payment card fraud with Temporian and a feed-forward neural network.
This notebook depends on Keras 3, Temporian, and a few other libraries. You can install them as follows:
pip install temporian keras pandas tf-nightly scikit-learn -U
import keras # To train the Machine Learning model
import temporian as tp # To convert transactions into tabular data
import numpy as np
import os
import pandas as pd
import datetime
import math
import tensorflow as tf
from sklearn.metrics import RocCurveDisplay
Payment fraud detection is critical for banks, businesses, and consumers. In Europe alone, fraudulent transactions were estimated at 1.89 billion EUR in 2019. Worldwide, approximately 3.6% of commercial revenue is lost to fraud. In this notebook, we train and evaluate a model to detect fraudulent transactions using the synthetic dataset attached to the book Reproducible Machine Learning for Credit Card Fraud Detection by Le Borgne et al.
Fraudulent transactions often cannot be detected by looking at transactions in isolation. Instead, fraud is detected by looking at patterns across multiple transactions from the same user, to the same merchant, or with other types of relationships. To express these relationships in a way that is understandable by a machine learning model, and to augment features with feature engineering, we use the Temporian preprocessing library.
We preprocess the transaction dataset into a tabular dataset and use a feed-forward neural network to learn the patterns of fraud and make predictions.
The dataset contains payment transactions sampled between April 1, 2018 and September 30, 2018. The transactions are stored in CSV files, one per day.
Note: Downloading the dataset takes about one minute.
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 9, 30)
# Load the dataset as a Pandas dataframe.
cache_path = "fraud_detection_cache.csv"
if not os.path.exists(cache_path):
    print("Download dataset")
    dataframes = []
    num_files = (end_date - start_date).days
    counter = 0
    while start_date <= end_date:
        if counter % (num_files // 10) == 0:
            print(f"[{100 * (counter+1) // num_files}%]", end="", flush=True)
        print(".", end="", flush=True)
        url = f"https://github.com/Fraud-Detection-Handbook/simulated-data-raw/raw/6e67dbd0a3bfe0d7ec33abc4bce5f37cd4ff0d6a/data/{start_date}.pkl"
        dataframes.append(pd.read_pickle(url))
        start_date += datetime.timedelta(days=1)
        counter += 1
    print("done", flush=True)
    transactions_dataframe = pd.concat(dataframes)
    transactions_dataframe.to_csv(cache_path, index=False)
else:
    print("Load dataset from cache")

transactions_dataframe = pd.read_csv(
    cache_path, dtype={"CUSTOMER_ID": bytes, "TERMINAL_ID": bytes}
)
print(f"Found {len(transactions_dataframe)} transactions")
Download dataset
[0%]..................[10%]..................[20%]..................[30%]..................[40%]..................[50%]..................[59%]..................[69%]..................[79%]..................[89%]..................[99%]...done
Found 1754155 transactions
Each transaction is represented by a single row, with the following columns of interest:
transactions_dataframe = transactions_dataframe[
["TX_DATETIME", "CUSTOMER_ID", "TERMINAL_ID", "TX_AMOUNT", "TX_FRAUD"]
]
transactions_dataframe.head(4)
 | TX_DATETIME | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_FRAUD |
---|---|---|---|---|---|
0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 0 |
1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 0 |
2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.00 | 0 |
3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 0 |
The dataset is highly imbalanced, with the majority of transactions being legitimate.
fraudulent_rate = transactions_dataframe["TX_FRAUD"].mean()
print("Rate of fraudulent transactions:", fraudulent_rate)
Rate of fraudulent transactions: 0.008369271814634397
The pandas dataframe is converted into a Temporian EventSet, which is better suited for the data exploration and feature preprocessing of the next steps.
transactions_evset = tp.from_pandas(transactions_dataframe, timestamps="TX_DATETIME")
transactions_evset
WARNING:root:Feature "CUSTOMER_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "TERMINAL_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
timestamp | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_FRAUD |
---|---|---|---|---|
2018-04-01 00:00:31+00:00 | 596 | 3156 | 57.16 | 0 |
2018-04-01 00:02:10+00:00 | 4961 | 3412 | 81.51 | 0 |
2018-04-01 00:07:56+00:00 | 2 | 1365 | 146 | 0 |
2018-04-01 00:09:29+00:00 | 4128 | 8737 | 64.49 | 0 |
2018-04-01 00:10:34+00:00 | 927 | 9906 | 50.99 | 0 |
… | … | … | … | … |
It is possible to plot the entire dataset, but the resulting plot would be difficult to read. Instead, we can group the transactions per client.
transactions_evset.add_index("CUSTOMER_ID").plot(indexes="3774")
Note the few fraudulent transactions for this client.
Fraudulent transactions cannot be detected in isolation. Instead, we need to connect related transactions. For each transaction, we compute the sum and count of transactions at the same terminal in the last n days. Because we don't know the correct value for n, we use multiple values of n and compute a set of features for each of them.
# Group the transactions per terminal
transactions_per_terminal = transactions_evset.add_index("TERMINAL_ID")

# Moving statistics per terminal
tmp_features = []
for n in [7, 14, 28]:
    tmp_features.append(
        transactions_per_terminal["TX_AMOUNT"]
        .moving_sum(tp.duration.days(n))
        .rename(f"sum_transactions_{n}_days")
    )

    tmp_features.append(
        transactions_per_terminal.moving_count(tp.duration.days(n)).rename(
            f"count_transactions_{n}_days"
        )
    )

feature_set_1 = tp.glue(*tmp_features)
feature_set_1
timestamp | sum_transactions_7_days | count_transactions_7_days | sum_transactions_14_days | count_transactions_14_days | sum_transactions_28_days | count_transactions_28_days |
---|---|---|---|---|---|---|
2018-04-02 01:00:01+00:00 | 16.07 | 1 | 16.07 | 1 | 16.07 | 1 |
2018-04-02 09:49:55+00:00 | 83.9 | 2 | 83.9 | 2 | 83.9 | 2 |
2018-04-03 12:14:41+00:00 | 110.7 | 3 | 110.7 | 3 | 110.7 | 3 |
2018-04-05 16:47:41+00:00 | 151.2 | 4 | 151.2 | 4 | 151.2 | 4 |
2018-04-07 06:05:21+00:00 | 199.6 | 5 | 199.6 | 5 | 199.6 | 5 |
… | … | … | … | … | … | … |
timestamp | sum_transactions_7_days | count_transactions_7_days | sum_transactions_14_days | count_transactions_14_days | sum_transactions_28_days | count_transactions_28_days |
---|---|---|---|---|---|---|
2018-04-01 16:24:39+00:00 | 70.36 | 1 | 70.36 | 1 | 70.36 | 1 |
2018-04-02 11:25:03+00:00 | 87.79 | 2 | 87.79 | 2 | 87.79 | 2 |
2018-04-04 08:31:48+00:00 | 211.6 | 3 | 211.6 | 3 | 211.6 | 3 |
2018-04-04 14:15:28+00:00 | 315 | 4 | 315 | 4 | 315 | 4 |
2018-04-04 20:54:17+00:00 | 446.5 | 5 | 446.5 | 5 | 446.5 | 5 |
… | … | … | … | … | … | … |
timestamp | sum_transactions_7_days | count_transactions_7_days | sum_transactions_14_days | count_transactions_14_days | sum_transactions_28_days | count_transactions_28_days |
---|---|---|---|---|---|---|
2018-04-01 14:11:55+00:00 | 2.9 | 1 | 2.9 | 1 | 2.9 | 1 |
2018-04-02 11:01:07+00:00 | 17.04 | 2 | 17.04 | 2 | 17.04 | 2 |
2018-04-03 13:46:58+00:00 | 118.2 | 3 | 118.2 | 3 | 118.2 | 3 |
2018-04-04 03:27:11+00:00 | 161.7 | 4 | 161.7 | 4 | 161.7 | 4 |
2018-04-05 17:58:10+00:00 | 171.3 | 5 | 171.3 | 5 | 171.3 | 5 |
… | … | … | … | … | … | … |
timestamp | sum_transactions_7_days | count_transactions_7_days | sum_transactions_14_days | count_transactions_14_days | sum_transactions_28_days | count_transactions_28_days |
---|---|---|---|---|---|---|
2018-04-02 10:37:42+00:00 | 6.31 | 1 | 6.31 | 1 | 6.31 | 1 |
2018-04-04 19:14:23+00:00 | 12.26 | 2 | 12.26 | 2 | 12.26 | 2 |
2018-04-07 04:01:22+00:00 | 65.12 | 3 | 65.12 | 3 | 65.12 | 3 |
2018-04-07 12:18:27+00:00 | 112.4 | 4 | 112.4 | 4 | 112.4 | 4 |
2018-04-07 21:11:03+00:00 | 170.4 | 5 | 170.4 | 5 | 170.4 | 5 |
… | … | … | … | … | … | … |
Let's look at the features of terminal "3774".
feature_set_1.plot(indexes="3774")
The fraud status of a transaction is not known at the time of the transaction (otherwise, there would be no problem). However, the bank knows whether a transaction is fraudulent one week after it is made. We create a set of features that indicate the number and rate of fraudulent transactions in the last N days.
# Lag the transactions by one week.
lagged_transactions = transactions_per_terminal.lag(tp.duration.weeks(1))

# Moving statistics per terminal
tmp_features = []
for n in [7, 14, 28]:
    tmp_features.append(
        lagged_transactions["TX_FRAUD"]
        .moving_sum(tp.duration.days(n), sampling=transactions_per_terminal)
        .rename(f"count_fraud_transactions_{n}_days")
    )

    tmp_features.append(
        lagged_transactions["TX_FRAUD"]
        .cast(tp.float32)
        .simple_moving_average(tp.duration.days(n), sampling=transactions_per_terminal)
        .rename(f"rate_fraud_transactions_{n}_days")
    )

feature_set_2 = tp.glue(*tmp_features)
The date and time of a transaction can be correlated with fraud. While each transaction has a timestamp, a machine learning model might struggle to consume them directly. Instead, we extract various informative calendar features from the timestamps, such as the hour, the day of the week (e.g., Monday, Tuesday), and the day of the month (1-31).
feature_set_3 = tp.glue(
transactions_per_terminal.calendar_hour(),
transactions_per_terminal.calendar_day_of_week(),
)
Finally, we group all the features and the label together.
all_data = tp.glue(
transactions_per_terminal, feature_set_1, feature_set_2, feature_set_3
).drop_index()
print("All the available features:")
all_data.schema.feature_names()
All the available features:
['CUSTOMER_ID',
'TX_AMOUNT',
'TX_FRAUD',
'sum_transactions_7_days',
'count_transactions_7_days',
'sum_transactions_14_days',
'count_transactions_14_days',
'sum_transactions_28_days',
'count_transactions_28_days',
'count_fraud_transactions_7_days',
'rate_fraud_transactions_7_days',
'count_fraud_transactions_14_days',
'rate_fraud_transactions_14_days',
'count_fraud_transactions_28_days',
'rate_fraud_transactions_28_days',
'calendar_hour',
'calendar_day_of_week',
'TERMINAL_ID']
We extract the names of the input features.
input_feature_names = [k for k in all_data.schema.feature_names() if k.islower()]
print("The model's input features:")
input_feature_names
The model's input features:
['sum_transactions_7_days',
'count_transactions_7_days',
'sum_transactions_14_days',
'count_transactions_14_days',
'sum_transactions_28_days',
'count_transactions_28_days',
'count_fraud_transactions_7_days',
'rate_fraud_transactions_7_days',
'count_fraud_transactions_14_days',
'rate_fraud_transactions_14_days',
'count_fraud_transactions_28_days',
'rate_fraud_transactions_28_days',
'calendar_hour',
'calendar_day_of_week']
For a neural network to work well, numerical inputs must be normalized. A common approach is z-normalization, where each value is shifted by the mean and scaled by the standard deviation, both estimated on the training data. In forecasting, such z-normalization is not recommended, as it would lead to future leakage. Specifically, to classify a transaction at time t, we cannot rely on data after time t, since at serving time, when making a prediction at time t, no subsequent data is available yet. In short, at time t we are limited to using data that precedes or is concurrent with time t.
The solution is therefore to apply z-normalization over time, which means that we normalize each transaction using the mean and standard deviation computed from that transaction's past data.
Future leakage is pernicious. Luckily, Temporian is here to help: the only operator that can create future leakage is EventSet.leak(). If you are not using EventSet.leak(), your preprocessing is guaranteed not to create future leakage.

Note: For advanced pipelines, you can also check programmatically that a feature does not depend on an EventSet.leak() operation.
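To make the notion of leakage concrete, here is a minimal, hypothetical sketch on toy data (not part of the fraud pipeline); it only assumes the EventSet.leak() operator mentioned above.

# A minimal illustrative sketch on toy data (not part of the fraud pipeline).
# leak() shifts timestamps backward, so "future" values become visible earlier;
# it is the one operator that can introduce future leakage into a preprocessing graph.
toy = tp.event_set(timestamps=[1.0, 2.0, 3.0], features={"x": [10.0, 20.0, 30.0]})
print(toy.leak(1.0))  # Each event now appears one time unit earlier than it happened.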
# Cast all values (e.g. ints) to floats.
values = all_data[input_feature_names].cast(tp.float32)

# Apply z-normalization over time.
normalized_features = (
    values - values.simple_moving_average(math.inf)
) / values.moving_standard_deviation(math.inf)

# Restore the original names of the features.
normalized_features = normalized_features.rename(values.schema.feature_names())
print(normalized_features)
indexes: []
features: [('sum_transactions_7_days', float32), ('count_transactions_7_days', float32), ('sum_transactions_14_days', float32), ('count_transactions_14_days', float32), ('sum_transactions_28_days', float32), ('count_transactions_28_days', float32), ('count_fraud_transactions_7_days', float32), ('rate_fraud_transactions_7_days', float32), ('count_fraud_transactions_14_days', float32), ('rate_fraud_transactions_14_days', float32), ('count_fraud_transactions_28_days', float32), ('rate_fraud_transactions_28_days', float32), ('calendar_hour', float32), ('calendar_day_of_week', float32)]
events:
(1754155 events):
timestamps: ['2018-04-01T00:00:31' '2018-04-01T00:02:10' '2018-04-01T00:07:56' ...
'2018-09-30T23:58:21' '2018-09-30T23:59:52' '2018-09-30T23:59:57']
'sum_transactions_7_days': [ 0. 1. 1.3636 ... -0.064 -0.2059 0.8428]
'count_transactions_7_days': [ nan nan nan ... 1.0128 0.6892 1.66 ]
'sum_transactions_14_days': [ 0. 1. 1.3636 ... -0.7811 0.156 1.379 ]
'count_transactions_14_days': [ nan nan nan ... 0.2969 0.2969 2.0532]
'sum_transactions_28_days': [ 0. 1. 1.3636 ... -0.7154 -0.2989 1.9396]
'count_transactions_28_days': [ nan nan nan ... 0.1172 -0.1958 1.8908]
'count_fraud_transactions_7_days': [ nan nan nan ... -0.1043 -0.1043 -0.1043]
'rate_fraud_transactions_7_days': [ nan nan nan ... -0.1137 -0.1137 -0.1137]
'count_fraud_transactions_14_days': [ nan nan nan ... -0.1133 -0.1133 0.9303]
'rate_fraud_transactions_14_days': [ nan nan nan ... -0.1216 -0.1216 0.5275]
...
memory usage: 112.3 MB
/home/gbm/my_venv/lib/python3.11/site-packages/temporian/implementation/numpy/operators/binary/arithmetic.py:100: RuntimeWarning: invalid value encountered in divide
return evset_1_feature / evset_2_feature
The first transactions will be normalized using poor estimates of the mean and standard deviation, since there are only a few transactions before them. To mitigate this issue, we remove the first week of data from the training dataset.
Notice that the first values contain NaN. In Temporian, NaN represents a missing value, and all operators handle it accordingly. For instance, when computing a moving average, NaN values are not included in the computation and do not produce a NaN result.
However, neural networks cannot natively handle NaN values, so we replace them with zeros.
normalized_features = normalized_features.fillna(0.0)
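As a small sanity check, here is a hypothetical sketch on toy data (not part of the pipeline) showing what fillna() does to a missing value.

# A minimal illustrative sketch on toy data (not part of the fraud pipeline):
# NaN marks a missing value, and fillna() replaces it with the given constant.
toy = tp.event_set(timestamps=[1.0, 2.0, 3.0], features={"x": [1.0, float("nan"), 3.0]})
print(toy.fillna(0.0))  # The NaN at t=2.0 becomes 0.0.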
Finally, we group the features and the label together.
normalized_all_data = tp.glue(normalized_features, all_data["TX_FRAUD"])
To evaluate the quality of our machine learning model, we need training, validation, and testing sets. Since the system is dynamic (new fraud patterns are being created all the time), it is important for the training set to come before the validation set, and for the validation set to come before the testing set.
To make the example run faster, we effectively reduce the size of the training set: training spans July 1, 2018 to July 31, 2018.
# begin_train = datetime.datetime(2018, 4, 8).timestamp()  # Full training dataset
begin_train = datetime.datetime(2018, 7, 1).timestamp()  # Reduced training dataset
begin_valid = datetime.datetime(2018, 8, 1).timestamp()
begin_test = datetime.datetime(2018, 9, 1).timestamp()

is_train = (normalized_all_data.timestamps() >= begin_train) & (
    normalized_all_data.timestamps() < begin_valid
)
is_valid = (normalized_all_data.timestamps() >= begin_valid) & (
    normalized_all_data.timestamps() < begin_test
)
is_test = normalized_all_data.timestamps() >= begin_test
is_train, is_valid, and is_test are boolean features over time that indicate the limits of the three folds. Let's plot them.
tp.plot(
    [
        is_train.rename("is_train"),
        is_valid.rename("is_valid"),
        is_test.rename("is_test"),
    ]
)
We filter the input features and label in each fold.
train_ds_evset = normalized_all_data.filter(is_train)
valid_ds_evset = normalized_all_data.filter(is_valid)
test_ds_evset = normalized_all_data.filter(is_test)
print(f"Training examples: {train_ds_evset.num_events()}")
print(f"Validation examples: {valid_ds_evset.num_events()}")
print(f"Testing examples: {test_ds_evset.num_events()}")
Training examples: 296924
Validation examples: 296579
Testing examples: 288064
It is important to split the dataset after the features have been computed, because some features of the training dataset are computed from transactions during the training window.
We convert the datasets from EventSets to TensorFlow Datasets, as Keras consumes them natively.
non_batched_train_ds = tp.to_tensorflow_dataset(train_ds_evset)
non_batched_valid_ds = tp.to_tensorflow_dataset(valid_ds_evset)
non_batched_test_ds = tp.to_tensorflow_dataset(test_ds_evset)
The following processing steps are applied using TensorFlow Datasets: the features and labels are separated with extract_features_and_label into the format expected by Keras. As we noted before, the dataset is imbalanced toward legitimate transactions. While we want to evaluate our model on this original distribution, neural networks often train poorly on strongly imbalanced datasets. Therefore, we resample the training dataset to a ratio of 80% legitimate / 20% fraudulent using rejection_resample.
def extract_features_and_label(example):
    features = {k: example[k] for k in input_feature_names}
    labels = tf.cast(example["TX_FRAUD"], tf.int32)
    return features, labels


# Target ratio of fraudulent transactions in the training dataset.
target_rate = 0.2

# Number of examples in a mini-batch.
batch_size = 32

train_ds = (
    non_batched_train_ds.shuffle(10000)
    .rejection_resample(
        class_func=lambda x: tf.cast(x["TX_FRAUD"], tf.int32),
        target_dist=[1 - target_rate, target_rate],
        initial_dist=[1 - fraudulent_rate, fraudulent_rate],
    )
    .map(lambda _, x: x)  # Remove the label copy added by "rejection_resample".
    .batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)

# The test and validation datasets do not need resampling or shuffling.
valid_ds = (
    non_batched_valid_ds.batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)
test_ds = (
    non_batched_test_ds.batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)
WARNING:tensorflow:From /home/gbm/my_venv/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py:4956: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
WARNING:tensorflow:From /home/gbm/my_venv/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py:4956: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
We print the first four examples of the training dataset. This is a simple way to identify some of the mistakes that could have been made above.
for features, labels in train_ds.take(1):
    print("features")
    for feature_name, feature_value in features.items():
        print(f"\t{feature_name}: {feature_value[:4]}")
    print(f"labels: {labels[:4]}")
features
sum_transactions_7_days: [-0.9417254 -1.1157728 -0.5594417 0.7264878]
count_transactions_7_days: [-0.23363686 -0.8702531 -0.23328805 0.7198456 ]
sum_transactions_14_days: [-0.9084115 2.8127224 0.7297886 0.0666021]
count_transactions_14_days: [-0.54289246 2.4122045 0.1963075 0.3798441 ]
sum_transactions_28_days: [-0.44202712 2.3494742 0.20992276 0.97425723]
count_transactions_28_days: [0.02585898 1.8197156 0.12127225 0.9692807 ]
count_fraud_transactions_7_days: [ 8.007475 -0.09783722 1.9282814 -0.09780706]
rate_fraud_transactions_7_days: [14.308702 -0.10952345 1.6929103 -0.10949575]
count_fraud_transactions_14_days: [12.411182 -0.1045466 1.0330476 -0.1045142]
rate_fraud_transactions_14_days: [15.742149 -0.11567765 1.0170861 -0.11565071]
count_fraud_transactions_28_days: [ 7.420907 -0.11298086 0.572011 -0.11293571]
rate_fraud_transactions_28_days: [10.065552 -0.12640427 0.5862939 -0.12637936]
calendar_hour: [-0.68766755 0.6972711 -1.6792761 0.49967623]
calendar_day_of_week: [1.492013 1.4789637 1.4978485 1.4818214]
labels: [1 0 0 0]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
The original dataset is transactional, but the processed data is tabular and only contains normalized numerical values. Therefore, we train a feed-forward neural network.
inputs = [keras.Input(shape=(1,), name=name) for name in input_feature_names]
x = keras.layers.concatenate(inputs)
x = keras.layers.Dense(32, activation="sigmoid")(x)
x = keras.layers.Dense(16, activation="sigmoid")(x)
x = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=x)
Our objective is to differentiate between fraudulent and legitimate transactions, so we use a binary classification objective. Because the dataset is imbalanced, accuracy is not an informative metric. Instead, we evaluate the model using the area under the curve (AUC).
model.compile(
    optimizer=keras.optimizers.Adam(0.01),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.Accuracy(), keras.metrics.AUC()],
)
model.fit(train_ds, validation_data=valid_ds)
5/Unknown 1s 15ms/step - accuracy: 0.0000e+00 - auc: 0.4480 - loss: 0.7678
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
433/Unknown 23s 51ms/step - accuracy: 0.0000e+00 - auc: 0.8060 - loss: 0.3632
/usr/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
self.gen.throw(typ, value, traceback)
433/433 ━━━━━━━━━━━━━━━━━━━━ 30s 67ms/step - accuracy: 0.0000e+00 - auc: 0.8060 - loss: 0.3631 - val_accuracy: 0.0000e+00 - val_auc: 0.8252 - val_loss: 0.2133
<keras.src.callbacks.history.History at 0x7f8f74f0d750>
We evaluate the model on the test dataset.
model.evaluate(test_ds)
9002/9002 ━━━━━━━━━━━━━━━━━━━━ 7s 811us/step - accuracy: 0.0000e+00 - auc: 0.8357 - loss: 0.2161
[0.2171599417924881, 0.0, 0.8266682028770447]
With an AUC of ~83%, our simple fraud detector is showing encouraging results.
Plotting the ROC curve is a good way to understand and select the operating point of the model, i.e., the threshold applied to the model output to differentiate between fraudulent and legitimate transactions.
Compute the test predictions:
predictions = model.predict(test_ds)
predictions = np.nan_to_num(predictions, nan=0)
9002/9002 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step
Extract the labels from the test set:
labels = np.concatenate([label for _, label in test_ds])
Finally, we plot the ROC curve.
_ = RocCurveDisplay.from_predictions(labels, predictions)
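For illustration, here is a minimal sketch of applying an operating point to the predictions computed above; the 0.5 threshold is a hypothetical choice, not one tuned from the ROC curve.

# A minimal sketch of applying an operating point (the 0.5 threshold is a
# hypothetical, untuned choice): transactions whose predicted probability
# exceeds the threshold are flagged as fraudulent.
threshold = 0.5
flagged = (predictions[:, 0] >= threshold).astype(int)
print("Number of flagged transactions:", flagged.sum())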
The Keras model is ready to be used on transactions with an unknown fraud status, a.k.a. serving. We save the model to disk for future use.
Note: The model does not include the data preparation and preprocessing steps done in Pandas and Temporian. They have to be applied manually to the data fed into the model. While not demonstrated here, Temporian preprocessing can also be saved to disk with tp.save.
model.save("fraud_detection_model.keras")
The model can be reloaded later with:
loaded_model = keras.saving.load_model("fraud_detection_model.keras")
# Generate predictions with the loaded model on 5 test examples.
loaded_model.predict(test_ds.rebatch(5).take(1))
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step
/usr/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
self.gen.throw(typ, value, traceback)
array([[0.08197185],
[0.16517264],
[0.13180313],
[0.10209075],
[0.14283912]], dtype=float32)
We trained a feed-forward neural network to identify fraudulent transactions. To feed them to the model, the transactions were preprocessed and transformed into a tabular dataset using Temporian. Now, a question for the reader: What could be done to further improve the model's performance?
Here are some ideas: