
Event classification for payment card fraud detection

Author: achoum
Date created: 2024/02/01
Last modified: 2024/02/01
Description: Detection of fraudulent payment card transactions using Temporian and a feed-forward neural network.

ⓘ This example uses Keras 3


This notebook depends on Keras 3, Temporian, and a few other libraries. You can install them as follows:

pip install temporian keras pandas tf-nightly scikit-learn -U
import keras  # To train the Machine Learning model
import temporian as tp  # To convert transactions into tabular data

import numpy as np
import os
import pandas as pd
import datetime
import math
import tensorflow as tf
from sklearn.metrics import RocCurveDisplay

Introduction

Payment fraud detection is critical for banks, businesses, and consumers. In Europe alone, fraudulent transactions were estimated at €1.89 billion in 2019. Worldwide, approximately 3.6% of commerce revenue is lost to fraud. In this notebook, we train and evaluate a model to detect fraudulent transactions using the synthetic dataset attached to the book Reproducible Machine Learning for Credit Card Fraud Detection by Le Borgne et al.

Fraudulent transactions often cannot be detected by looking at transactions in isolation. Instead, fraudulent transactions are detected by looking at patterns across multiple transactions from the same user, at the same merchant, or with other types of relationships. To express these relationships in a way that is understandable by a machine learning model, and to augment features with feature engineering, we use the Temporian preprocessing library.

We preprocess a transaction dataset into a tabular dataset and use a feed-forward neural network to learn the patterns of fraud and make predictions.


Loading the dataset

The dataset contains payment transactions sampled between April 1, 2018 and September 30, 2018. The transactions are stored in CSV files, one for each day.

Note: Downloading the dataset takes ~1 minute.

start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 9, 30)

# Load the dataset as a Pandas dataframe.
cache_path = "fraud_detection_cache.csv"
if not os.path.exists(cache_path):
    print("Download dataset")
    dataframes = []
    num_files = (end_date - start_date).days
    counter = 0
    while start_date <= end_date:
        if counter % (num_files // 10) == 0:
            print(f"[{100 * (counter+1) // num_files}%]", end="", flush=True)
        print(".", end="", flush=True)
        url = f"https://github.com/Fraud-Detection-Handbook/simulated-data-raw/raw/6e67dbd0a3bfe0d7ec33abc4bce5f37cd4ff0d6a/data/{start_date}.pkl"
        dataframes.append(pd.read_pickle(url))
        start_date += datetime.timedelta(days=1)
        counter += 1
    print("done", flush=True)
    transactions_dataframe = pd.concat(dataframes)
    transactions_dataframe.to_csv(cache_path, index=False)
else:
    print("Load dataset from cache")
    transactions_dataframe = pd.read_csv(
        cache_path, dtype={"CUSTOMER_ID": bytes, "TERMINAL_ID": bytes}
    )

print(f"Found {len(transactions_dataframe)} transactions")
Download dataset
[0%]..................[10%]..................[20%]..................[30%]..................[40%]..................[50%]..................[59%]..................[69%]..................[79%]..................[89%]..................[99%]...done
Found 1754155 transactions

Each transaction is represented by a single row, with the following columns of interest:

  • TX_DATETIME: The date and time of the transaction.
  • CUSTOMER_ID: The unique identifier of the customer.
  • TERMINAL_ID: The identifier of the terminal where the transaction was made.
  • TX_AMOUNT: The amount of the transaction.
  • TX_FRAUD: Whether the transaction is fraudulent (1) or not (0).
transactions_dataframe = transactions_dataframe[
    ["TX_DATETIME", "CUSTOMER_ID", "TERMINAL_ID", "TX_AMOUNT", "TX_FRAUD"]
]

transactions_dataframe.head(4)
TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_FRAUD
0 2018-04-01 00:00:31 596 3156 57.16 0
1 2018-04-01 00:02:10 4961 3412 81.51 0
2 2018-04-01 00:07:56 2 1365 146.00 0
3 2018-04-01 00:09:29 4128 8737 64.49 0

The dataset is highly imbalanced, with the majority of the transactions being legitimate.

fraudulent_rate = transactions_dataframe["TX_FRAUD"].mean()
print("Rate of fraudulent transactions:", fraudulent_rate)
Rate of fraudulent transactions: 0.008369271814634397

The pandas dataframe is converted into a Temporian EventSet, which is better suited for the data exploration and feature preprocessing of the next steps.

transactions_evset = tp.from_pandas(transactions_dataframe, timestamps="TX_DATETIME")

transactions_evset
WARNING:root:Feature "CUSTOMER_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "TERMINAL_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
features [4]: CUSTOMER_ID (str_) , TERMINAL_ID (str_) , TX_AMOUNT (float64) , TX_FRAUD (int64)
indexes [0]:
events: 1754155
index values: 1
memory usage: 28.1 MB
index ( ) with 1754155 events
timestamp CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_FRAUD
2018-04-01 00:00:31+00:00 596 3156 57.16 0
2018-04-01 00:02:10+00:00 4961 3412 81.51 0
2018-04-01 00:07:56+00:00 2 1365 146 0
2018-04-01 00:09:29+00:00 4128 8737 64.49 0
2018-04-01 00:10:34+00:00 927 9906 50.99 0

It is possible to plot the entire dataset, but the resulting plot would be difficult to read. Instead, we can group the transactions per client.

transactions_evset.add_index("CUSTOMER_ID").plot(indexes="3774")

png

Note the few fraudulent transactions for this client.


Preparing the training data

Fraudulent transactions cannot be detected in isolation. Instead, we need to connect related transactions. For each transaction, we compute the sum and count of transactions at the same terminal in the last n days. Because we don't know the correct value for n, we use multiple values for n and compute a set of features for each of them.

# Group the transactions per terminal
transactions_per_terminal = transactions_evset.add_index("TERMINAL_ID")

# Moving statistics per terminal
tmp_features = []
for n in [7, 14, 28]:
    tmp_features.append(
        transactions_per_terminal["TX_AMOUNT"]
        .moving_sum(tp.duration.days(n))
        .rename(f"sum_transactions_{n}_days")
    )

    tmp_features.append(
        transactions_per_terminal.moving_count(tp.duration.days(n)).rename(
            f"count_transactions_{n}_days"
        )
    )

feature_set_1 = tp.glue(*tmp_features)

feature_set_1
features [6]: sum_transactions_7_days (float64) , count_transactions_7_days (int32) , sum_transactions_14_days (float64) , count_transactions_14_days (int32) , sum_transactions_28_days (float64) , count_transactions_28_days (int32)
indexes [1]: TERMINAL_ID (str_)
events: 1754155
index values: 10000
memory usage: 85.8 MB
index ( TERMINAL_ID: 0 ) with 178 events
timestamp sum_transactions_7_days count_transactions_7_days sum_transactions_14_days count_transactions_14_days sum_transactions_28_days count_transactions_28_days
2018-04-02 01:00:01+00:00 16.07 1 16.07 1 16.07 1
2018-04-02 09:49:55+00:00 83.9 2 83.9 2 83.9 2
2018-04-03 12:14:41+00:00 110.7 3 110.7 3 110.7 3
2018-04-05 16:47:41+00:00 151.2 4 151.2 4 151.2 4
2018-04-07 06:05:21+00:00 199.6 5 199.6 5 199.6 5
index ( TERMINAL_ID: 1 ) with 139 events
timestamp sum_transactions_7_days count_transactions_7_days sum_transactions_14_days count_transactions_14_days sum_transactions_28_days count_transactions_28_days
2018-04-01 16:24:39+00:00 70.36 1 70.36 1 70.36 1
2018-04-02 11:25:03+00:00 87.79 2 87.79 2 87.79 2
2018-04-04 08:31:48+00:00 211.6 3 211.6 3 211.6 3
2018-04-04 14:15:28+00:00 315 4 315 4 315 4
2018-04-04 20:54:17+00:00 446.5 5 446.5 5 446.5 5
index ( TERMINAL_ID: 10 ) with 151 events
timestamp sum_transactions_7_days count_transactions_7_days sum_transactions_14_days count_transactions_14_days sum_transactions_28_days count_transactions_28_days
2018-04-01 14:11:55+00:00 2.9 1 2.9 1 2.9 1
2018-04-02 11:01:07+00:00 17.04 2 17.04 2 17.04 2
2018-04-03 13:46:58+00:00 118.2 3 118.2 3 118.2 3
2018-04-04 03:27:11+00:00 161.7 4 161.7 4 161.7 4
2018-04-05 17:58:10+00:00 171.3 5 171.3 5 171.3 5
index ( TERMINAL_ID: 100 ) with 188 events
timestamp sum_transactions_7_days count_transactions_7_days sum_transactions_14_days count_transactions_14_days sum_transactions_28_days count_transactions_28_days
2018-04-02 10:37:42+00:00 6.31 1 6.31 1 6.31 1
2018-04-04 19:14:23+00:00 12.26 2 12.26 2 12.26 2
2018-04-07 04:01:22+00:00 65.12 3 65.12 3 65.12 3
2018-04-07 12:18:27+00:00 112.4 4 112.4 4 112.4 4
2018-04-07 21:11:03+00:00 170.4 5 170.4 5 170.4 5
… (9996 more indexes not shown)

Let's look at the features of terminal "3774".

feature_set_1.plot(indexes="3774")

png

The fraud status of a transaction is not known at the time of the transaction (otherwise there would be no problem). However, the banks know if a transaction is fraudulent one week after it is made. We create a set of features that indicate the number and rate of fraudulent transactions in the last N days.

# Lag the transactions by one week.
lagged_transactions = transactions_per_terminal.lag(tp.duration.weeks(1))

# Moving statistics per customer
tmp_features = []
for n in [7, 14, 28]:
    tmp_features.append(
        lagged_transactions["TX_FRAUD"]
        .moving_sum(tp.duration.days(n), sampling=transactions_per_terminal)
        .rename(f"count_fraud_transactions_{n}_days")
    )

    tmp_features.append(
        lagged_transactions["TX_FRAUD"]
        .cast(tp.float32)
        .simple_moving_average(tp.duration.days(n), sampling=transactions_per_terminal)
        .rename(f"rate_fraud_transactions_{n}_days")
    )

feature_set_2 = tp.glue(*tmp_features)

The date and time of a transaction can be correlated with fraud. While each transaction has a timestamp, a machine learning model might struggle to consume them directly. Instead, we extract various informative calendar features from the timestamps, such as the hour, the day of the week (e.g., Monday, Tuesday), and the day of the month (1-31).

feature_set_3 = tp.glue(
    transactions_per_terminal.calendar_hour(),
    transactions_per_terminal.calendar_day_of_week(),
)

Finally, we group together all the features and the label.

all_data = tp.glue(
    transactions_per_terminal, feature_set_1, feature_set_2, feature_set_3
).drop_index()

print("All the available features:")
all_data.schema.feature_names()
All the available features:

['CUSTOMER_ID',
 'TX_AMOUNT',
 'TX_FRAUD',
 'sum_transactions_7_days',
 'count_transactions_7_days',
 'sum_transactions_14_days',
 'count_transactions_14_days',
 'sum_transactions_28_days',
 'count_transactions_28_days',
 'count_fraud_transactions_7_days',
 'rate_fraud_transactions_7_days',
 'count_fraud_transactions_14_days',
 'rate_fraud_transactions_14_days',
 'count_fraud_transactions_28_days',
 'rate_fraud_transactions_28_days',
 'calendar_hour',
 'calendar_day_of_week',
 'TERMINAL_ID']

We extract the names of the input features. Note that the raw columns are uppercase while the engineered features are lowercase, so the input features can be selected with islower().

input_feature_names = [k for k in all_data.schema.feature_names() if k.islower()]

print("The model's input features:")
input_feature_names
The model's input features:

['sum_transactions_7_days',
 'count_transactions_7_days',
 'sum_transactions_14_days',
 'count_transactions_14_days',
 'sum_transactions_28_days',
 'count_transactions_28_days',
 'count_fraud_transactions_7_days',
 'rate_fraud_transactions_7_days',
 'count_fraud_transactions_14_days',
 'rate_fraud_transactions_14_days',
 'count_fraud_transactions_28_days',
 'rate_fraud_transactions_28_days',
 'calendar_hour',
 'calendar_day_of_week']

For the neural network to work well, numerical inputs must be normalized. A common approach is to apply z-normalization, which involves subtracting the mean and dividing by the standard deviation, both estimated from the training data, for each value. In forecasting, such z-normalization is not recommended as it would lead to future leakage. Specifically, to classify a transaction at time t, we cannot rely on data after time t since, at serving time when making a prediction at time t, no subsequent data is available yet. In short, at time t, we are limited to using data that precedes or is concurrent with time t.

The solution is therefore to apply z-normalization over time, which means that we normalize each transaction using the mean and standard deviation computed from the past data of that transaction.

Future leakage is pernicious. Luckily, Temporian is here to help: the only operator that can cause future leakage is EventSet.leak(). If you are not using EventSet.leak(), your preprocessing is guaranteed not to create future leakage.

Note: For advanced pipelines, you can also check programmatically that a feature does not depend on an EventSet.leak() operation.

# Cast all values (e.g. ints) to floats.
values = all_data[input_feature_names].cast(tp.float32)

# Apply z-normalization over time.
normalized_features = (
    values - values.simple_moving_average(math.inf)
) / values.moving_standard_deviation(math.inf)

# Restore the original name of the features.
normalized_features = normalized_features.rename(values.schema.feature_names())

print(normalized_features)
indexes: []
features: [('sum_transactions_7_days', float32), ('count_transactions_7_days', float32), ('sum_transactions_14_days', float32), ('count_transactions_14_days', float32), ('sum_transactions_28_days', float32), ('count_transactions_28_days', float32), ('count_fraud_transactions_7_days', float32), ('rate_fraud_transactions_7_days', float32), ('count_fraud_transactions_14_days', float32), ('rate_fraud_transactions_14_days', float32), ('count_fraud_transactions_28_days', float32), ('rate_fraud_transactions_28_days', float32), ('calendar_hour', float32), ('calendar_day_of_week', float32)]
events:
     (1754155 events):
        timestamps: ['2018-04-01T00:00:31' '2018-04-01T00:02:10' '2018-04-01T00:07:56' ...
     '2018-09-30T23:58:21' '2018-09-30T23:59:52' '2018-09-30T23:59:57']
        'sum_transactions_7_days': [ 0.      1.      1.3636 ... -0.064  -0.2059  0.8428]
        'count_transactions_7_days': [   nan    nan    nan ... 1.0128 0.6892 1.66  ]
        'sum_transactions_14_days': [ 0.      1.      1.3636 ... -0.7811  0.156   1.379 ]
        'count_transactions_14_days': [   nan    nan    nan ... 0.2969 0.2969 2.0532]
        'sum_transactions_28_days': [ 0.      1.      1.3636 ... -0.7154 -0.2989  1.9396]
        'count_transactions_28_days': [    nan     nan     nan ...  0.1172 -0.1958  1.8908]
        'count_fraud_transactions_7_days': [    nan     nan     nan ... -0.1043 -0.1043 -0.1043]
        'rate_fraud_transactions_7_days': [    nan     nan     nan ... -0.1137 -0.1137 -0.1137]
        'count_fraud_transactions_14_days': [    nan     nan     nan ... -0.1133 -0.1133  0.9303]
        'rate_fraud_transactions_14_days': [    nan     nan     nan ... -0.1216 -0.1216  0.5275]
        ...
memory usage: 112.3 MB
/home/gbm/my_venv/lib/python3.11/site-packages/temporian/implementation/numpy/operators/binary/arithmetic.py:100: RuntimeWarning: invalid value encountered in divide
  return evset_1_feature / evset_2_feature

The first transactions will be normalized using poor estimates of the mean and standard deviation since there are only a few transactions before them. To mitigate this issue, we remove the first week of data from the training dataset.
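The notebook achieves this removal below, when the training fold is defined to start on April 8. As a standalone sketch (an assumption on our part, not code from the original example), the same effect could be obtained with the filter and timestamps operators used later in this notebook:

# Sketch: drop the first week of data (before April 8, 2018). The notebook
# obtains the same effect below by starting the training fold on this date.
first_week_end = datetime.datetime(2018, 4, 8).timestamp()
without_first_week = normalized_features.filter(
    normalized_features.timestamps() >= first_week_end
)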

Note that the first values contain NaN. In Temporian, NaN represents a missing value, and all operators handle them accordingly. For instance, when computing a moving average, NaN values are not included in the computation and do not generate a NaN result.

However, neural networks cannot natively handle NaN values. So, we replace them with zeros.

normalized_features = normalized_features.fillna(0.0)

Finally, we group the features and the labels together.

normalized_all_data = tp.glue(normalized_features, all_data["TX_FRAUD"])

Split the dataset into a train, validation and test set

To evaluate the quality of our machine learning model, we need training, validation, and test sets. Since the system is dynamic (new fraud patterns are created all the time), it is important for the training set to come before the validation set, and the validation set to come before the test set:

  • Training: April 8, 2018 to July 31, 2018
  • Validation: August 1, 2018 to August 31, 2018
  • Testing: September 1, 2018 to September 30, 2018

To make the example run faster, we effectively reduce the size of the training set:

  • Training: July 1, 2018 to July 31, 2018

# begin_train = datetime.datetime(2018, 4, 8).timestamp() # Full training dataset
begin_train = datetime.datetime(2018, 7, 1).timestamp()  # Reduced training dataset
begin_valid = datetime.datetime(2018, 8, 1).timestamp()
begin_test = datetime.datetime(2018, 9, 1).timestamp()

is_train = (normalized_all_data.timestamps() >= begin_train) & (
    normalized_all_data.timestamps() < begin_valid
)
is_valid = (normalized_all_data.timestamps() >= begin_valid) & (
    normalized_all_data.timestamps() < begin_test
)
is_test = normalized_all_data.timestamps() >= begin_test

is_train, is_valid and is_test are boolean features over time that indicate the limits of the three folds. Let's plot them.

tp.plot(
    [
        is_train.rename("is_train"),
        is_valid.rename("is_valid"),
        is_test.rename("is_test"),
    ]
)

png

We filter the input features and labels in each fold.

train_ds_evset = normalized_all_data.filter(is_train)
valid_ds_evset = normalized_all_data.filter(is_valid)
test_ds_evset = normalized_all_data.filter(is_test)

print(f"Training examples: {train_ds_evset.num_events()}")
print(f"Validation examples: {valid_ds_evset.num_events()}")
print(f"Testing examples: {test_ds_evset.num_events()}")
Training examples: 296924
Validation examples: 296579
Testing examples: 288064

It is important to split the dataset after computing the features, because some features of the training dataset are computed from transactions during the training window.


Create TensorFlow datasets

We convert the datasets from EventSets to TensorFlow datasets, as Keras consumes them natively.

non_batched_train_ds = tp.to_tensorflow_dataset(train_ds_evset)
non_batched_valid_ds = tp.to_tensorflow_dataset(valid_ds_evset)
non_batched_test_ds = tp.to_tensorflow_dataset(test_ds_evset)

The following processing steps are applied using TensorFlow datasets:

  1. The features and labels are separated using extract_features_and_label in the format expected by Keras.
  2. The dataset is batched, which means that the examples are grouped into mini-batches.
  3. The training examples are shuffled to improve the quality of mini-batch training.

As we noted before, the dataset is imbalanced in the direction of the legitimate transactions. While we want to evaluate our model on this original distribution, neural networks often train poorly on strongly imbalanced datasets. Therefore, we resample the training dataset to a ratio of 80% legitimate / 20% fraudulent using rejection_resample (an alternative based on class weights is sketched after the code below).

def extract_features_and_label(example):
    features = {k: example[k] for k in input_feature_names}
    labels = tf.cast(example["TX_FRAUD"], tf.int32)
    return features, labels


# Target ratio of fraudulent transactions in the training dataset.
target_rate = 0.2

# Number of examples in a mini-batch.
batch_size = 32

train_ds = (
    non_batched_train_ds.shuffle(10000)
    .rejection_resample(
        class_func=lambda x: tf.cast(x["TX_FRAUD"], tf.int32),
        target_dist=[1 - target_rate, target_rate],
        initial_dist=[1 - fraudulent_rate, fraudulent_rate],
    )
    .map(lambda _, x: x)  # Remove the label copy added by "rejection_resample".
    .batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)

# The test and validation datasets do not need resampling or shuffling.
valid_ds = (
    non_batched_valid_ds.batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)
test_ds = (
    non_batched_test_ds.batch(batch_size)
    .map(extract_features_and_label)
    .prefetch(tf.data.AUTOTUNE)
)
WARNING:tensorflow:From /home/gbm/my_venv/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py:4956: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
WARNING:tensorflow:From /home/gbm/my_venv/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py:4956: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
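As an aside, an alternative to rejection_resample is to keep the original distribution and reweight the classes in the loss instead. A minimal sketch using Keras' class_weight argument (the inverse-frequency weighting scheme here is an assumption for illustration, not part of the original example):

# Sketch of an alternative to resampling: weight the rare fraudulent class
# more heavily in the loss. The inverse-frequency scheme is illustrative.
class_weight = {
    0: 1.0,  # Legitimate transactions.
    1: (1 - fraudulent_rate) / fraudulent_rate,  # Fraudulent transactions.
}
# It would then be passed to training as: model.fit(..., class_weight=class_weight)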

We print the first four examples of the training dataset. This is a simple way to identify some of the errors that could have been made above.

for features, labels in train_ds.take(1):
    print("features")
    for feature_name, feature_value in features.items():
        print(f"\t{feature_name}: {feature_value[:4]}")
    print(f"labels: {labels[:4]}")
features
    sum_transactions_7_days: [-0.9417254 -1.1157728 -0.5594417  0.7264878]
    count_transactions_7_days: [-0.23363686 -0.8702531  -0.23328805  0.7198456 ]
    sum_transactions_14_days: [-0.9084115  2.8127224  0.7297886  0.0666021]
    count_transactions_14_days: [-0.54289246  2.4122045   0.1963075   0.3798441 ]
    sum_transactions_28_days: [-0.44202712  2.3494742   0.20992276  0.97425723]
    count_transactions_28_days: [0.02585898 1.8197156  0.12127225 0.9692807 ]
    count_fraud_transactions_7_days: [ 8.007475   -0.09783722  1.9282814  -0.09780706]
    rate_fraud_transactions_7_days: [14.308702   -0.10952345  1.6929103  -0.10949575]
    count_fraud_transactions_14_days: [12.411182  -0.1045466  1.0330476 -0.1045142]
    rate_fraud_transactions_14_days: [15.742149   -0.11567765  1.0170861  -0.11565071]
    count_fraud_transactions_28_days: [ 7.420907   -0.11298086  0.572011   -0.11293571]
    rate_fraud_transactions_28_days: [10.065552   -0.12640427  0.5862939  -0.12637936]
    calendar_hour: [-0.68766755  0.6972711  -1.6792761   0.49967623]
    calendar_day_of_week: [1.492013  1.4789637 1.4978485 1.4818214]
labels: [1 0 0 0]

Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]

Train the model

The original dataset is transactional, but the processed data is tabular and only contains normalized numerical values. Therefore, we train a feed-forward neural network.

inputs = [keras.Input(shape=(1,), name=name) for name in input_feature_names]
x = keras.layers.concatenate(inputs)
x = keras.layers.Dense(32, activation="sigmoid")(x)
x = keras.layers.Dense(16, activation="sigmoid")(x)
x = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=x)

Our objective is to differentiate between the fraudulent and legitimate transactions, so we use a binary classification objective. Because the dataset is imbalanced, accuracy is not an informative metric. Instead, we evaluate the model using the area under the curve (AUC).

model.compile(
    optimizer=keras.optimizers.Adam(0.01),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.Accuracy(), keras.metrics.AUC()],
)
model.fit(train_ds, validation_data=valid_ds)
      5/Unknown  1s 15ms/step - accuracy: 0.0000e+00 - auc: 0.4480 - loss: 0.7678

Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]
Proportion of examples rejected by sampler is high: [0.991630733][0.991630733 0.00836927164][0 1]

    433/Unknown  23s 51ms/step - accuracy: 0.0000e+00 - auc: 0.8060 - loss: 0.3632

/usr/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(typ, value, traceback)

 433/433 ━━━━━━━━━━━━━━━━━━━━ 30s 67ms/step - accuracy: 0.0000e+00 - auc: 0.8060 - loss: 0.3631 - val_accuracy: 0.0000e+00 - val_auc: 0.8252 - val_loss: 0.2133

<keras.src.callbacks.history.History at 0x7f8f74f0d750>

We evaluate the model on the test dataset.

model.evaluate(test_ds)
 9002/9002 ━━━━━━━━━━━━━━━━━━━━ 7s 811us/step - accuracy: 0.0000e+00 - auc: 0.8357 - loss: 0.2161

[0.2171599417924881, 0.0, 0.8266682028770447]

With an AUC of ~83%, our simple fraud detector is showing encouraging results.

Plotting the ROC curve is a good way to understand and select the operating point of the model, i.e., the threshold applied to the model output to differentiate between fraudulent and legitimate transactions.

Compute the test predictions:

predictions = model.predict(test_ds)
predictions = np.nan_to_num(predictions, nan=0)
 9002/9002 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step

Extract the labels from the test set:

labels = np.concatenate([label for _, label in test_ds])

Finally, we plot the ROC curve.

_ = RocCurveDisplay.from_predictions(labels, predictions)

png
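Once an operating point has been chosen from the curve, applying it is straightforward. A minimal sketch (the 0.5 threshold is purely illustrative; it should be picked according to the desired trade-off between false and true positive rates):

# Sketch: apply an operating point chosen from the ROC curve.
threshold = 0.5  # Illustrative value, not tuned.
flagged_as_fraud = (predictions[:, 0] >= threshold).astype(int)
print("Transactions flagged as fraudulent:", flagged_as_fraud.sum())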

The Keras model is now ready to be used on transactions with an unknown fraud status, a.k.a. serving. We save the model to disk for future use.

Note: The model does not include the data preparation and preprocessing steps done in Pandas and Temporian. They have to be applied manually to the data fed into the model. While not demonstrated here, Temporian preprocessing can also be saved to disk with tp.save.

model.save("fraud_detection_model.keras")

The model can later be re-loaded with:

loaded_model = keras.saving.load_model("fraud_detection_model.keras")

# Generate predictions with the loaded model on 5 test examples.
loaded_model.predict(test_ds.rebatch(5).take(1))
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step

/usr/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(typ, value, traceback)

array([[0.08197185],
       [0.16517264],
       [0.13180313],
       [0.10209075],
       [0.14283912]], dtype=float32)

Conclusion

We trained a feed-forward neural network to identify fraudulent transactions. To feed them into the model, the transactions were preprocessed and transformed into a tabular dataset using Temporian. Now, a question to the reader: what could be done to further improve the model's performance?

Here are some ideas:

  • Train the model on the entire dataset instead of a single month of data.
  • Train the model for more epochs and use early stopping to ensure that the model is fully trained without overfitting (see the sketch after this list).
  • Make the feed-forward network more powerful with additional layers while ensuring that the model is regularized.
  • Compute additional preprocessing features. For example, in addition to aggregating transactions per terminal, aggregate transactions per client.
  • Use the Keras Tuner to perform hyperparameter tuning on the model. Note that the parameters of the preprocessing (e.g., the number of days of aggregations) are also hyperparameters that can be tuned.
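As an illustration of the early-stopping idea above, here is a minimal sketch (the epochs and patience values are assumptions, not tuned) that monitors the validation AUC reported during training:

# Sketch: train for more epochs with early stopping on the validation AUC.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_auc",  # The AUC metric defined in model.compile() above.
    mode="max",  # Higher AUC is better.
    patience=3,  # Stop after 3 epochs without improvement.
    restore_best_weights=True,
)
model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=20,  # Illustrative; early stopping will likely end training sooner.
    callbacks=[early_stopping],
)
# Depending on the input pipeline, `.repeat()` may be needed on the training
# dataset (see the "ran out of data" warning earlier in this notebook).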