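Description: Using TensorFlow Decision Forests for structured data classification.
TensorFlow Decision Forests is a collection of state-of-the-art Decision Forest algorithms that are compatible with the Keras API. The models include Random Forests, Gradient Boosted Trees, and CART, and can be used for regression, classification, and ranking tasks. For a beginner's guide to TensorFlow Decision Forests, please refer to this tutorial.
This example uses TensorFlow 2.7 or higher, as well as TensorFlow Decision Forests, which you can install with the following command: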
pip install -U tensorflow_decision_forests
import math
import urllib.request
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_decision_forests as tfdf
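This example uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository. The task is binary classification: determine whether a person makes over 50K USD a year.
The dataset includes about 300K instances with 41 input features: 7 numerical features and 34 categorical features.
First we load the data from the UCI Machine Learning Repository into a Pandas DataFrame.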
BASE_PATH = "https://kdd.ics.uci.edu/databases/census-income/census-income"
CSV_HEADER = [
    l.decode("utf-8").split(":")[0].replace(" ", "_")
    for l in urllib.request.urlopen(f"{BASE_PATH}.names")
    if not l.startswith(b"|")
][2:]
CSV_HEADER.append("income_level")
train_data = pd.read_csv(f"{BASE_PATH}.data.gz", header=None, names=CSV_HEADER,)
test_data = pd.read_csv(f"{BASE_PATH}.test.gz", header=None, names=CSV_HEADER,)
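To see what the header-parsing expression does without downloading anything, the following minimal sketch applies the same decode, split, and replace steps to a few made-up lines. The `parse_header` helper and the sample lines are illustrative assumptions; the real `.names` file is longer and has its own preamble.

```python
def parse_header(lines):
    """Apply the decode/split/replace steps to raw byte lines."""
    return [
        line.decode("utf-8").split(":")[0].replace(" ", "_")
        for line in lines
        if not line.startswith(b"|")  # lines starting with "|" are comments
    ]


# Hypothetical sample in the style of a UCI ".names" file.
sample_lines = [
    b"| This line is a comment and is skipped\n",
    b"age: continuous.\n",
    b"class of worker: Private, Self-employed, Never worked.\n",
]

column_names = parse_header(sample_lines)
print(column_names)  # ['age', 'class_of_worker']
```

Everything before the first colon becomes the column name, and spaces are replaced with underscores so the names are easier to use as identifiers.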
# Target column name.
TARGET_COLUMN_NAME = "income_level"
# The labels of the target columns.
TARGET_LABELS = [" - 50000.", " 50000+."]
# Weight column name.
WEIGHT_COLUMN_NAME = "instance_weight"
# Numeric feature names.
# Categorical features and their vocabulary lists.
def prepare_dataframe(dataframe):
    # Convert the target labels from string to integer.
    dataframe[TARGET_COLUMN_NAME] = dataframe[TARGET_COLUMN_NAME].map(
        TARGET_LABELS.index
    )
    # Cast the categorical features to string.
    for feature_name in CATEGORICAL_FEATURE_NAMES:
        dataframe[feature_name] = dataframe[feature_name].astype(str)


prepare_dataframe(train_data)
prepare_dataframe(test_data)
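As a small illustration of the label conversion, assuming each label string is mapped to its position in `TARGET_LABELS`, the sketch below does the same thing on a plain Python list. The `raw_labels` values are made up for the example.

```python
TARGET_LABELS = [" - 50000.", " 50000+."]

# Hypothetical raw label values, including the leading space found in the CSV.
raw_labels = [" - 50000.", " 50000+.", " - 50000."]

# list.index returns the position of each label: 0 for negative, 1 for positive.
encoded_labels = [TARGET_LABELS.index(label) for label in raw_labels]
print(encoded_labels)  # [0, 1, 0]
```

A label that is not in `TARGET_LABELS` raises a `ValueError` here, which makes unexpected label strings easy to spot.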
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
Train data shape: (199523, 42)
Test data shape: (99762, 42)
# Maximum number of decision trees. The effective number of trained trees can be smaller if early stopping is enabled.
# Minimum number of examples in a node.
# Maximum depth of the tree. max_depth=1 means that all trees will be roots.
# Ratio of the dataset (sampling without replacement) used to train individual trees for the random sampling method.
# Control the sampling of the datasets used to train individual trees.
# Ratio of the training dataset used to monitor the training. Require to be >0 if early stopping is enabled.
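Note that when training a Decision Forests model, only one epoch is needed to read the full dataset. Any extra steps only slow training down unnecessarily. Therefore, the default num_epochs=1 is used in the run_experiment() method.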
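def run_experiment(model, train_data, test_data, num_epochs=1, batch_size=None):
    # Convert the DataFrames to TensorFlow datasets with label and weight columns.
    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        train_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )
    test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        test_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )
    model.fit(train_dataset, epochs=num_epochs, batch_size=batch_size)
    _, accuracy = model.evaluate(test_dataset, verbose=0)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")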
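You can attach semantics to each feature to control how the model uses it. If not specified, the semantics are inferred from the representation type. It is recommended to specify the feature usages explicitly to avoid incorrectly inferred semantics. For example, a categorical value identifier (integer) will be inferred as numerical, although it is semantically categorical.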
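For numerical features, you can set the discretized parameter of tfdf.keras.FeatureUsage to the number of buckets the feature should be discretized into. Discretizing speeds up training but may lead to worse models.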
def specify_feature_usages():
    feature_usages = []

    for feature_name in NUMERIC_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.NUMERICAL
        )
        feature_usages.append(feature_usage)

    for feature_name in CATEGORICAL_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.CATEGORICAL
        )
        feature_usages.append(feature_usage)

    return feature_usages
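To make the risk of inferred semantics concrete, here is a toy sketch of type-based inference. The `infer_semantic` function is a hypothetical stand-in written for this example, not the rule TensorFlow Decision Forests actually applies; it only shows why integer identifiers can be mistaken for quantities.

```python
def is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def infer_semantic(values):
    """Hypothetical rule: all-numeric columns are NUMERICAL, the rest CATEGORICAL."""
    return "NUMERICAL" if all(is_number(v) for v in values) else "CATEGORICAL"


# Industry codes are identifiers, not quantities, but they are stored as integers.
industry_codes = [33, 4, 33, 12]
print(infer_semantic(industry_codes))                    # NUMERICAL
print(infer_semantic([str(c) for c in industry_codes]))  # CATEGORICAL
```

Casting such columns to strings, or declaring the semantic explicitly, avoids treating code 33 as "larger than" code 4.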
def create_gbt_model():
# See all the model parameters in https://tensorflow.dev.org.tw/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel
gbt_model = tfdf.keras.GradientBoostedTreesModel(
return gbt_model
gbt_model = create_gbt_model()
run_experiment(gbt_model, train_data, test_data)
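Positive probability: the probability of the target label being positive given the feature value, computed as positive_frequency / (positive_frequency + negative_frequency + correction). The correction term is added to keep the division stable for rare categorical values; its default value is 1.0. Note that target encoding is effective with models that cannot automatically learn dense representations of categorical features, such as decision forests or kernel methods. If neural network models are used, it is recommended to encode categorical features as embeddings.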
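The positive-probability formula can be worked through in plain Python. This is a minimal sketch under stated assumptions: the `target_encode` helper and the `pairs` data are made up for illustration and are separate from the Keras layer used in this example.

```python
from collections import Counter


def target_encode(pairs, correction=1.0):
    """Return {value: (pos_count, neg_count, pos_probability)} for (value, label) pairs."""
    positive = Counter(value for value, label in pairs if label == 1)
    negative = Counter(value for value, label in pairs if label == 0)
    stats = {}
    for value in sorted(set(positive) | set(negative)):
        pos, neg = positive[value], negative[value]
        stats[value] = (pos, neg, pos / (pos + neg + correction))
    return stats


# Made-up data: "common" is seen four times, "rare" only once.
pairs = [("common", 1), ("common", 1), ("common", 0), ("common", 1), ("rare", 1)]
stats = target_encode(pairs)
print(stats["common"])  # (3, 1, 0.6)
print(stats["rare"])    # (1, 0, 0.5)
```

Without the correction, the single `rare` observation would be encoded with probability 1.0; with it, the estimate is less extreme.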
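For simplicity, we assume that the inputs to the adapt and call methods have the expected data types and shapes. It is recommended to pass the vocabulary_size of the categorical feature to the BinaryTargetEncoding constructor. If not specified, it is computed when adapt() is executed.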
class BinaryTargetEncoding(layers.Layer):
    def __init__(self, vocabulary_size=None, correction=1.0, **kwargs):
        super().__init__(**kwargs)
        self.vocabulary_size = vocabulary_size
        self.correction = correction
        # Set by adapt(); call() checks this before looking up statistics.
        self.target_encoding_statistics = None

    def adapt(self, data):
        # data is expected to be an integer numpy array or Tensor of shape [num_examples, 2].
        # It contains the feature values for a given feature in the dataset, and the target values.

        # Convert the data to a tensor.
        data = tf.convert_to_tensor(data)
        # Separate the feature values and target values.
        feature_values = tf.cast(data[:, 0], tf.dtypes.int32)
        target_values = tf.cast(data[:, 1], tf.dtypes.bool)

        # Compute the vocabulary_size if not specified.
        if self.vocabulary_size is None:
            self.vocabulary_size = tf.unique(feature_values).y.shape[0]

        # Filter the data where the target label is positive.
        positive_indices = tf.where(condition=target_values)
        positive_feature_values = tf.gather_nd(
            params=feature_values, indices=positive_indices
        )
        # Compute how many times each feature value occurred with a positive target label.
        positive_frequency = tf.math.unsorted_segment_sum(
            data=tf.ones(
                shape=(positive_feature_values.shape[0], 1), dtype=tf.dtypes.float64
            ),
            segment_ids=positive_feature_values,
            num_segments=self.vocabulary_size,
        )

        # Filter the data where the target label is negative.
        negative_indices = tf.where(condition=tf.math.logical_not(target_values))
        negative_feature_values = tf.gather_nd(
            params=feature_values, indices=negative_indices
        )
        # Compute how many times each feature value occurred with a negative target label.
        negative_frequency = tf.math.unsorted_segment_sum(
            data=tf.ones(
                shape=(negative_feature_values.shape[0], 1), dtype=tf.dtypes.float64
            ),
            segment_ids=negative_feature_values,
            num_segments=self.vocabulary_size,
        )
        # Compute positive probability for the input feature values.
        positive_probability = positive_frequency / (
            positive_frequency + negative_frequency + self.correction
        )
        # Concatenate the computed statistics for target encoding.
        target_encoding_statistics = tf.cast(
            tf.concat(
                [positive_frequency, negative_frequency, positive_probability], axis=1
            ),
            dtype=tf.dtypes.float32,
        )
        self.target_encoding_statistics = tf.constant(target_encoding_statistics)

    def call(self, inputs):
        # inputs is expected to be an integer numpy array or Tensor of shape [num_examples, 1].
        # It contains the feature values for a given feature in the dataset.

        # Raise an error if the target encoding statistics are not computed.
        if self.target_encoding_statistics is None:
            raise ValueError(
                "You need to call the adapt method to compute target encoding statistics."
            )

        # Convert the inputs to a tensor.
        inputs = tf.convert_to_tensor(inputs)
        # Cast the inputs to an int64 tensor.
        inputs = tf.cast(inputs, tf.dtypes.int64)
        # Look up target encoding statistics for the input feature values.
        target_encoding_statistics = tf.cast(
            tf.gather_nd(self.target_encoding_statistics, inputs),
            dtype=tf.dtypes.float32,
        )
        return target_encoding_statistics
data = tf.constant(
    [
        [0, 1],
        [2, 0],
        [0, 1],
        [1, 1],
        [1, 1],
        [2, 0],
        [1, 0],
        [0, 1],
        [2, 1],
        [1, 0],
        [0, 1],
        [2, 0],
        [0, 1],
        [1, 1],
        [1, 1],
        [2, 0],
        [1, 0],
        [0, 1],
        [2, 0],
    ]
)

binary_target_encoder = BinaryTargetEncoding()
binary_target_encoder.adapt(data)
print(binary_target_encoder([[0], [1], [2]]))
tf.Tensor(
[[6.         0.         0.85714287]
 [4.         3.         0.5       ]
 [1.         5.         0.14285715]], shape=(3, 3), dtype=float32)
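The counts printed above can be cross-checked without TensorFlow. The sketch below is a plain-Python stand-in for a segment sum (the `unsorted_segment_sum` function here is written for this example, not the TensorFlow op) applied to the same 19 toy rows, with the default correction of 1.0.

```python
def unsorted_segment_sum(values, segment_ids, num_segments):
    """Sum `values` into `num_segments` buckets selected by `segment_ids`."""
    totals = [0.0] * num_segments
    for value, segment_id in zip(values, segment_ids):
        totals[segment_id] += value
    return totals


# The 19 (feature_value, target) rows of the toy example.
rows = [
    (0, 1), (2, 0), (0, 1), (1, 1), (1, 1), (2, 0), (1, 0), (0, 1), (2, 1), (1, 0),
    (0, 1), (2, 0), (0, 1), (1, 1), (1, 1), (2, 0), (1, 0), (0, 1), (2, 0),
]
positive_ids = [value for value, target in rows if target == 1]
negative_ids = [value for value, target in rows if target == 0]

# Summing a 1.0 per row into the bucket of its feature value counts occurrences.
positive_frequency = unsorted_segment_sum([1.0] * len(positive_ids), positive_ids, 3)
negative_frequency = unsorted_segment_sum([1.0] * len(negative_ids), negative_ids, 3)
positive_probability = [
    p / (p + n + 1.0) for p, n in zip(positive_frequency, negative_frequency)
]
print(positive_frequency)  # [6.0, 4.0, 1.0]
print(negative_frequency)  # [0.0, 3.0, 5.0]
```

The resulting probabilities are 6/7, 4/8, and 1/7, which match the third column of the printed tensor.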
def create_model_inputs():
    inputs = {}

    for feature_name in NUMERIC_FEATURE_NAMES:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(), dtype=tf.float32
        )

    for feature_name in CATEGORICAL_FEATURE_NAMES:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(), dtype=tf.string
        )

    return inputs
def create_target_encoder():
    inputs = create_model_inputs()
    target_values = train_data[[TARGET_COLUMN_NAME]].to_numpy()
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            # Get the vocabulary of the categorical feature.
            vocabulary = sorted(
                [str(value) for value in list(train_data[feature_name].unique())]
            )
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token nor expecting any out of vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
            )
            # Convert the string input values into integer indices.
            value_indices = lookup(inputs[feature_name])
            # Prepare the data to adapt the target encoding.
            print("### Adapting target encoding for:", feature_name)
            feature_values = train_data[[feature_name]].to_numpy().astype(str)
            feature_value_indices = lookup(feature_values)
            data = tf.concat([feature_value_indices, target_values], axis=1)
            feature_encoder = BinaryTargetEncoding()
            feature_encoder.adapt(data)
            # Convert the feature value indices to target encoding representations.
            encoded_feature = feature_encoder(tf.expand_dims(value_indices, -1))
        else:
            # Expand the dimensions of the numerical input feature and use it as-is.
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
        # Add the encoded feature to the list.
        encoded_features.append(encoded_feature)
    # Concatenate all the encoded features.
    encoded_features = tf.concat(encoded_features, axis=1)
    # Create and return a Keras model with encoded features as outputs.
    return keras.Model(inputs=inputs, outputs=encoded_features)
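The lookup step used by the encoder can be pictured as a dictionary from each vocabulary string to its position in the sorted vocabulary. The sketch below is a plain-Python approximation with no mask token and no out-of-vocabulary slot; `build_lookup` and the sample column are assumptions made for illustration, not part of Keras.

```python
def build_lookup(values):
    """Map each distinct string to its index in the sorted vocabulary."""
    vocabulary = sorted({str(value) for value in values})
    return {token: index for index, token in enumerate(vocabulary)}


# Made-up column values; the real vocabulary comes from the training DataFrame.
column = [" Male", " Female", " Female", " Male"]
lookup = build_lookup(column)
indices = [lookup[value] for value in column]
print(lookup)   # {' Female': 0, ' Male': 1}
print(indices)  # [1, 0, 0, 1]
```

In this sketch an unseen value raises a `KeyError`. How the Keras layer treats unseen values when no out-of-vocabulary index is reserved may differ between versions, so the vocabulary should cover every value the model will see.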
def create_gbt_with_preprocessor(preprocessor):
gbt_model = tfdf.keras.GradientBoostedTreesModel(
return gbt_model
gbt_model = create_gbt_with_preprocessor(create_target_encoder())
run_experiment(gbt_model, train_data, test_data)
### Adapting target encoding for: class_of_worker
### Adapting target encoding for: detailed_industry_recode
### Adapting target encoding for: detailed_occupation_recode
### Adapting target encoding for: education
### Adapting target encoding for: enroll_in_edu_inst_last_wk
### Adapting target encoding for: marital_stat
### Adapting target encoding for: major_industry_code
### Adapting target encoding for: major_occupation_code
### Adapting target encoding for: race
### Adapting target encoding for: hispanic_origin
### Adapting target encoding for: sex
### Adapting target encoding for: member_of_a_labor_union
### Adapting target encoding for: reason_for_unemployment
### Adapting target encoding for: full_or_part_time_employment_stat
### Adapting target encoding for: tax_filer_stat
### Adapting target encoding for: region_of_previous_residence
### Adapting target encoding for: state_of_previous_residence
### Adapting target encoding for: detailed_household_and_family_stat
### Adapting target encoding for: detailed_household_summary_in_household
### Adapting target encoding for: migration_code-change_in_msa
### Adapting target encoding for: migration_code-change_in_reg
### Adapting target encoding for: migration_code-move_within_reg
### Adapting target encoding for: live_in_this_house_1_year_ago
### Adapting target encoding for: migration_prev_res_in_sunbelt
### Adapting target encoding for: family_members_under_18
### Adapting target encoding for: country_of_birth_father
### Adapting target encoding for: country_of_birth_mother
### Adapting target encoding for: country_of_birth_self
### Adapting target encoding for: citizenship
### Adapting target encoding for: own_business_or_self_employed
### Adapting target encoding for: fill_inc_questionnaire_for_veteran's_admin
### Adapting target encoding for: veterans_benefits
### Adapting target encoding for: year
Use /tmp/tmpj_0h78ld as temporary training directory
Starting reading the dataset
198/200 [============================>.] - ETA: 0s
Dataset read in 0:00:06.793717
Training model
Model trained in 0:04:32.752691
Compiling model
200/200 [==============================] - 280s 1s/step
Test accuracy: 95.81%
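We train these embeddings via backpropagation in a simple NN model. After the embedding encoder is trained, we use it as a preprocessor for the input features of a Gradient Boosted Trees model.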
def create_embedding_encoder(size=None):
    inputs = create_model_inputs()
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            # Get the vocabulary of the categorical feature.
            vocabulary = sorted(
                [str(value) for value in list(train_data[feature_name].unique())]
            )
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token nor expecting any out of vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
            )
            # Convert the string input values into integer indices.
            value_index = lookup(inputs[feature_name])
            # Create an embedding layer whose size is the square root of the vocabulary size.
            vocabulary_size = len(vocabulary)
            embedding_size = int(math.sqrt(vocabulary_size))
            feature_encoder = layers.Embedding(
                input_dim=vocabulary_size, output_dim=embedding_size
            )
            # Convert the index values to embedding representations.
            encoded_feature = feature_encoder(value_index)
        else:
            # Expand the dimensions of the numerical input feature and use it as-is.
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
        # Add the encoded feature to the list.
        encoded_features.append(encoded_feature)
    # Concatenate all the encoded features.
    encoded_features = layers.concatenate(encoded_features, axis=1)
    # Apply dropout.
    encoded_features = layers.Dropout(rate=0.25)(encoded_features)
    # Apply a non-linear projection.
    encoded_features = layers.Dense(
        units=size if size else encoded_features.shape[-1], activation="gelu"
    )(encoded_features)
    # Create and return a Keras model with encoded features as outputs.
    return keras.Model(inputs=inputs, outputs=encoded_features)
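The embedding width in the encoder is the integer part of the square root of the vocabulary size. As a quick illustration of that heuristic, the sketch below applies it to a few vocabulary sizes; the sizes are hypothetical values chosen for the example, not measured from the dataset.

```python
import math

# Hypothetical vocabulary sizes, chosen only for illustration.
vocabulary_sizes = {"sex": 2, "education": 17, "state_of_previous_residence": 51}

embedding_sizes = {
    name: int(math.sqrt(size)) for name, size in vocabulary_sizes.items()
}
print(embedding_sizes)
# {'sex': 1, 'education': 4, 'state_of_previous_residence': 7}

# Width of the concatenated categorical embeddings before the Dense projection.
print(sum(embedding_sizes.values()))  # 12
```

The heuristic keeps low-cardinality features compact while giving larger vocabularies a few more dimensions.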
def create_nn_model(encoder):
    inputs = create_model_inputs()
    embeddings = encoder(inputs)
    output = layers.Dense(units=1, activation="sigmoid")(embeddings)

    nn_model = keras.Model(inputs=inputs, outputs=output)
    return nn_model
embedding_encoder = create_embedding_encoder(size=64)
Epoch 1/5
200/200 [==============================] - 10s 27ms/step - loss: 8303.1455 - accuracy: 0.9193
Epoch 2/5
200/200 [==============================] - 5s 27ms/step - loss: 1019.4900 - accuracy: 0.9371
Epoch 3/5
200/200 [==============================] - 5s 27ms/step - loss: 612.2844 - accuracy: 0.9416
Epoch 4/5
200/200 [==============================] - 5s 27ms/step - loss: 858.9774 - accuracy: 0.9397
Epoch 5/5
200/200 [==============================] - 5s 26ms/step - loss: 842.3922 - accuracy: 0.9421
Test accuracy: 95.0%
gbt_model = create_gbt_with_preprocessor(embedding_encoder)
run_experiment(gbt_model, train_data, test_data)
Use /tmp/tmpao5o88p6 as temporary training directory
Starting reading the dataset
199/200 [============================>.] - ETA: 0s
Dataset read in 0:00:06.722677
Training model
Model trained in 0:05:18.350298
Compiling model
200/200 [==============================] - 325s 2s/step
Test accuracy: 95.82%
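TensorFlow Decision Forests provide powerful models, especially for structured data. In our experiments, the Gradient Boosted Trees model achieved 95.79% test accuracy. With target encoding of the categorical features, the same model achieved 95.81% test accuracy. With pretrained embeddings used as inputs to the Gradient Boosted Trees model, we achieved 95.82% test accuracy.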
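Decision Forests can be used with neural networks, either by 1) using neural networks to learn useful representations of the input data and then using Decision Forests for the supervised learning task, or by 2) creating an ensemble of both Decision Forests and neural network models.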
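One simple way to build such an ensemble is to average the positive-class probabilities of the two models. The sketch below assumes each model outputs one probability per example; the `ensemble_probabilities` helper, the equal default weight, and the prediction values are made up for illustration and are not part of the experiments above.

```python
def ensemble_probabilities(forest_probs, nn_probs, forest_weight=0.5):
    """Weighted average of two models' positive-class probabilities."""
    return [
        forest_weight * f + (1.0 - forest_weight) * n
        for f, n in zip(forest_probs, nn_probs)
    ]


# Made-up predictions for three examples.
forest_probs = [0.9, 0.2, 0.6]
nn_probs = [0.7, 0.4, 0.2]

averaged = ensemble_probabilities(forest_probs, nn_probs)
# Threshold the averaged probability at 0.5 to get class labels.
labels = [int(p >= 0.5) for p in averaged]
print(labels)  # [1, 0, 0]
```

The weight could be tuned on a validation set if one model is consistently more reliable than the other.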
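Note that TensorFlow Decision Forests does not (yet) support hardware accelerators. All training and inference is done on the CPU. In addition, Decision Forests require a finite dataset that fits in memory for their training procedure. However, there are diminishing returns from increasing the size of the dataset, and Decision Forests algorithms arguably need fewer examples to converge than large neural network models.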