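Distributed hyperparameter tuning

Authors: Tom O'Malley, Haifeng Jin
Date created: 2019/10/24
Last modified: 2021/06/02
Description: Tuning the hyperparameters of a model with multiple GPUs and multiple machines.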

!pip install keras-tuner -q

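Introduction

KerasTuner makes it easy to run a distributed hyperparameter search. You can scale from a single-threaded local run to dozens or hundreds of workers running in parallel without changing your code. Distributed KerasTuner uses a chief-worker model. The chief runs a service to which the workers report results and from which they request the hyperparameters to try next. The chief should run on a single-threaded CPU instance (or, alternatively, as a separate process on one of the workers).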

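Configuring distributed mode

Configuring distributed mode for KerasTuner only requires setting three environment variables:

KERASTUNER_TUNER_ID: This should be set to "chief" for the chief process. Each of the other workers should be given a unique ID (by convention, "tuner0", "tuner1", and so on).

KERASTUNER_ORACLE_IP: The IP address or hostname that the chief service should run on. All workers must be able to resolve and reach this address.

KERASTUNER_ORACLE_PORT: The port that the chief service should run on. It can be chosen freely, but it must be a port that the workers can reach. Instances communicate via the gRPC protocol.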
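As a rough illustration of how the three variables fit together, the sketch below inspects them and reports which role a process would take. The helper `describe_role` is hypothetical and only mirrors the description above; it is not part of KerasTuner and does not reproduce the library's internal detection logic.

```python
import os


def describe_role(env=None):
    """Return (role, detail) for a process, based on the three variables.

    Hypothetical helper for illustration only; KerasTuner performs its own
    checks internally.
    """
    env = os.environ if env is None else env
    tuner_id = env.get("KERASTUNER_TUNER_ID")
    oracle_ip = env.get("KERASTUNER_ORACLE_IP")
    oracle_port = env.get("KERASTUNER_ORACLE_PORT")

    # Assumption: treat the process as a plain local run unless all three are set.
    if not (tuner_id and oracle_ip and oracle_port):
        return "local", "distributed variables not all set"

    role = "chief" if tuner_id == "chief" else "worker"
    # int() raises ValueError early if the port is not numeric.
    return role, f"{tuner_id} -> {oracle_ip}:{int(oracle_port)}"


if __name__ == "__main__":
    print(describe_role())
```

Calling such a check at the top of a tuning script can make a misconfigured worker (for example, a missing port) easier to spot before the search starts.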

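The same code can be run on all workers. Additional considerations for distributed mode are:

All workers should have access to a centralized file system to which they can write their results.
All workers should be able to access the training and validation data needed for tuning.
To support fault tolerance, overwrite should be kept as False in Tuner.__init__ (False is the default).

Example bash script for the chief service (sample code for run_tuning.py is at the bottom of the page):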

export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8000"
python run_tuning.py

Example bash script for a worker:

export KERASTUNER_TUNER_ID="tuner0"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8000"
python run_tuning.py
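For local experiments, the two scripts above could also be driven from Python. The following is a minimal sketch, assuming run_tuning.py sits in the current directory; the helpers `tuner_ids`, `tuner_env`, and `launch_local_cluster` are hypothetical names introduced here, not KerasTuner APIs.

```python
import os
import subprocess
import sys


def tuner_ids(num_workers):
    """IDs following the convention above: "chief", then "tuner0", "tuner1", ..."""
    return ["chief"] + [f"tuner{i}" for i in range(num_workers)]


def tuner_env(tuner_id, oracle_ip="127.0.0.1", oracle_port=8000, base_env=None):
    """Copy the base environment and add the three KerasTuner variables."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "KERASTUNER_TUNER_ID": tuner_id,
            "KERASTUNER_ORACLE_IP": oracle_ip,
            "KERASTUNER_ORACLE_PORT": str(oracle_port),
        }
    )
    return env


def launch_local_cluster(num_workers, script="run_tuning.py"):
    """Start the chief first, then the workers, all running the same script."""
    return [
        subprocess.Popen([sys.executable, script], env=tuner_env(tuner_id))
        for tuner_id in tuner_ids(num_workers)
    ]


if __name__ == "__main__":
    # Launch one chief and two workers on this machine, then wait for all of them.
    for proc in launch_local_cluster(num_workers=2):
        proc.wait()
```

On a real cluster each process would normally be started on its own machine by a scheduler, and the workers may need to wait until the chief is listening; this sketch does not handle either case.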

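Data parallelism with `tf.distribute`

KerasTuner also supports data parallelism via tf.distribute. Data parallelism and distributed tuning can be combined. For example, if you have 10 workers with 4 GPUs each, you can run 10 parallel trials, with each trial training on 4 GPUs, by using tf.distribute.MirroredStrategy. You can also run each trial on TPUs via tf.distribute.TPUStrategy. tf.distribute.MultiWorkerMirroredStrategy is not currently supported, but support for it is on the roadmap.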
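The arithmetic behind the 10-worker example can be sketched as follows. The helper `plan_resources` and the global batch size of 64 are assumptions made for illustration; the only behavior relied on is that a synchronous data-parallel strategy such as MirroredStrategy splits each global batch across its replicas.

```python
def plan_resources(num_workers, gpus_per_worker, global_batch_size):
    """Summarize how trials and batches are spread over the hardware.

    Hypothetical helper: one trial runs per worker, and each trial's global
    batch is divided evenly across that worker's GPUs.
    """
    if global_batch_size % gpus_per_worker != 0:
        raise ValueError("global_batch_size should be divisible by gpus_per_worker")
    return {
        "parallel_trials": num_workers,
        "total_gpus": num_workers * gpus_per_worker,
        "per_replica_batch_size": global_batch_size // gpus_per_worker,
    }


if __name__ == "__main__":
    # 10 workers x 4 GPUs, with an assumed global batch size of 64.
    print(plan_resources(num_workers=10, gpus_per_worker=4, global_batch_size=64))
```

Under these assumptions, adding workers increases how many trials run at once, while adding GPUs per worker shortens each individual trial.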

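Example code

When the environment variables described above are set, the example below runs distributed tuning and also uses data parallelism within each trial via tf.distribute. The example loads MNIST with keras.datasets.mnist and uses Hyperband for the hyperparameter search.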

import keras
import keras_tuner
import tensorflow as tf
import numpy as np


def build_model(hp):
    """Builds a convolutional model."""
    inputs = keras.Input(shape=(28, 28, 1))
    x = inputs
    for i in range(hp.Int("conv_layers", 1, 3, default=3)):
        x = keras.layers.Conv2D(
            filters=hp.Int("filters_" + str(i), 4, 32, step=4, default=8),
            kernel_size=hp.Int("kernel_size_" + str(i), 3, 5),
            activation="relu",
            padding="same",
        )(x)

        if hp.Choice("pooling" + str(i), ["max", "avg"]) == "max":
            x = keras.layers.MaxPooling2D()(x)
        else:
            x = keras.layers.AveragePooling2D()(x)

        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.ReLU()(x)

    if hp.Choice("global_pooling", ["max", "avg"]) == "max":
        x = keras.layers.GlobalMaxPooling2D()(x)
    else:
        x = keras.layers.GlobalAveragePooling2D()(x)
    outputs = keras.layers.Dense(10, activation="softmax")(x)

    model = keras.Model(inputs, outputs)

    optimizer = hp.Choice("optimizer", ["adam", "sgd"])
    model.compile(
        optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"]
    )
    return model


tuner = keras_tuner.Hyperband(
    hypermodel=build_model,
    objective="val_accuracy",
    max_epochs=2,
    factor=3,
    hyperband_iterations=1,
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="results_dir",
    project_name="mnist",
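    # Demo setting: start from scratch on every run. For fault-tolerant
    # distributed tuning, leave this as False (the default).
    overwrite=True,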
)

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Add the channel dimension, scale pixels to [0, 1], and keep a small subset
# of the data so the demo runs quickly.
x_train = (x_train.reshape(x_train.shape + (1,)) / 255.0)[:1000]
y_train = y_train.astype(np.int64)[:1000]
x_test = (x_test.reshape(x_test.shape + (1,)) / 255.0)[:100]
y_test = y_test.astype(np.int64)[:100]

tuner.search(
    x_train,
    y_train,
    steps_per_epoch=600,
    validation_data=(x_test, y_test),
    validation_steps=100,
    callbacks=[keras.callbacks.EarlyStopping("val_accuracy")],
)

Trial 2 Complete [00h 00m 18s]
val_accuracy: 0.07000000029802322

Best val_accuracy So Far: 0.07000000029802322
Total elapsed time: 00h 00m 26s
