Video Classification with a CNN-RNN Architecture

Author: Sayak Paul
Date created: 2021/05/28
Last modified: 2023/12/08
Description: Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset.

ⓘ This example uses Keras 3

View in Colab • GitHub source

This example demonstrates video classification, an important use-case with applications in recommendations, security, and so on. We will be using the UCF101 dataset to build our video classifier. The dataset consists of videos categorized into different actions, like cricket shot, punching, biking, etc. This dataset is commonly used to build action recognizers, which are an application of video classification.

A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information. To model both of these aspects, we use a hybrid architecture that consists of convolutions (for spatial processing) as well as recurrent layers (for temporal processing). Specifically, we'll use a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers. This kind of hybrid architecture is popularly known as a CNN-RNN.
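To make the idea concrete, here is a minimal, schematic sketch of how a sequence of per-frame CNN features can be fed to a GRU. This is not the model actually used in this example (that one is built step by step below), and the shapes are illustrative only.

import keras

# Illustrative shapes; the actual values used in this example are defined
# in the hyperparameter section below.
num_frames, feature_dim, num_classes = 20, 2048, 5

# A sequence of per-frame feature vectors, e.g. produced by a CNN backbone.
frame_features = keras.Input((num_frames, feature_dim))
x = keras.layers.GRU(16)(frame_features)  # temporal modelling
outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
sketch_model = keras.Model(frame_features, outputs)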

This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be installed using the following command:

!pip install -q git+https://github.com/tensorflow/docs

Data collection

In order to keep the runtime of this example relatively short, we will be using a subsampled version of the original UCF101 dataset. You can refer to this notebook to learn how the subsampling was done.

!wget -q https://github.com/sayakpaul/Action-Recognition-in-TensorFlow/releases/download/v1.0.0/ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

Setup

import os

import keras
from imutils import paths

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import imageio
import cv2
from IPython.display import Image

Define hyperparameters

IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

Data preparation

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)
Total videos for training: 594
Total videos for testing: 224
     video_name                  tag
492 v_TennisSwing_g10_c03.avi TennisSwing
536 v_TennisSwing_g16_c05.avi TennisSwing
413 v_ShavingBeard_g16_c05.avi ShavingBeard
268 v_Punch_g12_c04.avi Punch
288 v_Punch_g15_c03.avi Punch
30 v_CricketShot_g12_c03.avi CricketShot
449 v_ShavingBeard_g21_c07.avi ShavingBeard
524 v_TennisSwing_g14_c07.avi TennisSwing
145 v_PlayingCello_g12_c01.avi PlayingCello
566 v_TennisSwing_g21_c03.avi TennisSwing

One of the many challenges of training video classifiers is figuring out a way to feed the videos to a network. This blog post discusses five such methods. Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor. But the number of frames may differ from video to video, which would prevent us from stacking them into batches (unless we use padding). As an alternative, we can save video frames at a fixed interval until a maximum frame count is reached. In this example we will do the following:

  1. Capture the frames of a video.
  2. Extract frames from the videos until a maximum frame count is reached.
  3. In the case where a video's frame count is lesser than the maximum frame count, we will pad the video with zeros.

Note that this workflow is identical to problems involving text sequences. Videos of the UCF101 dataset are known to not contain extreme variations in objects and actions across frames. Because of this, it may be okay to only consider a few frames for the learning task. But this approach may not generalize well to other video classification problems. We will be using OpenCV's VideoCapture() method to read frames from videos.

# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # BGR -> RGB
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)

We can use a pre-trained network to extract meaningful features from the extracted frames. The Keras Applications module provides a number of state-of-the-art models pre-trained on the ImageNet-1k dataset. We will be using the InceptionV3 model for this purpose.

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

The labels of the videos are strings. Neural networks do not understand string values, so they must be converted to some numerical form before being fed to the model. Here we will use the StringLookup layer to encode the class labels as integers.

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())
['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']

Finally, we can put all the pieces together to create our data processing utility.

def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :], verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")
Frame features in train set: (594, 20, 2048)
Frame masks in train set: (594, 20)

The above code block will take about 20 minutes to execute, depending on the machine it's being run on.
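Since feature extraction dominates this runtime, you may want to cache the computed features and masks so they are not recomputed on every run. A minimal sketch, assuming NumPy's .npy format; the file names are placeholders and not part of the original workflow.

# Optional: cache the extracted features, masks, and labels to disk.
np.save("train_features.npy", train_data[0])
np.save("train_masks.npy", train_data[1])
np.save("train_labels.npy", train_labels)

# On a later run, reload them with:
# train_data = (np.load("train_features.npy"), np.load("train_masks.npy"))
# train_labels = np.load("train_labels.npy")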


The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like GRU.

# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier/ckpt.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()
Epoch 1/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.3058 - loss: 1.5597 
Epoch 1: val_loss improved from inf to 1.78077, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 2s 36ms/step - accuracy: 0.3127 - loss: 1.5531 - val_accuracy: 0.1397 - val_loss: 1.7808
Epoch 2/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5216 - loss: 1.2704
Epoch 2: val_loss improved from 1.78077 to 1.78026, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.5226 - loss: 1.2684 - val_accuracy: 0.1788 - val_loss: 1.7803
Epoch 3/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6189 - loss: 1.1656
Epoch 3: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6174 - loss: 1.1651 - val_accuracy: 0.2849 - val_loss: 1.8322
Epoch 4/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6518 - loss: 1.0645
Epoch 4: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6515 - loss: 1.0647 - val_accuracy: 0.2793 - val_loss: 2.0419
Epoch 5/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6833 - loss: 0.9976
Epoch 5: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6843 - loss: 0.9965 - val_accuracy: 0.3073 - val_loss: 1.9077
Epoch 6/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7229 - loss: 0.9312
Epoch 6: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7241 - loss: 0.9305 - val_accuracy: 0.3017 - val_loss: 2.1513
Epoch 7/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8023 - loss: 0.9132
Epoch 7: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8035 - loss: 0.9093 - val_accuracy: 0.3184 - val_loss: 2.1705
Epoch 8/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8127 - loss: 0.8380
Epoch 8: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8128 - loss: 0.8356 - val_accuracy: 0.3296 - val_loss: 2.2043
Epoch 9/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8494 - loss: 0.7641
Epoch 9: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8494 - loss: 0.7622 - val_accuracy: 0.3017 - val_loss: 2.3734
Epoch 10/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8634 - loss: 0.6883
Epoch 10: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8649 - loss: 0.6882 - val_accuracy: 0.3240 - val_loss: 2.4410
 7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7816 - loss: 1.0624 
Test accuracy: 56.7%

Note: To keep the runtime of this example relatively short, we only used a few training examples. This number of training examples is low with respect to the sequence model being used, which has 99,909 trainable parameters. You are encouraged to sample more data from the UCF101 dataset using the notebook mentioned above and train the same model.
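If you want to verify the trainable parameter count mentioned above, a quick sanity check (not part of the original example) is to sum the sizes of the model's trainable weights:

# Count the sequence model's trainable parameters.
trainable_params = int(
    sum(np.prod(w.shape) for w in sequence_model.trainable_weights)
)
print(f"Trainable parameters: {trainable_params}")
# `sequence_model.summary()` prints the same information per layer.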


Inference

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(
        shape=(
            1,
            MAX_SEQ_LENGTH,
        ),
        dtype="bool",
    )
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, duration=100)
    return Image("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])
Test video path: v_TennisSwing_g03_c01.avi
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 166ms/step
  CricketShot: 46.99%
  ShavingBeard: 18.83%
  TennisSwing: 14.65%
  Punch: 12.41%
  PlayingCello:  7.12%

<IPython.core.display.Image object>

Next steps

  • In this example, we made use of transfer learning to extract meaningful features from video frames. You could also fine-tune the pre-trained network to see how that affects the end results.
  • For speed-accuracy trade-offs, you can try out other models present inside keras.applications.
  • Try different values of MAX_SEQ_LENGTH to observe how that affects the performance.
  • Train on a larger number of classes and see if you are able to get good performance.
  • Following this tutorial, try a pre-trained action recognition model from DeepMind.
  • Rolling averaging can be a useful technique for video classification, and it can be combined with a standard image classification model to infer on videos. This tutorial will help you understand how to use rolling averaging with an image classifier; a minimal sketch of the idea follows this list.
  • When there are variations in between the frames of a video, not all the frames might be equally important for deciding its category. In those situations, putting a self-attention layer in the sequence model will likely yield better results.
  • Following this book chapter, you can implement Transformers-based models for processing videos.
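As a starting point for the rolling-averaging idea mentioned above, here is a minimal sketch. It assumes you already have an array of per-frame class probabilities (for example, the softmax outputs of an image classifier applied frame by frame); such an array is not computed anywhere in this example.

from collections import deque

import numpy as np


def rolling_average_predictions(per_frame_probs, window=10):
    """Smooth per-frame class probabilities with a rolling average.

    `per_frame_probs` has shape (num_frames, num_classes).
    """
    queue = deque(maxlen=window)
    smoothed = []
    for probs in per_frame_probs:
        queue.append(probs)
        smoothed.append(np.mean(queue, axis=0))
    return np.array(smoothed)


# The video-level prediction can then be read off the final smoothed step:
# video_label = np.argmax(rolling_average_predictions(per_frame_probs)[-1])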