Creating TFRecords

Author: Dimitre Oliveira
Date created: 2021/02/27
Last modified: 2023/12/20
Description: Converting data to the TFRecord format.

ⓘ This example uses Keras 3

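Introduction

The TFRecord format is a simple format for storing a sequence of binary records. Converting your data into TFRecord has many advantages, such as:

More efficient storage: TFRecord data can take up less space than the original data, and it can be partitioned into multiple files.
Fast I/O: the TFRecord format can be read with parallel I/O operations, which is useful for TPUs or multiple hosts.
Self-contained files: TFRecord data can be read from a single source. By contrast, the COCO2017 dataset originally stores its data in two folders ("images" and "annotations").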
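
To make the "sequence of binary records" idea concrete, here is a minimal sketch that writes three raw byte strings to a TFRecord file and reads them back in order. The record contents, the file name demo.tfrec, and the temporary directory are assumptions made for illustration; they are not part of the original example.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical records and file name, used only for this illustration.
records = [b"first", b"second", b"third"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tfrec")

# Each call to write() appends one binary record to the file.
with tf.io.TFRecordWriter(demo_path) as writer:
    for record in records:
        writer.write(record)

# TFRecordDataset yields the records back as scalar string tensors, in order.
read_back = [item.numpy() for item in tf.data.TFRecordDataset(demo_path)]
print(read_back)
```

The format itself does not interpret the bytes; giving them structure is the job of tf.train.Example, which the rest of this example uses.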

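An important use case of the TFRecord data format is training on TPUs. First, TPUs are fast enough to benefit from optimized I/O operations. In addition, TPUs require data to be stored remotely (e.g. on Google Cloud Storage), and the TFRecord format makes it easier to load the data without batch-downloading it.

Performance can be improved further if you also use the TFRecord format together with the tf.data API.

In this example you will learn how to convert data of different types (image, text, and numeric) into TFRecord.

Reference

TFRecord and tf.train.Example

Dependencies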

import os

os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import json
import pprint
import tensorflow as tf
import matplotlib.pyplot as plt

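Download the COCO2017 dataset

We will be using the COCO2017 dataset, because it has many different types of features, including images, floating point data, and lists. It will serve as a good example of how to encode different features into the TFRecord format.

This dataset has two sets of fields: images and annotation meta-data.

The images are a collection of JPG files, and the meta-data are stored in a JSON file which, according to the official site, contains the following properties: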

id: int,
image_id: int,
category_id: int,
segmentation: RLE or [polygon], object segmentation mask
bbox: [x,y,width,height], object bounding box coordinates
area: float, area of the segmented region in pixels (not the bounding box area)
iscrowd: 0 or 1, is single object or a collection
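
Two points in this schema are easy to miss: bbox is stored as a corner plus a size rather than as two corners, and several annotations can share one image_id. The following sketch illustrates both with pure Python. The three records and the helper bbox_to_corners are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical annotation records that follow the schema listed above
# (only the fields needed for this illustration are included).
sample_annotations = [
    {"id": 1, "image_id": 10, "category_id": 18, "bbox": [10.0, 20.0, 30.0, 40.0]},
    {"id": 2, "image_id": 10, "category_id": 72, "bbox": [5.0, 5.0, 10.0, 10.0]},
    {"id": 3, "image_id": 11, "category_id": 18, "bbox": [0.0, 0.0, 50.0, 25.0]},
]


def bbox_to_corners(bbox):
    """Converts [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, width, height = bbox
    return [x, y, x + width, y + height]


# Several annotations can point at the same image, so group them by image_id.
annotations_by_image = defaultdict(list)
for annotation in sample_annotations:
    annotations_by_image[annotation["image_id"]].append(annotation)

corners = bbox_to_corners(sample_annotations[0]["bbox"])
print(corners)  # [10.0, 20.0, 40.0, 60.0]
print(len(sample_annotations), "annotations across", len(annotations_by_image), "images")
```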

root_dir = "datasets"
tfrecords_dir = "tfrecords"
images_dir = os.path.join(root_dir, "val2017")
annotations_dir = os.path.join(root_dir, "annotations")
annotation_file = os.path.join(annotations_dir, "instances_val2017.json")
images_url = "http://images.cocodataset.org/zips/val2017.zip"
annotations_url = (
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
)

# Download image files
if not os.path.exists(images_dir):
    image_zip = keras.utils.get_file(
        "images.zip",
        cache_dir=os.path.abspath("."),
        origin=images_url,
        extract=True,
    )
    os.remove(image_zip)

# Download annotation files
if not os.path.exists(annotations_dir):
    annotation_zip = keras.utils.get_file(
        "captions.zip",
        cache_dir=os.path.abspath("."),
        origin=annotations_url,
        extract=True,
    )
    os.remove(annotation_zip)

print("The COCO dataset has been downloaded and extracted successfully.")

with open(annotation_file, "r") as f:
    annotations = json.load(f)["annotations"]

# Note: this counts annotations (object instances), not distinct images.
print(f"Number of images: {len(annotations)}")

Downloading data from http://images.cocodataset.org/zips/val2017.zip
 815585330/815585330 ━━━━━━━━━━━━━━━━━━━━ 79s 0us/step
Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2017.zip
 252907541/252907541 ━━━━━━━━━━━━━━━━━━━━ 5s 0us/step
The COCO dataset has been downloaded and extracted successfully.
Number of images: 36781

Contents of the COCO2017 dataset

pprint.pprint(annotations[60])

{'area': 367.89710000000014,
 'bbox': [265.67, 222.31, 26.48, 14.71],
 'category_id': 72,
 'id': 34096,
 'image_id': 525083,
 'iscrowd': 0,
 'segmentation': [[267.51,
                   222.31,
                   292.15,
                   222.31,
                   291.05,
                   237.02,
                   265.67,
                   237.02]]}

Parameters

num_samples is the number of data samples in each TFRecord file.

num_tfrecords is the total number of TFRecord files that we will create.

num_samples = 4096
num_tfrecords = len(annotations) // num_samples
if len(annotations) % num_samples:
    num_tfrecords += 1  # add one record if there are any remaining samples

if not os.path.exists(tfrecords_dir):
    os.makedirs(tfrecords_dir)  # creating TFRecords output folder
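
The shard arithmetic above can be checked in isolation. The sketch below uses a hypothetical helper, shard_sizes, together with the annotation count printed earlier (36781) and the same samples-per-file value (4096).

```python
def shard_sizes(total_samples, samples_per_shard):
    """Returns the number of samples that each shard would hold."""
    full_shards, remainder = divmod(total_samples, samples_per_shard)
    sizes = [samples_per_shard] * full_shards
    if remainder:
        sizes.append(remainder)  # the last shard holds the leftover samples
    return sizes


sizes = shard_sizes(36781, 4096)
print(len(sizes))  # 9
print(sizes[-1])  # 4013
```

Eight files hold 4096 samples each and the ninth holds the remaining 4013, which matches the integer division followed by the remainder check.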

Define TFRecords helper functions

def image_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()])
    )


def bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))


def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def float_feature_list(value):
    """Returns a list of float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


def create_example(image, path, example):
    feature = {
        "image": image_feature(image),
        "path": bytes_feature(path),
        "area": float_feature(example["area"]),
        "bbox": float_feature_list(example["bbox"]),
        "category_id": int64_feature(example["category_id"]),
        "id": int64_feature(example["id"]),
        "image_id": int64_feature(example["image_id"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse_tfrecord_fn(example):
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
        "area": tf.io.FixedLenFeature([], tf.float32),
        "bbox": tf.io.VarLenFeature(tf.float32),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "id": tf.io.FixedLenFeature([], tf.int64),
        "image_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    example["bbox"] = tf.sparse.to_dense(example["bbox"])
    return example
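
A quick way to sanity-check helpers like these is to serialize one example and parse it straight back. The sketch below is self-contained, so it builds two features inline instead of calling the functions above, and its values are illustrative. It also shows an alternative that the original does not use: parsing bbox with a fixed shape, under the assumption that every bounding box has exactly four values.

```python
import tensorflow as tf

# Values are illustrative; the field names mirror the helpers above.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "category_id": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[72])
            ),
            "bbox": tf.train.Feature(
                float_list=tf.train.FloatList(value=[265.67, 222.31, 26.48, 14.71])
            ),
        }
    )
)
serialized = example.SerializeToString()

# Variable-length parsing yields a SparseTensor that has to be densified.
var_len = tf.io.parse_single_example(
    serialized,
    {
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "bbox": tf.io.VarLenFeature(tf.float32),
    },
)
bbox_from_sparse = tf.sparse.to_dense(var_len["bbox"])

# Assumption: every bbox has exactly four values, so a fixed shape also works.
fixed_len = tf.io.parse_single_example(
    serialized,
    {
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "bbox": tf.io.FixedLenFeature([4], tf.float32),
    },
)

print(int(fixed_len["category_id"]))
print(bbox_from_sparse.shape, fixed_len["bbox"].shape)
```

Both parsing strategies return the same four values. The fixed-shape variant skips the sparse-to-dense conversion, but it raises an error if a record holds a different number of values.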

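Generate data in the TFRecord format

Let's generate the COCO2017 data in the TFRecord format. Files are named file_{number}-{count}.tfrec, where {count} is the number of samples in the file (this naming is optional, but including the sequence number and sample count in the file name makes counting easier).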

for tfrec_num in range(num_tfrecords):
    samples = annotations[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

    with tf.io.TFRecordWriter(
        tfrecords_dir + "/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        for sample in samples:
            image_path = f"{images_dir}/{sample['image_id']:012d}.jpg"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())
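
Because each file name carries its sample count, the size of the dataset can be recovered from the names alone, without opening any file. The sketch below uses two hypothetical helpers, shard_filename and count_samples, and three example names that follow the pattern used by the writer loop.

```python
import re


def shard_filename(shard_index, sample_count):
    """Builds a name with the same pattern as the writer loop above."""
    return "file_%.2i-%i.tfrec" % (shard_index, sample_count)


def count_samples(filenames):
    """Sums the sample counts embedded in names such as file_00-4096.tfrec."""
    total = 0
    for name in filenames:
        match = re.search(r"-(\d+)\.tfrec$", name)
        if match is None:
            raise ValueError(f"Unexpected file name: {name}")
        total += int(match.group(1))
    return total


names = [shard_filename(0, 4096), shard_filename(1, 4096), shard_filename(8, 4013)]
print(names[0])  # file_00-4096.tfrec
print(count_samples(names))  # 12205
```

A count obtained this way could be used, for instance, to derive a steps-per-epoch value from a batch size.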

Explore one sample from the generated TFRecord

raw_dataset = tf.data.TFRecordDataset(f"{tfrecords_dir}/file_00-{num_samples}.tfrec")
parsed_dataset = raw_dataset.map(parse_tfrecord_fn)

for features in parsed_dataset.take(1):
    for key in features.keys():
        if key != "image":
            print(f"{key}: {features[key]}")

    print(f"Image shape: {features['image'].shape}")
    plt.figure(figsize=(7, 7))
    plt.imshow(features["image"].numpy())
    plt.show()

bbox: [473.07 395.93  38.65  28.67]
area: 702.1057739257812
category_id: 18
id: 1768
image_id: 289343
path: b'datasets/val2017/000000289343.jpg'
Image shape: (640, 529, 3)

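Train a simple model using the generated TFRecords

Another advantage of TFRecord is that you can add many features to it and later use only a few of them. In this case, we are going to use only image and category_id.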
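
Selecting a subset happens at parse time: only the features named in the description passed to tf.io.parse_single_example are returned. The sketch below shows this with a small record. The record values, the file path inside it, and the name subset_description are assumptions made for illustration.

```python
import tensorflow as tf

# A record with three features; the values are illustrative only.
serialized = tf.train.Example(
    features=tf.train.Features(
        feature={
            "path": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"datasets/val2017/example.jpg"])
            ),
            "area": tf.train.Feature(float_list=tf.train.FloatList(value=[702.1])),
            "category_id": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[18])
            ),
        }
    )
).SerializeToString()

# Only the features named in the description are parsed; the rest are skipped.
subset_description = {"category_id": tf.io.FixedLenFeature([], tf.int64)}
parsed = tf.io.parse_single_example(serialized, subset_description)

print(sorted(parsed.keys()))
print(int(parsed["category_id"]))
```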

Define dataset helper functions

def prepare_sample(features):
    image = keras.ops.image.resize(features["image"], size=(224, 224))
    return image, features["category_id"]


def get_dataset(filenames, batch_size):
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
        .map(parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
        .map(prepare_sample, num_parallel_calls=AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )
    return dataset


train_filenames = tf.io.gfile.glob(f"{tfrecords_dir}/*.tfrec")
batch_size = 32
epochs = 1
steps_per_epoch = 50
AUTOTUNE = tf.data.AUTOTUNE

input_tensor = keras.layers.Input(shape=(224, 224, 3), name="image")
model = keras.applications.EfficientNetB0(
    input_tensor=input_tensor, weights=None, classes=91
)


model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)


model.fit(
    x=get_dataset(train_filenames, batch_size),
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    verbose=1,
)

 50/50 ━━━━━━━━━━━━━━━━━━━━ 146s 2s/step - loss: 3.9206 - sparse_categorical_accuracy: 0.1690

<keras.src.callbacks.history.History at 0x7f70684c27a0>

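Conclusion

This example demonstrates that, thanks to TFRecord, your data can come from a single source instead of reading images and annotations from different sources. This process can make storing and reading data simpler and more efficient. For more information, see the TFRecord and tf.train.Example tutorial.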
