What is the canonical way to split a tf.Dataset into test and validation subsets?
Problem
I am following the Tensorflow 2 tutorial on how to load images with pure Tensorflow, because it is supposed to be faster than Keras. The tutorial ends before showing how to split the resulting dataset (a tf.Dataset) into a training and a validation dataset.
I checked the reference for tf.Dataset, and it does not contain a split() method.
I tried slicing it manually, but tf.Dataset has neither a size() nor a length() method, so I don't see how to slice it myself.
I can't use the validation_split argument of Model.fit(), because I need to augment the training dataset but not the validation dataset.
Question
What is the intended way to split a tf.Dataset, or should I use a different workflow that avoids this altogether?
Example code (from the tutorial)
import os
import tensorflow as tf

# AUTOTUNE, data_dir and CLASS_NAMES are defined earlier in the tutorial.

BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224

list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'))

def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    return parts[-2] == CLASS_NAMES

def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size.
    return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])

def process_path(file_path):
    label = get_label(file_path)
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
#...
#...
I could split list_ds (the list of files) or labeled_ds (the list of images and labels), but how?
I don't think there is a canonical way (typically, the data is already split, e.g. into different directories). But here is a recipe that lets you do it dynamically:
# Caveat: cache list_ds, otherwise it will perform the directory listing twice.
ds = list_ds.cache()
# Add some indices.
ds = ds.enumerate()
# Do a roughly 70-30 split.
train_list_ds = ds.filter(lambda i, data: i % 10 < 7)
test_list_ds = ds.filter(lambda i, data: i % 10 >= 7)
# Drop indices.
train_list_ds = train_list_ds.map(lambda i, data: data)
test_list_ds = test_list_ds.map(lambda i, data: data)
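For the use case in the question (augmenting only the training data), the two resulting datasets can then feed separate pipelines. Here is a minimal sketch reusing process_path, BATCH_SIZE and AUTOTUNE from the question; augment is a hypothetical placeholder for whatever augmentation you actually need:
def augment(img, label):
    # Hypothetical augmentation step; replace with the augmentations you need.
    img = tf.image.random_flip_left_right(img)
    return img, label

# Decode both splits, but augment only the training pipeline.
train_ds = (train_list_ds
            .map(process_path, num_parallel_calls=AUTOTUNE)
            .map(augment, num_parallel_calls=AUTOTUNE)
            .batch(BATCH_SIZE))
test_ds = (test_list_ds
           .map(process_path, num_parallel_calls=AUTOTUNE)
           .batch(BATCH_SIZE))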
Based on Dan Moldovan's answer, I created a reusable function. Perhaps it will be useful to others.
def split_dataset(dataset: tf.data.Dataset, validation_data_fraction: float):
    """
    Splits a dataset of type tf.data.Dataset into a training and a validation dataset using the
    given ratio. The fraction is rounded to the nearest whole percent.

    @param dataset: the input dataset to split.
    @param validation_data_fraction: the fraction of the validation data as a float between 0 and 1.
    @return: a tuple of two tf.data.Datasets as (training, validation)
    """

    validation_data_percent = round(validation_data_fraction * 100)
    if not (0 <= validation_data_percent <= 100):
        raise ValueError("validation data fraction must be ∈ [0,1]")

    dataset = dataset.enumerate()
    # Elements whose index modulo 100 falls below the validation percentage go to validation,
    # the rest go to training.
    train_dataset = dataset.filter(lambda f, data: f % 100 >= validation_data_percent)
    validation_dataset = dataset.filter(lambda f, data: f % 100 < validation_data_percent)

    # remove enumeration
    train_dataset = train_dataset.map(lambda f, data: data)
    validation_dataset = validation_dataset.map(lambda f, data: data)

    return train_dataset, validation_dataset
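A quick sketch of how it might be used with the labeled_ds and BATCH_SIZE from the question (the shuffle buffer size is an arbitrary illustrative value):
# Hold out roughly 20% of the examples for validation.
train_ds, val_ds = split_dataset(labeled_ds, 0.2)

# Shuffle and batch only the training pipeline; the validation pipeline is just batched.
train_ds = train_ds.shuffle(1000).batch(BATCH_SIZE)
val_ds = val_ds.batch(BATCH_SIZE)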