TensorFlow 2.6: num_parallel_calls is greater than 1 but only one CPU core is used most of the time

Question

I wrote a tf.data pipeline that looks like this (TF 2.6):

def parse(img):
    image = tf.image.decode_png(img, channels=3)
    image = tf.reshape(image, IMG_SHAPE)
    image = tf.cast(image, TARGET_DTYPE)
    return image


def decode_batch(serialized_example, is_test=False):
    feature_dict = {
        'image': tf.io.FixedLenFeature(shape=[], dtype=tf.string, default_value=''),
    }
    
    if not is_test:
        feature_dict["some_text"] = tf.io.FixedLenFeature(shape=[MAX_LEN], dtype=tf.int64, default_value=[0]*MAX_LEN)
    else:
        feature_dict["image_id"] = tf.io.FixedLenFeature(shape=[], dtype=tf.string, default_value='')

    features = tf.io.parse_example(tf.reshape(serialized_example, [BATCH_SIZE_OVERALL]), features=feature_dict)
    images = tf.map_fn(parse, features['image'], parallel_iterations=4, fn_output_signature=TARGET_DTYPE)

    if is_test:
        image_ids = features["image_id"] 
        return images, image_ids
    else:
        targets = tf.cast(features["some_text"], tf.uint8)
        return images, targets


def get_dataset(filenames, is_test):
    opts = tf.data.Options()
    opts.experimental_deterministic = False
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.with_options(opts)
    dataset = dataset.interleave(lambda x:
        tf.data.TFRecordDataset(x),
        cycle_length=4,
        num_parallel_calls=4,
    )
    dataset = dataset.batch(BATCH_SIZE_OVERALL, num_parallel_calls=4, drop_remainder=True)
    if not is_test:
        dataset = dataset.repeat()
        dataset = dataset.shuffle(BATCH_SIZE_OVERALL*6)
    dataset = dataset.map(lambda y: decode_batch(y, is_test), num_parallel_calls=4)

    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    return dataset


train_ds = get_dataset(TRAIN_TFREC_PATHS, False)

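As you can see from the code, I applied most of the tips from the TF guide on building tf.data pipelines properly. The problem is this: when training starts, the code does not use all 4 cores but only 1 (sometimes more cores are used, but that seems to be caused by the train_dist_ds.get_next() call in the code below). In addition, the GPU is hardly utilized at all. The profiler says the problem is in preprocessing, and tf_data_bottleneck_analysis says the problem is in ParallelBatch (although once it pointed to ParallelMap, which seems plausible, but that does not say much by itself: the cores are still underutilized). The training function with the profiler looks like this: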

def fit_profile(train_ds, val_ds, stop_after_steps):
    tf.profiler.experimental.start('logdir')
    stat_logger.current_step = 0

    train_dist_ds = iter(train_ds)

    while True:
        stat_logger.batch_start_time = time.time()
        stat_logger.current_step += 1
        print(f'current step: {stat_logger.current_step}')
        with tf.profiler.experimental.Trace('train', step_num=stat_logger.current_step, _r=1):
            image_batch, some_text_batch = train_dist_ds.get_next()
        train_step(image_batch, some_text_batch)
        if stat_logger.current_step == stop_after_steps:
            break
            
    tf.profiler.experimental.stop()

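As you can see, I do not touch the dataset and I do not put it into any strategy; that is inside train_step (which is, of course, wrapped in @tf.function). Question: is there a way to debug the computation of tf.data operations inside the graph? In particular, at the level of individual calls to each tf.data API function in preprocessing, so that I can understand what exactly to optimize. What could be the reason that only one core is used?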

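What I have tried so far:

Setting all tunable parameters to tf.data.AUTOTUNE: no effect;
Iterating over the dataset object on its own: in this case all cores were used, from which I conclude that the problem is at the graph execution level and that parallelism is not switched off globally;
Turning off the profiler: no effect;
Lowering parallel_iterations in the map_fn call: no effect;
Many odd settings of num_parallel_calls: no effect, to the point that the value seems not to matter at all.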
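
The "iterate over the dataset on its own" check from the list above can be turned into a number that is easy to compare between runs. The following is only a minimal sketch: measure_throughput is a hypothetical helper (not part of TensorFlow), and it works on any Python iterable, which in eager mode includes a tf.data.Dataset.

```python
import itertools
import time


def measure_throughput(batches, num_batches=50, warmup=5):
    """Pull batches from any iterable and return batches per second.

    Hypothetical helper for illustration; it only consumes items and
    never runs a training step, so it measures the input side alone.
    """
    it = iter(batches)
    # Discard warm-up batches so one-time setup cost is not counted.
    for _ in itertools.islice(it, warmup):
        pass
    start = time.perf_counter()
    consumed = sum(1 for _ in itertools.islice(it, num_batches))
    elapsed = time.perf_counter() - start
    if consumed == 0:
        return 0.0
    return consumed / elapsed if elapsed > 0 else float("inf")
```

For example, measure_throughput(train_ds) could be compared with the steps per second seen during training. If the standalone rate is much higher, the input pipeline itself is probably not the limiting factor.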

Answer 1

I finally found the cause of this behavior. It was caused by using XLA with the GPU.

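I stumbled upon this article and decided to turn off XLA, and, good grief, after almost a week of investigation, the GPU was fully utilized and training times became far more sane (before that they were equal to CPU training times!). As the article says: 1) GPU support in XLA is experimental; 2) tensors need to have inferable shapes; 3) all operations in the graph must be supported by XLA. The signs of this kind of problem are low CPU and GPU utilization and jumpy training steps, i.e. one step takes 150 seconds, the next 8-10 steps take 1 second each, and then the pattern repeats. The article discusses TF 1.x, but not much seems to have changed on this topic so far (again, I am using TF 2.6).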
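
The jumpy step pattern described above is easy to spot once per-step durations are recorded. Below is a minimal sketch, assuming the step times have been collected into a plain list of seconds (for instance from differences of time.time() around each step); find_slow_steps and its factor threshold are hypothetical names chosen for illustration.

```python
import statistics


def find_slow_steps(step_seconds, factor=10.0):
    """Return indices of steps slower than `factor` times the median step.

    Hypothetical helper: the threshold of 10x is an arbitrary assumption,
    not a value taken from TensorFlow or from the article.
    """
    if not step_seconds:
        return []
    median = statistics.median(step_seconds)
    return [i for i, s in enumerate(step_seconds) if s > factor * median]


# One very slow step followed by eight fast ones, repeated twice.
times = [150.0] + [1.0] * 8 + [150.0] + [1.0] * 8
print(find_slow_steps(times))  # [0, 9]
```

Slow steps that recur at regular intervals, as in this example, match the symptom described in the answer; a single slow first step on its own is more likely ordinary tracing and warm-up cost.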

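Main takeaways:

Do not blindly use XLA with a GPU; used incorrectly, it can slow your GPU training down to CPU level.
If you do use XLA with a GPU, make sure you meet the requirements listed above.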

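If I manage to meet these XLA requirements in my computations and turn XLA on so that it improves performance rather than degrading it, I will update this answer.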
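
The answer does not say how XLA was enabled, so the following is only a sketch of two common switches, under stated assumptions. The first helper edits the TF_XLA_FLAGS environment variable, where --tf_xla_auto_jit=-1 turns auto-clustering off and 1 or 2 turn it on. The second wraps a step function with the jit_compile argument of tf.function. Both set_xla_auto_jit and build_train_step are hypothetical helper names, not TensorFlow APIs.

```python
import os


def set_xla_auto_jit(level, env=None):
    """Set --tf_xla_auto_jit in TF_XLA_FLAGS, keeping any other flags.

    level: -1 = auto-clustering off, 1 or 2 = on.
    Call this before TensorFlow is imported so the flag is picked up.
    """
    env = os.environ if env is None else env
    # Drop any previous auto_jit setting, keep unrelated flags as they are.
    kept = [flag for flag in env.get("TF_XLA_FLAGS", "").split()
            if not flag.startswith("--tf_xla_auto_jit")]
    kept.append(f"--tf_xla_auto_jit={level}")
    env["TF_XLA_FLAGS"] = " ".join(kept)
    return env["TF_XLA_FLAGS"]


def build_train_step(step_fn, use_xla):
    """Wrap step_fn in tf.function, requesting XLA compilation only if use_xla."""
    import tensorflow as tf  # imported late, after TF_XLA_FLAGS has been set
    return tf.function(step_fn, jit_compile=use_xla)
```

Timing the same training loop with XLA off and then on is a simple way to check whether XLA helps or hurts for a given model. Note that an explicit jit_compile=True is a separate switch from the environment flag, so both may need to be checked.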
