How to bulk write TFRecords?

Question

I have a CSV with approximately 40 million rows. Each row is a training instance. As per the documentation on consuming TFRecords, I am trying to encode the data and save it in a TFRecord file.

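All the examples I have found (even the ones in the TensorFlow repo) show that creating a TFRecord relies on the class TFRecordWriter. This class has a method write that takes a serialized string representation of the data as input and writes it to disk. However, this appears to be done one training instance at a time.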

How do I write the serialized data in batches?

Let's say I have a function:

  def write_row(sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

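Writing to disk 40 million times (once per example) is going to be incredibly slow. It would be far more efficient to batch this data and write 50k or 100k examples at a time (as far as the machine's resources allow). However, there does not appear to be any method for doing this in TFRecordWriter.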

Something along these lines:

class MyRecordWriter:

  def __init__(self, writer):
    self.records = []
    self.counter = 0
    self.writer = writer

  def write_row_batched(self, sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    self.records.append(example.SerializeToString())
    self.counter += 1
    if self.counter >= 10000:
      self.writer.write(os.linesep.join(self.records))
      self.counter = 0
      self.records = []

But when reading a file created by this method I get the following error:

tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Could not parse example input, value: '
��

label

��
one_hot����
��

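Note: I could change the encoding process so that each example proto contains several thousand examples instead of just one, but I don't want to pre-batch the data when writing to the TFRecord file in this way, as it would introduce extra overhead in my training pipeline when I want to use the file for training with different batch sizes.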

Answer 1

TFRecords is a binary format. With the following line you are treating it like a text file: self.writer.write(os.linesep.join(self.records))

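That is because os.linesep depends on the operating system (either \n or \r\n), and joining on it inserts those bytes between the serialized examples. The writer then stores the whole joined blob as a single record, which cannot be parsed as one Example.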
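To illustrate why joining with a separator cannot work for a binary record format, here is a minimal sketch using only the standard library. The frame and unframe helpers are hypothetical and implement a simplified length-prefixed framing; the real TFRecord format also stores checksums, which this sketch omits. The byte strings are stand-ins for serialized protos.

```python
import os
import struct

def frame(record: bytes) -> bytes:
    """Prefix a record with its length as a little-endian uint64 (simplified framing)."""
    return struct.pack("<Q", len(record)) + record

def unframe(blob: bytes) -> list:
    """Split a framed byte stream back into the records that were written."""
    records, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<Q", blob, pos)
        pos += 8
        records.append(blob[pos:pos + length])
        pos += length
    return records

# Stand-ins for serialized protos; binary data may itself contain newline bytes.
records = [b"\x0a\x05label", b"\x0a\x07one_hot", b"\x0a\x04text"]

# One write per record: the reader gets back exactly what was written.
per_record = b"".join(frame(r) for r in records)

# One write of the joined blob: the reader sees a single oversized record.
joined = frame(os.linesep.encode().join(records))

print(len(unframe(per_record)))  # 3
print(len(unframe(joined)))      # 1
```

The point is that record boundaries come from the framing the writer adds on every write call, not from any delimiter inside the payload.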

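Solution: just write the records. You are asking to write them in batches; for that you can use a buffered writer. For 40 million rows you might also want to consider splitting the data into separate files to allow better parallelization.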
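A rough sketch of what that could look like: one write call per serialized example, spread round-robin over several shard files. The function name write_sharded, the open_writer hook, and the file name pattern are made up for illustration; by default the sketch assumes TensorFlow is installed and uses tf.io.TFRecordWriter.

```python
def write_sharded(serialized_examples, path_template, num_shards, open_writer=None):
    """Write each serialized example as its own record, round-robin over shards.

    open_writer is a hypothetical hook (path -> object with write() and close()).
    It defaults to tf.io.TFRecordWriter, assuming TensorFlow is installed.
    Returns the number of records written to each shard.
    """
    if open_writer is None:
        import tensorflow as tf  # imported lazily so the sketch loads without TF
        open_writer = tf.io.TFRecordWriter

    writers = [open_writer(path_template.format(shard=i, total=num_shards))
               for i in range(num_shards)]
    counts = [0] * num_shards
    try:
        for i, record in enumerate(serialized_examples):
            shard = i % num_shards
            writers[shard].write(record)  # one call per record, no manual batching
            counts[shard] += 1
    finally:
        for writer in writers:
            writer.close()
    return counts
```

With the function from the question, this could be fed from a generator that yields example.SerializeToString() for every CSV row, for instance write_sharded(rows, "train-{shard:05d}-of-{total:05d}.tfrecord", 16). Because every example stays a separate record, the batch size can still be chosen freely at training time.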

When using TFRecordWriter: the file is already buffered.

Evidence for that is found in the sources:

tf_record.py calls pywrap_tensorflow.PyRecordWriter_New
PyRecordWriter calls Env::Default()->NewWritableFile
Env->NewWritableFile calls NewWritableFile on the matching FileSystem
e.g. PosixFileSystem calls fopen
fopen returns a stream which "is fully buffered by default if it is known to not refer to an interactive device"
That will be file system dependent, but WritableFile notes "The implementation must provide buffering since callers may append small fragments at a time to the file."
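The effect of such buffering can be shown with Python's own io module. This is only an analogy and not TensorFlow's C++ implementation: CountingRawFile is a hypothetical in-memory stream that counts how many writes actually reach the underlying "device", and the 64 KiB buffer size is an arbitrary choice.

```python
import io

class CountingRawFile(io.RawIOBase):
    """In-memory raw stream that counts the writes that actually reach it."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)

raw = CountingRawFile()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)

# 10,000 small writes, similar to writing one short record at a time.
for _ in range(10_000):
    buffered.write(b"x" * 100)
buffered.flush()

print(len(raw.data))  # 1000000
print(raw.calls)      # far fewer than 10,000
```

The many small writes are coalesced in memory and only reach the underlying stream in large chunks, which is why calling write once per example is not the same as touching the disk once per example.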

