How to load pickle files with tensorflow's tf.data API

My data is stored in multiple pickle files on disk. I want to use tensorflow's tf.data.Dataset to load the data into my training pipeline. My code is:

def _parse_file(path):
    image, label = *load pickle file*
    return image, label
paths = glob.glob('*.pkl')
print(len(paths))
dataset = tf.data.Dataset.from_tensor_slices(paths)
dataset = dataset.map(_parse_file)
iterator = dataset.make_one_shot_iterator()

The problem is that I don't know how to implement the _parse_file function. Its path argument is a tensor, not a plain Python string. I tried

def _parse_file(path):
    with tf.Session() as s:
        p = s.run(path)
        image, label = pickle.load(open(p, 'rb'))
    return image, label

and got the following error message:

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'arg0' with dtype string
     [[Node: arg0 = Placeholder[dtype=DT_STRING, shape=<unknown>, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

After searching around online, I still don't know how to do this. I would be grateful for any hints.

I solved this problem myself. I should use tf.py_func, as shown in the doc.

tf.py_func is exactly the function meant to solve that problem, and it is mentioned in the documentation as well.
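
For reference, here is a minimal sketch of that approach using tf.py_function (the TF2 replacement for the deprecated tf.py_func). The path a mapped function receives is a symbolic string tensor, not a Python string, which is why opening a new Session inside it fails; tf.py_function instead runs the loader eagerly and hands it a concrete tensor. The (image, label) pickle layout and the float32 image dtype are assumptions about the data, not something stated in the question:

import glob
import pickle

import numpy as np
import tensorflow as tf

def _load_pickle(path):
    # path is a scalar string tensor here; .numpy() yields the raw bytes
    with open(path.numpy(), 'rb') as fin:
        image, label = pickle.load(fin)
    return image.astype(np.float32), np.int64(label)

def _parse_file(path):
    # Wrap the plain-Python loader so tf.data can call it on tensors
    image, label = tf.py_function(_load_pickle, [path], [tf.float32, tf.int64])
    return image, label

paths = glob.glob('*.pkl')
dataset = tf.data.Dataset.from_tensor_slices(paths).map(_parse_file)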

This is how I solved the problem. I did not use tf.py_func; see the function load_encoding() below, which is the part that actually reads the pickle. FACELIB_DIR contains directories of pickled vggface2 encodings, each directory named after the person whose face encodings it holds.

import tensorflow as tf
import pickle
import os

FACELIB_DIR='/var/noggin/FaceEncodings'

# Get list of all classes & build a quick int-lookup dictionary
labelNames = sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR,x)) and not x.startswith('.')])
labelStrToInt = dict([(x,i) for i,x in enumerate(labelNames)])

# Function load_encoding - Loads Encoding data from enc2048 file in filepath
#    This reads an encoding from disk and derives the integer class label from the file path; returns both
def load_encoding(file_path):
    with open(os.path.join(FACELIB_DIR,file_path),'rb') as fin:
        A,_ = pickle.loads(fin.read())    # encodings, source_image_name
    label_str = file_path.split(os.path.sep)[-2]   # parent directory = person's name
    return (A, labelStrToInt[label_str])

# Build the dataset of every enc2048 file in our data library
encpaths = []
for D in sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR,x)) and not x.startswith('.')]):
    # All the encoding files
    encfiles = sorted(filter((lambda x: x.endswith('.enc2048')), os.listdir(os.path.join(FACELIB_DIR, D))))
    encpaths += [os.path.join(D,x) for x in encfiles]
# Load every encoding up front with plain Python (no tf.py_func needed),
# then build the dataset from the in-memory (encoding, label) pairs
encodings, labels = zip(*(load_encoding(p) for p in encpaths))
dataset = tf.data.Dataset.from_tensor_slices((list(encodings), list(labels)))

# Shuffle and speed improvements on the dataset
BATCH_SIZE = 64
from tensorflow.data import AUTOTUNE
dataset = (dataset
    .shuffle(1024)
    .cache()
    .repeat()
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)
    
# Benchmark our tf.data pipeline
import time
datasetGen = iter(dataset)
NUM_STEPS = 10000
start_time = time.time()
for i in range(0, NUM_STEPS):
    X = next(datasetGen)
totalTime = time.time() - start_time
print('==> tf.data generated {} tensors in {:.2f} seconds'.format(BATCH_SIZE * NUM_STEPS, totalTime))
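
Once the pipeline yields (encoding, label) batches, it can be fed straight into Keras. Below is a hypothetical sketch; the 2048-dimensional input (suggested by the .enc2048 extension) and the layer sizes are assumptions:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(2048,)),  # 2048-dim encodings (assumed)
    tf.keras.layers.Dense(len(labelNames), activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# The dataset repeats forever, so steps_per_epoch must be set explicitly
model.fit(dataset, steps_per_epoch=len(encpaths) // BATCH_SIZE, epochs=10)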