One-hot encoding using tf.data mixes up columns

Question

Minimal working example

Consider the following CSV file (example.csv):

animal,size,weight,category
lion,large,200,mammal
ostrich,large,150,bird
sparrow,small,0.1,bird
whale,large,3000,mammal
bat,small,0.2,mammal
snake,small,1,reptile
condor,medium,12,bird

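The goal is to convert all categorical values to one-hot encodings. The standard way to do this in TensorFlow 2.0 is to use tf.data. Following this example, the code to process the dataset above is: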

import collections
import tensorflow as tf

# Load the dataset.
dataset = tf.data.experimental.make_csv_dataset(
    'example.csv',
    batch_size=5,
    num_epochs=1,
    shuffle=False)

# Specify the vocabulary for each category.
categories = collections.OrderedDict()
categories['animal'] = ['lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor']
categories['size'] = ['large', 'medium', 'small']
categories['category'] = ['mammal', 'reptile', 'bird']

# Define the categorical feature columns.
categorical_columns = []
for feature, vocab in categories.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

# Retrieve the first batch and apply the one-hot encoding to it.
iterator = iter(dataset)
first_batch = next(iterator)
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)

print(categorical_layer(first_batch).numpy())

Problem

Running the code above, one gets:

[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.]]

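The last two columns, size and category, appear to be flipped, even though categories is an ordered dictionary and the columns already appear in that order in the dataset itself. It is as if tf.feature_column.categorical_column_with_vocabulary_list() performs some unwarranted alphabetical sorting of the columns.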

What is the reason for this? Is this really the best way to do one-hot encoding in the spirit of tf.data?

Answer 1

Where is the sorting?

The sorting does not take place in tf.feature_column.categorical_column_with_vocabulary_list(). If you print categorical_columns, you will see that the columns are still in the order in which you added them to the list:

[
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='animal', vocabulary_list=('lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='size', vocabulary_list=('large', 'medium', 'small'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='category', vocabulary_list=('mammal', 'reptile', 'bird'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
]

The sorting takes place in the tf.keras.layers.DenseFeatures object.

In the code, you can see that the sorting takes place here (I found this by tracing the class inheritance from the tf.keras.layers.DenseFeatures class to the tensorflow.python.feature_column.dense_features.DenseFeatures class to the tensorflow.python.feature_column.feature_column_v2._BaseFeaturesLayer class to the _normalize_feature_columns function).
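The effect of that sorting on the output layout can be reproduced without TensorFlow. The following is only a minimal sketch, assuming the layer concatenates its columns in name-sorted order (which gives animal, category, size for this dataset); sorted_layout is a hypothetical helper written for illustration, not part of TensorFlow.

```python
import collections

# Vocabularies in the order the question declares them.
categories = collections.OrderedDict()
categories['animal'] = ['lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor']
categories['size'] = ['large', 'medium', 'small']
categories['category'] = ['mammal', 'reptile', 'bird']


def sorted_layout(categories):
    """Return {feature: (start, stop)} offsets into the concatenated output.

    Assumption: features are concatenated in name-sorted order,
    not in the order in which they were declared.
    """
    layout = collections.OrderedDict()
    offset = 0
    for feature in sorted(categories):
        width = len(categories[feature])
        layout[feature] = (offset, offset + width)
        offset += width
    return layout


layout = sorted_layout(categories)
for feature, (start, stop) in layout.items():
    print(feature, start, stop)
```

Under that assumption, category occupies positions 7 to 9 and size occupies positions 10 to 12, which agrees with the rows printed in the question.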

Why the sorting?

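So why the sorting? Elsewhere in the same file that contains the _normalize_feature_columns function (the function that performs the sorting), there is a similar sorting function with the following comment: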

# Sort the columns so the default collection name is deterministic even if the
# user passes columns from an unsorted collection, such as dict.values().

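I think this explanation also applies to why the columns are sorted when using the tf.keras.layers.DenseFeatures class. Your columns and data are consistent, but TensorFlow does not assume that the input is consistent, so it sorts the columns to ensure a consistent order.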
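If the declared order matters downstream, one option is to slice the encoded output per feature and concatenate the slices in the order you want. Here is a rough sketch in plain Python, applied to the second row of the output from the question; reorder_row is a hypothetical helper, and the name-sorted input layout is an assumption about the layer's behaviour rather than documented API.

```python
def reorder_row(row, vocab_sizes, desired_order):
    """Rearrange one encoded row from name-sorted feature order
    into desired_order. vocab_sizes maps feature name to vocabulary size."""
    slices = {}
    offset = 0
    # Assumption: the incoming row is laid out in name-sorted feature order.
    for feature in sorted(vocab_sizes):
        width = vocab_sizes[feature]
        slices[feature] = row[offset:offset + width]
        offset += width
    reordered = []
    for feature in desired_order:
        reordered.extend(slices[feature])
    return reordered


vocab_sizes = {'animal': 7, 'size': 3, 'category': 3}
# Second row of the output in the question (ostrich, large, bird).
row = [0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.]
fixed = reorder_row(row, vocab_sizes, ['animal', 'size', 'category'])
print(fixed)
```

After reordering, the size block (large) comes before the category block (bird), matching the column order of the CSV file.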
