Referencing and tokenizing a single feature column in a multi-feature TensorFlow Dataset
I am trying to tokenize a single column in a TensorFlow Dataset. The approach I have been using works fine if there is only one feature column, for example:
text = ["I played it a while but it was alright. The steam was a bit of trouble."
" The more they move these game to steam the more of a hard time I have"
" activating and playing a game. But in spite of that it was fun, I "
"liked it. Now I am looking forward to anno 2205 I really want to "
"play my way to the moon.",
"This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
df = pd.DataFrame({"text": text,
"target": target})
training_dataset = (
tf.data.Dataset.from_tensor_slices((
tf.cast(df.text.values, tf.string),
tf.cast(df.target, tf.int32))))
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
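For reference, the resulting encoder maps raw strings to token ids. A minimal usage sketch (assuming the same, now-deprecated, tfds.features.text API as above; the actual ids depend on the collected vocabulary):

ids = encoder.encode("This game is great")  # a list of ints, one id per token
print(ids)
print(encoder.decode(ids))                  # tokens joined back into a string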
However, when I try to do this where there is a set of feature columns, say from make_csv_dataset (where each feature column is named), the above method fails (ValueError: Attempt to convert a value (OrderedDict([])) to a Tensor.).
I have tried referencing a specific feature column within the for loop using:
text = ["I played it a while but it was alright. The steam was a bit of trouble."
" The more they move these game to steam the more of a hard time I have"
" activating and playing a game. But in spite of that it was fun, I "
"liked it. Now I am looking forward to anno 2205 I really want to "
"play my way to the moon.",
"This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
gender = [1, 0]
age = [45, 35]
df = pd.DataFrame({"text": text,
"target": target,
"gender": gender,
"age": age})
df.to_csv('test.csv', index=False)
dataset = tf.data.experimental.make_csv_dataset(
'test.csv',
batch_size=2,
label_name='target')
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
and I get the error Expected binary or unicode string, got array([]). What is the correct way to reference a single feature column so that I can tokenize it? Normally, you can reference a feature column inside a .map function using features['column_name'], e.g.:
def new_age_func(features, target):
    age = features['age']
    features['age'] = age / 2
    return features, target

dataset = dataset.map(new_age_func)

for features, target in dataset.take(2):
    print('Features: {}, Target {}'.format(features, target))
I have tried combining approaches and generating the vocabulary via a map function:
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()

def vocab_generator(features, target):
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

dataset = dataset.map(vocab_generator)
but this leads to the error:
AttributeError: in user code:

    <ipython-input-61-374e4c375b58>:10 vocab_generator  *
        tokens = tokenizer.tokenize(text.numpy())

    AttributeError: 'Tensor' object has no attribute 'numpy'
and changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) raises another error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>.
The error is simply that tokenizer.tokenize expects a string, while you are giving it a list. This simple edit will work: instead of handing the tokenizer a list of strings, I just loop over the batch and feed the strings to the tokenizer one at a time.
dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    # Feed the tokenizer one string at a time instead of the whole batch.
    for t in text:
        tokens = tokenizer.tokenize(t.numpy())
        vocabulary.update(tokens)
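With the vocabulary collected this way, the rest of the original pipeline carries over unchanged:

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)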
Each element of the dataset created by make_csv_dataset is a batch of rows of the CSV file, not a single row; that's why it requires batch_size as an input argument. On the other hand, the current for loop used for processing and tokenizing the text features expects a single input sample (i.e. a row) at a time. Hence, tokenizer.tokenize would fail given a batch of strings and raise TypeError: Expected binary or unicode string, got array(...).
One way of resolving this issue with minimal changes is to first unbatch the dataset, perform all the preprocessing on it, and then batch the dataset again. Fortunately, there is a built-in unbatch method you can use here:
dataset = tf.data.experimental.make_csv_dataset(
    ...,
    # This change is **IMPORTANT**, otherwise the `for` loop would continue forever!
    num_epochs=1
)

# Unbatch the dataset; this is required even if you have used `batch_size=1` above.
dataset = dataset.unbatch()

#############################################
#
# Do all the preprocessing on the dataset here...
#
#############################################

# When preprocessing is finished and you are ready to use your dataset:
#### 1. Batch the dataset (only if needed for or applicable to your specific workflow)
#### 2. Repeat the dataset (only if needed for or applicable to your specific workflow)
dataset = dataset.batch(BATCH_SIZE).repeat(NUM_EPOCHS or -1)
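For instance, with the dataset unbatched, the per-sample vocabulary loop from the question can be dropped into the preprocessing section above as-is. A sketch under the same setup (note that it belongs before the final batch/repeat step):

tokenizer = tfds.features.text.Tokenizer()
vocabulary = Counter()
for features, _ in dataset:
    # Each element is now a single row, so features['text'] is a scalar string tensor.
    text = tf.strings.lower(features['text'])
    vocabulary.update(tokenizer.tokenize(text.numpy()))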
An alternative solution, suggested in @NicolasGervais's answer, is to adjust and modify all the preprocessing code to operate on a batch of samples at a time, instead of on a single sample.