Label tokenizer not working, loss and accuracy cannot be calculated

I am using Keras/TensorFlow for NLP and am currently working on the imdb reviews dataset. I want to make use of hub.KerasLayer and pass the actual x and y values directly, so in my model.fit call the sentences are x and the labels are y. My code:

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[], 
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])

Trying

history = model.fit(x=training_sentences,
                      y=training_labels,
                      validation_data=(test_sentences, test_labels),
                      epochs=2)

does not work because training_labels is not in the right shape/format. My idea now is to use a tokenizer again, since that gives me results in the right format/shape (from texts_to_sequences). To do that, I first have to convert the labels to "yes"/"no" (or "a"/"b", etc.) strings.

training_labels_test = []
for i in training_labels:
    if i == 0: training_labels_test.append("no")
    if i == 1: training_labels_test.append("yes")

testtokenizer = Tokenizer()
testtokenizer.fit_on_texts(training_labels_test)
test_labels_pad = testtokenizer.texts_to_sequences(training_labels_test)

val_labels_test = []
for i in test_labels:
    if i == 0: val_labels_test.append("no")
    if i == 1: val_labels_test.append("yes")

testtokenizer.fit_on_texts(val_labels_test)
val_labels_pad = testtokenizer.texts_to_sequences(val_labels_test)
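
A quick check of the tokenizer's mapping shows why I now get 1 and 2: Tokenizer never assigns index 0 (it is conventionally reserved for padding), so word indices start at 1. The exact outputs below are examples:

print(testtokenizer.word_index)  # e.g. {'no': 1, 'yes': 2}
print(test_labels_pad[:3])       # e.g. [[1], [2], [2]]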

Since I now have 1 and 2 as labels, I need to update my model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

Then I try to fit it:

history = model.fit(x=training_sentences,
                      y=test_labels_pad,
                      validation_data=(test_sentences, val_labels_pad),
                      epochs=2)

The problem is that the loss is nan and the accuracy is not calculated correctly.

Where is the mistake?

Please note that my question is really about this specific approach and why this tokenizer does not work; I am aware that there are other possibilities that do work.

This problem seems to have two aspects.

First, binary targets should always be [0, 1], not [1, 2], so I subtract one from your targets. Tokenizer() is not meant for encoding labels; you should use tfds.features.ClassLabel() for that (see the sketch after the fit() call below). For now, I just subtract 1 inside the fit() call:

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                                       list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)
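
For completeness, here is a minimal sketch of how the label encoding could be done with tfds.features.ClassLabel instead of Tokenizer; the names list is my assumption for this task. It maps "no"/"yes" straight to 0/1, so no off-by-one correction is needed afterwards:

import tensorflow_datasets as tfds

# Hypothetical label encoder: str2int maps each name to its index in `names`.
label_encoder = tfds.features.ClassLabel(names=["no", "yes"])

training_labels_enc = [label_encoder.str2int(s) for s in training_labels_test]
val_labels_enc = [label_encoder.str2int(s) for s in val_labels_test]
print(training_labels_enc[:3])  # e.g. [0, 1, 1]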

Second, for some reason your input layer returns only nan (a quick way to check this yourself is sketched below). On the page of the pretrained model they say:

google/tf2-preview/gnews-swivel-20dim-with-oov/1 - same as google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% vocabulary converted to OOV buckets. This can help if vocabulary of the task and vocabulary of the model don't fully overlap.

So you should use the second one, since the vocabulary of your dataset does not fully overlap with the data the model was trained on. Then your model will start learning:

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)
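
To verify the nan claim yourself, one (hypothetical) sanity check is to call the layer directly on a small batch of sentences and test whether every returned embedding value is finite:

import numpy as np

# Feed raw strings through the embedding layer and look for non-finite values.
sample = tf.constant(training_sentences[:32])
embeddings = hub_layer(sample).numpy()
print(np.isfinite(embeddings).all())  # False would point at the embedding layer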

Full running code:

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

training_labels_test = []
for i in training_labels:
    if i == 0: training_labels_test.append("no")
    if i == 1: training_labels_test.append("yes")

testtokenizer = Tokenizer()
testtokenizer.fit_on_texts(training_labels_test)
test_labels_pad = testtokenizer.texts_to_sequences(training_labels_test)

val_labels_test = []
for i in test_labels:
    if i == 0: val_labels_test.append("no")
    if i == 1: val_labels_test.append("yes")

testtokenizer.fit_on_texts(val_labels_test)
val_labels_pad = testtokenizer.texts_to_sequences(val_labels_test)

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                      list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)

model.predict(training_sentences)
24896/25000 [==================>.] - ETA: 0s - loss: 0.5482 - sparse_cat_acc: 0.7312
array([[-0.94201976, -1.3173063 ],
       [-3.7894788 , -3.0269182 ],
       [-3.0404441 , -3.4826043 ],
       ...,
       [-2.8379505 , -1.2451388 ],
       [-0.7685702 , -3.1836908 ],
       [-1.7252465 , -3.8163807 ]], dtype=float32)
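
Since the model outputs two logits per review, class predictions can be read off with an argmax over the last axis (a small addition of mine, not part of the original code):

logits = model.predict(training_sentences)
# Index of the larger logit; the class-name mapping follows the tokenizer's
# word_index after the -1 shift above.
predicted_classes = np.argmax(logits, axis=1)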

See what happens if you have 3 classes and use [1, 2, 3] instead of [0, 1, 2]: the sparse labels are used as indices into the prediction vector, so the label 3 falls outside a 3-class output and the loss becomes nan:

y_true = tf.constant([1, 2, 3])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()
nan

But it works with [0, 1, 2]:

y_true = tf.constant([0, 1, 2])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()
1.3783889