How to save TextVectorization to disk in TensorFlow?
I have trained a TextVectorization layer (see below), and I want to save it to disk so that I can reload it next time. I have tried pickle and joblib.dump(). Neither works.
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# text_clean is my own collection of cleaned strings (not shown here)
text_dataset = tf.data.Dataset.from_tensor_slices(text_clean)
vectorizer = TextVectorization(max_tokens=100000, output_mode='tf-idf', ngrams=None)
vectorizer.adapt(text_dataset.batch(1024))
The error generated is:
InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array
How can I save it?
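You can do this with a bit of a hack. Construct your TextVectorization object, then put it in a model. Save the model to save the vectorizer. Loading the model will reproduce the vectorizer. See the example below.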
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
data = [
    "The sky is blue.",
    "Grass is green.",
    "Hunter2 is my password.",
]
# Create vectorizer.
text_dataset = tf.data.Dataset.from_tensor_slices(data)
vectorizer = TextVectorization(
    max_tokens=100000, output_mode='tf-idf', ngrams=None,
)
vectorizer.adapt(text_dataset.batch(1024))
# Create model.
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorizer)
# Save.
filepath = "tmp-model"
model.save(filepath, save_format="tf")
# Load.
loaded_model = tf.keras.models.load_model(filepath)
loaded_vectorizer = loaded_model.layers[0]
Here is a test that the two vectorizers (original and loaded) produce the same output.
import numpy as np
np.testing.assert_allclose(loaded_vectorizer("blue"), vectorizer("blue"))
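Instead of pickling the object, pickle its config and weights. Later, unpickle them, use the config to create the object, and load the saved weights. Official docs here.
Code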
import pickle
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_dataset = tf.data.Dataset.from_tensor_slices([
    "this is some clean text",
    "some more text",
    "even some more text"])
# Fit a TextVectorization layer
vectorizer = TextVectorization(max_tokens=10, output_mode='tf-idf', ngrams=None)
vectorizer.adapt(text_dataset.batch(1024))
# Vector for word "this"
print(vectorizer("this"))
# Pickle the config and weights
with open("tv_layer.pkl", "wb") as f:
    pickle.dump({'config': vectorizer.get_config(),
                 'weights': vectorizer.get_weights()}, f)
print("*" * 10)
# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.
with open("tv_layer.pkl", "rb") as f:
    from_disk = pickle.load(f)
new_v = TextVectorization.from_config(from_disk['config'])
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v.set_weights(from_disk['weights'])
# Let's see the vector for word "this"
print(new_v("this"))
Output:
tf.Tensor(
[[0. 0. 0. 0. 0.91629076 0.
0. 0. 0. 0. ]], shape=(1, 10), dtype=float32)
**********
tf.Tensor(
[[0. 0. 0. 0. 0.91629076 0.
0. 0. 0. 0. ]], shape=(1, 10), dtype=float32)
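The save and load halves of this approach can be wrapped in two small helpers. The following is only a sketch: `save_layer_state`, `load_layer_state` and `StubLayer` are hypothetical names, and `StubLayer` is a stand-in that exposes `get_config()` and `get_weights()` so the snippet runs without TensorFlow. With a real adapted layer you would pass the vectorizer itself.

```python
import os
import pickle
import tempfile


def save_layer_state(layer, path):
    """Pickle only the layer's config and weights, not the layer object."""
    state = {"config": layer.get_config(), "weights": layer.get_weights()}
    with open(path, "wb") as fh:  # the context manager closes the file
        pickle.dump(state, fh)


def load_layer_state(path):
    """Return the saved dict with 'config' and 'weights' keys."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


class StubLayer:
    """Hypothetical stand-in exposing the two methods the helpers rely on."""

    def __init__(self, max_tokens, weights):
        self.max_tokens = max_tokens
        self.weights = weights

    def get_config(self):
        return {"max_tokens": self.max_tokens}

    def get_weights(self):
        return self.weights


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "tv_layer.pkl")
save_layer_state(StubLayer(max_tokens=10, weights=[[0.0, 0.9]]), path)
state = load_layer_state(path)
print(state["config"], state["weights"])
```

The returned dict is then used as in the answer above: build the layer from `state["config"]`, call `adapt` on dummy data, and pass `state["weights"]` to `set_weights`.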
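Borrowing @jakub's trick of using a model as the vehicle: I could not load the saved model, so I ended up going the JSON serialization route, as shown below.
Note that the TextVectorization layer requires tensorflow>=2.7, and that the same version must be used to save and load the layer/model.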
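One way to guard against a version mismatch is to store the version string next to the saved state and check it on load. This is a minimal sketch, not part of any TensorFlow API: `save_with_version` and `load_checked` are hypothetical helpers, and the literal "2.7.0" stands in for `tf.__version__` so the snippet runs without TensorFlow.

```python
import os
import pickle
import tempfile


def save_with_version(payload, path, version):
    """Store the payload together with the library version that produced it."""
    with open(path, "wb") as fh:
        pickle.dump({"version": version, "payload": payload}, fh)


def load_checked(path, current_version):
    """Load the payload, refusing to proceed on a version mismatch."""
    with open(path, "rb") as fh:
        record = pickle.load(fh)
    if record["version"] != current_version:
        raise RuntimeError(
            f"saved with {record['version']}, running {current_version}"
        )
    return record["payload"]


path = os.path.join(tempfile.mkdtemp(), "tv_layer_versioned.pkl")
# With TensorFlow installed, pass tf.__version__ instead of the literal.
save_with_version({"config": {"max_tokens": 10}}, path, version="2.7.0")
payload = load_checked(path, current_version="2.7.0")
print(payload)
```

Raising on mismatch is a deliberately strict choice; a warning may be enough if you know the two versions are compatible.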
So, picking up from the middle of @jakub's excellent example:
# Save the model architecture (layer configs) as JSON.
model_json = model.to_json()
with open(filepath, "w") as model_json_fh:
    model_json_fh.write(model_json)
# Load.
with open(filepath, "r") as model_json_fh:
    loaded_model = tf.keras.models.model_from_json(model_json_fh.read())
loaded_vectorizer = loaded_model.layers[0]
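That's it.
I'm not sure what the advantages of one route over the other are.
This also explains how it is done:
https://machinelearningmastery.com/save-load-keras-deep-learning-models
This helps with the JSON errors you may run into along the way: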
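If anyone is wondering how to get a dense tensor instead of a ragged tensor when loading the config of a TextVectorization layer, try setting output_mode explicitly. The problem is related to a recent bug in which output_mode is not set correctly when it comes from a saved config.
This results in a dense tensor: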
import pickle
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_dataset = tf.data.Dataset.from_tensor_slices([
    "this is some clean text",
    "some more text",
    "even some more text"])
vectorizer = TextVectorization(max_tokens=10, output_mode='int', output_sequence_length=10)
vectorizer.adapt(text_dataset.batch(1024))
print(vectorizer("this"))
with open("tv_layer.pkl", "wb") as f:
    pickle.dump({'config': vectorizer.get_config(),
                 'weights': vectorizer.get_weights()}, f)
with open("tv_layer.pkl", "rb") as f:
    from_disk = pickle.load(f)
# output_mode is passed as a literal instead of being read from the saved config
new_vectorizer = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                   output_mode='int',
                                   output_sequence_length=from_disk['config']['output_sequence_length'])
new_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_vectorizer.set_weights(from_disk['weights'])
print(new_vectorizer("this"))
tf.Tensor([5 0 0 0 0 0 0 0 0 0], shape=(10,), dtype=int64)
tf.Tensor([5 0 0 0 0 0 0 0 0 0], shape=(10,), dtype=int64)
This results in a ragged tensor on loading:
import pickle
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_dataset = tf.data.Dataset.from_tensor_slices([
    "this is some clean text",
    "some more text",
    "even some more text"])
vectorizer = TextVectorization(max_tokens=10, output_mode='int', output_sequence_length=10)
vectorizer.adapt(text_dataset.batch(1024))
print(vectorizer("this"))
with open("tv_layer.pkl", "wb") as f:
    pickle.dump({'config': vectorizer.get_config(),
                 'weights': vectorizer.get_weights()}, f)
with open("tv_layer.pkl", "rb") as f:
    from_disk = pickle.load(f)
# output_mode is read from the saved config here, which triggers the problem
new_vectorizer = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                   output_mode=from_disk['config']['output_mode'],
                                   output_sequence_length=from_disk['config']['output_sequence_length'])
new_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_vectorizer.set_weights(from_disk['weights'])
print(new_vectorizer("this"))
tf.Tensor([5 0 0 0 0 0 0 0 0 0], shape=(10,), dtype=int64)
tf.Tensor([5], shape=(1,), dtype=int64)
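To make the difference between the two outputs concrete, here is a plain-Python sketch of what output_sequence_length does to a sequence of token ids in 'int' mode: it pads with zeros or truncates to exactly that length. This is an illustration only, not the Keras implementation; `pad_or_truncate` is a hypothetical helper, and a padding id of 0 is assumed.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut the tail or pad with `pad_id`."""
    ids = list(token_ids)[:length]
    return ids + [pad_id] * (length - len(ids))


# The unpadded result [5] from the second run, brought to length 10.
dense = pad_or_truncate([5], 10)
print(dense)
```

Applied to the second output above, this yields the same ten values as the first output, which is the form the layer should produce when output_mode is set correctly.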