How can I optimize my code to inverse transform the output of TextVectorization?
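I am using a TextVectorization layer in a TF Keras Sequential model. I need to convert the output of the intermediate TextVectorization layer back to plain text. I could not find a direct way to do this, so I used the layer's vocabulary to inverse transform the vectors. The code is as follows: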
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_list = np.array(["this is the first sentence.", "second line of the dataset."])  # an array of 2 sentences
textvectorizer = TextVectorization(max_tokens=None,
                                   standardize=None,
                                   ngrams=None,
                                   output_mode="int",
                                   output_sequence_length=None,
                                   pad_to_max_tokens=False)
textvectorizer.adapt(text_list)  # build the vocabulary from the sentences
vectors = textvectorizer(text_list)  # map each token to its vocabulary index
vectors
Vectors:
array([[ 3,  7,  2,  9,  4],
       [ 5,  6,  8,  2, 10]])
Now I want to convert the vectors back into sentences.
my_vocab = textvectorizer.get_vocabulary()
plain_text_list = []
for line in vectors:
    # look up each index in the vocabulary and join the tokens with spaces
    text = ' '.join(my_vocab[idx] for idx in line)
    plain_text_list.append(text)
print(plain_text_list)
Output:
['this is the first sentence.', 'second line of the dataset.']
This gives me the result I want. However, the approach is naive and becomes very slow on a large dataset. I would like to reduce its execution time.
Maybe try np.vectorize:
import numpy as np
my_vocab = textvectorizer.get_vocabulary()
index_vocab = dict(zip(np.arange(len(my_vocab)), my_vocab))
print(np.vectorize(index_vocab.get)(vectors))
[['this' 'is' 'the' 'first' 'sentence.']
 ['second' 'line' 'of' 'the' 'dataset.']]
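Note that the result above is an array of tokens, not the joined sentences the question asked for. A minimal sketch of the remaining step follows. It assumes a hardcoded `my_vocab` and `vectors` as stand-ins for `textvectorizer.get_vocabulary()` and the layer output, so that it runs without TensorFlow. Index 0 is the padding token (an empty string), which is skipped when joining.

```python
import numpy as np

# Stand-ins (assumptions) for textvectorizer.get_vocabulary() and the layer output.
my_vocab = ["", "[UNK]", "the", "this", "sentence.", "second",
            "line", "is", "of", "first", "dataset."]
vectors = np.array([[3, 7, 2, 9, 4],
                    [5, 6, 8, 2, 10]])

# Map every index to its token, as in the snippet above.
index_vocab = dict(enumerate(my_vocab))
tokens = np.vectorize(index_vocab.get)(vectors)

# Join each row into a sentence, dropping empty padding tokens.
sentences = [" ".join(tok for tok in row if tok) for row in tokens]
print(sentences)
```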
Performance test:
import numpy as np
import timeit

my_vocab = textvectorizer.get_vocabulary()

def method1(my_vocab, vectors):
    # np.vectorize lookup through an index -> token dictionary
    index_vocab = dict(zip(np.arange(len(my_vocab)), my_vocab))
    return np.vectorize(index_vocab.get)(vectors)

def method2(my_vocab, vectors):
    # original loop from the question
    plain_text_list = []
    for line in vectors:
        text = ' '.join(my_vocab[idx] for idx in line)
        plain_text_list.append(text)
    return plain_text_list

t1 = timeit.Timer(lambda: method1(my_vocab, vectors))
t2 = timeit.Timer(lambda: method2(my_vocab, vectors))
print(t1.timeit(5000))  # total seconds for 5000 runs of method1
print(t2.timeit(5000))  # total seconds for 5000 runs of method2
0.3139600929998778
19.671524284000043
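The two timed methods do not do quite the same work: method1 returns a token array while method2 returns joined sentences, and part of the gap likely comes from method2 iterating over a tf.Tensor element by element. Another option worth timing on your own data is plain NumPy integer-array indexing, which avoids the per-element Python call that np.vectorize makes. The sketch below is an alternative to the np.vectorize approach, not a measured improvement. It again uses a hardcoded `my_vocab` and `vectors` as stand-ins (assumptions); with the real layer you would first convert the output with `vectors.numpy()`.

```python
import numpy as np

# Stand-ins (assumptions) for textvectorizer.get_vocabulary() and vectors.numpy().
my_vocab = ["", "[UNK]", "the", "this", "sentence.", "second",
            "line", "is", "of", "first", "dataset."]
vectors = np.array([[3, 7, 2, 9, 4],
                    [5, 6, 8, 2, 10]])

# Turn the vocabulary into an array once, then index it with the whole
# integer matrix; the result has the same shape as `vectors`.
vocab_arr = np.array(my_vocab)
tokens = vocab_arr[vectors]

# Join each row of tokens into a sentence.
sentences = [" ".join(row) for row in tokens]
print(sentences)
```

If the sequences are padded, the rows will contain empty strings for index 0, so filter those out before joining.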