How do I create a Keras Embedding layer from a pre-trained word embedding dataset?
How do I load pre-trained word embeddings into a Keras Embedding layer?
I downloaded glove.6B.50d.txt (from the glove.6B.zip file at https://nlp.stanford.edu/projects/glove/) and I'm not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/
There is a great blog post describing how to create an embedding layer from pre-trained word vector embeddings:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
The code for the article above can be found here:
https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py
Another good blog post for the same purpose: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
You need to pass an embedding matrix to the Embedding layer as follows:
Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
- vocabLen: number of tokens in your vocabulary
- embDim: embedding vector dimension (50 in your example)
- embeddingMatrix: embedding matrix built from glove.6B.50d.txt
- isTrainable: whether you want the embeddings to be trainable, or want to freeze the layer
glove.6B.50d.txt is a list of whitespace-separated values: a word token followed by its (50) embedding values, e.g. the 0.418 0.24968 -0.41242 ...
To create a pretrainedEmbeddingLayer from the GloVe file:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Prepare Glove File
def readGloveFile(gloveFile):
    with open(gloveFile, 'r') as f:
        wordToGlove = {}  # map from a token (word) to its GloVe embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token

        for line in f:
            record = line.strip().split()
            token = record[0]  # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64)  # associate the GloVe embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # 0 is reserved for masking in Keras (see above)
            wordToIndex[tok] = kerasIdx  # associate an index to a token (word)
            indexToWord[kerasIdx] = tok  # associate a token (word) to an index. Note: inverse of the dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any GloVe dimension (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word]  # create embedding: word index to GloVe word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...
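If it helps, here is a rough sketch (my addition, not part of the answer above) of how raw sentences could be mapped to padded index sequences with wordToIndex before being fed to the model; the whitespace tokenization, the sentencesToIndices helper and the maxLen value are illustrative assumptions only.

from keras.preprocessing.sequence import pad_sequences

# hypothetical helper: map raw sentences to index sequences using wordToIndex
def sentencesToIndices(sentences, wordToIndex, maxLen):
    sequences = []
    for sentence in sentences:
        tokens = sentence.lower().split()                       # naive whitespace tokenization (assumption)
        indices = [wordToIndex.get(tok, 0) for tok in tokens]   # 0 = reserved/unknown, maps to the zero row
        sequences.append(indices)
    return pad_sequences(sequences, maxlen=maxLen, padding='post')

X = sentencesToIndices(["the cat sat on the mat"], wordToIndex, maxLen=10)
# X can be passed to model.fit / model.predict once the rest of the model is added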
A few years ago I wrote a utility package called embfile to work with "embedding files" (though I published it only in 2020). The use case I wanted to cover is creating a pre-trained embedding matrix to initialize an Embedding layer. I wanted to do it by loading only the word vectors I needed, and as quickly as possible.
It supports various formats:
- .txt (with or without a "header row")
- .bin, the Google Word2Vec format
- .vvm, a custom format I use (it's just a TAR file with the vocabulary, the vectors and metadata in separate files, so the vocabulary can be read entirely in a fraction of a second and the vectors can be randomly accessed).
The package is extensively documented and tested. There are also examples that show how to use it with Keras.
import embfile

with embfile.open(EMBEDDING_FILE_PATH) as f:
    emb_matrix, word2index, missing_words = embfile.build_matrix(
        f,
        words=vocab,     # this could also be a word2index dictionary as well
        start_index=1,   # leave the first row to zeros
    )
This function also handles the initialization of words that are out of the file's vocabulary. By default, it fits a normal distribution on the vectors that were found and uses it to generate new random vectors (this is what AllenNLP did). I'm not sure this feature is still relevant: nowadays you can generate embeddings for unknown words using FastText or whatever.
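As an illustration only (my own sketch, not embfile's actual code), that out-of-vocabulary initialization idea could look roughly like this: estimate the mean and standard deviation of the rows that were found and sample the missing rows from that normal distribution.

import numpy as np

def fill_missing_rows(emb_matrix, missing_indices):
    # fit a normal distribution on the rows that were actually found...
    found_rows = np.delete(emb_matrix, missing_indices, axis=0)
    mean, std = found_rows.mean(), found_rows.std()
    # ...and sample the missing rows from it
    for i in missing_indices:
        emb_matrix[i] = np.random.normal(mean, std, size=emb_matrix.shape[1])
    return emb_matrix

# e.g. missing_indices = [word2index[w] for w in missing_words]
# (assuming word2index also contains the missing words, which were left as zero rows)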
Keep in mind that txt and bin files are essentially sequential files and require a full scan (unless you find all the words you are looking for before the end). That's why I use vvm files, which offer random access to the vectors. One could solve the problem just by indexing sequential files, but embfile doesn't have this feature. Nonetheless, you can convert sequential files to vvm (which is something like creating an index and packing everything into a single file).
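To connect this back to the question, a minimal sketch of plugging the matrix built above into a Keras Embedding layer; the layer call mirrors the first answer and is my assumption, not part of embfile itself.

from keras.layers import Embedding

vocab_len, emb_dim = emb_matrix.shape
embedding_layer = Embedding(vocab_len, emb_dim,
                            weights=[emb_matrix],
                            trainable=False)  # freeze the pretrained vectors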
I was searching for something similar. I found this blog post which answers the question. It correctly explains how to create an embedding_matrix and pass it to the Embedding() layer.
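For reference, the pattern that post describes typically looks something like the sketch below; the Tokenizer-based vocabulary, the load_glove helper and the placeholder texts are my own assumptions rather than code taken from the linked post.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

def load_glove(path):
    # hypothetical helper: parse glove.6B.50d.txt into {word: vector}
    vectors = {}
    with open(path, 'r') as f:
        for line in f:
            parts = line.strip().split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

texts = ["the cat sat on the mat"]        # placeholder training sentences (assumption)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index         # word -> index, indices start at 1

glove = load_glove("/path/to/glove.6B.50d.txt")
emb_dim = 50
embedding_matrix = np.zeros((len(word_index) + 1, emb_dim))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]  # rows for words missing from GloVe stay zero

embedding_layer = Embedding(len(word_index) + 1, emb_dim,
                            weights=[embedding_matrix],
                            trainable=False)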