Embedding 3D data in PyTorch
I want to implement character-level embeddings.

This is the usual word embedding:

Word embedding

Input: [ ['who', 'is', 'this'] ]
-> [ [3, 8, 2] ]       # (batch_size, sentence_len)
-> // Embedding(Input) # (batch_size, sentence_len, embedding_dim)
This is what I want to do:

Character embedding

Input: [ [ ['w', 'h', 'o', 0], ['i', 's', 0, 0], ['t', 'h', 'i', 's'] ] ]
-> [ [ [2, 3, 9, 0], [11, 4, 0, 0], [21, 10, 8, 9] ] ] # (batch_size, sentence_len, word_len)
-> // Embedding(Input)                                 # (batch_size, sentence_len, word_len, embedding_dim)
-> // sum over each word's character embeddings        # (batch_size, sentence_len, embedding_dim)

The final output shape is the same as the word embedding's, because I want to concatenate them later.

I tried, but I am not sure how to implement this kind of 3-D embedding. Do you know how to do it for data like this?
def forward(self, x):
    print('x', x.size())  # (N, seq_len, word_len)
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    embd_list = []
    for i, elm in enumerate(x):
        tmp = torch.zeros(1, word_len, self.embd_size)
        for chars in elm:
            tmp = torch.add(tmp, 1.0, self.embedding(chars.unsqueeze(0)))
The code above raises an error because the output of self.embedding is a Variable:
TypeError: torch.add received an invalid combination of arguments - got (torch.FloatTensor, float, Variable), but expected one of:
* (torch.FloatTensor source, float value)
* (torch.FloatTensor source, torch.FloatTensor other)
* (torch.FloatTensor source, torch.SparseFloatTensor other)
* (torch.FloatTensor source, float value, torch.FloatTensor other)
didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)
* (torch.FloatTensor source, float value, torch.SparseFloatTensor other)
didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)
Update

I got it working, but the nested for loops over the batch are inefficient. Do you know a more efficient way?
def forward(self, x):
    print('x', x.size())  # (N, seq_len, word_len)
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    embd = Variable(torch.zeros(bs, seq_len, self.embd_size))
    for i, elm in enumerate(x):          # every sample
        for j, chars in enumerate(elm):  # every word, e.g. [['w','h','o',0], ['i','s',0,0], ['t','h','i','s']]
            chars_embd = self.embedding(chars.unsqueeze(0))  # (1, word_len, embd_size), e.g. ['w','h','o',0]
            chars_embd = torch.sum(chars_embd, 1)            # (1, embd_size); sum each char's embedding
            embd[i, j] = chars_embd[0]   # use the summed char embeddings as a word-like embedding
    x = embd  # (N, seq_len, embd_size)
Update 2

Here is my final code. Thank you, Wasi Ahmad!
def forward(self, x):
    # x: (N, seq_len, word_len)
    input_shape = x.size()
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    x = x.view(-1, word_len)      # (N*seq_len, word_len)
    x = self.embedding(x)         # (N*seq_len, word_len, embd_size)
    x = x.view(*input_shape, -1)  # (N, seq_len, word_len, embd_size)
    x = x.sum(2)                  # (N, seq_len, embd_size)
    return x
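Putting this final version together as a runnable sketch (the module name CharEmbedder and the vocabulary/embedding sizes are illustrative, not from the original post; padding_idx=0 is assumed so that the 0-padding characters stay at a zero vector and do not distort the sum):

```python
import torch
import torch.nn as nn

class CharEmbedder(nn.Module):
    """Sums character embeddings into a word-level representation."""
    def __init__(self, vocab_size, embd_size, pad_idx=0):
        super().__init__()
        self.embd_size = embd_size
        # padding_idx pins index 0 to a zero vector, so padded
        # character slots contribute nothing to the sum below
        self.embedding = nn.Embedding(vocab_size, embd_size, padding_idx=pad_idx)

    def forward(self, x):
        # x: (N, seq_len, word_len) of character indices
        input_shape = x.size()
        word_len = x.size(2)
        x = x.view(-1, word_len)      # (N*seq_len, word_len)
        x = self.embedding(x)         # (N*seq_len, word_len, embd_size)
        x = x.view(*input_shape, -1)  # (N, seq_len, word_len, embd_size)
        return x.sum(2)               # (N, seq_len, embd_size)

model = CharEmbedder(vocab_size=30, embd_size=8)
# the index example from the question: (batch=1, sentence_len=3, word_len=4)
chars = torch.tensor([[[2, 3, 9, 0], [11, 4, 0, 0], [21, 10, 8, 9]]])
out = model(chars)
print(out.shape)  # torch.Size([1, 3, 8])
```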
I assume you have a 3D tensor of shape BxSxW, where:

B = Batch size
S = Sentence length
W = Word length

and you have declared the embedding layer as follows:

self.embedding = nn.Embedding(dict_size, emsize)

where:

dict_size = No. of unique characters in the training corpus
emsize = Expected size of embeddings

So, you now need to reshape the 3D tensor of shape BxSxW into a 2D tensor of shape BSxW and feed it to the embedding layer:

emb = self.embedding(input_rep.view(-1, input_rep.size(2)))

The shape of emb will be BSxWxE, where E is the embedding size. You can convert the resulting 3D tensor back into a 4D tensor as follows:

emb = emb.view(*input_rep.size(), -1)

The final shape of emb will be BxSxWxE, which is what you expect.
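To check the shapes concretely (the sizes below are made up for illustration), here is the view/view round trip with small numbers. As a side note, recent PyTorch versions let nn.Embedding take an index tensor of any shape directly, returning (*input.shape, E), so the explicit reshape is optional there:

```python
import torch
import torch.nn as nn

B, S, W, E = 2, 3, 4, 5  # batch, sentence length, word length, embedding size
dict_size = 30

embedding = nn.Embedding(dict_size, E)
input_rep = torch.randint(0, dict_size, (B, S, W))  # BxSxW character indices

emb = embedding(input_rep.view(-1, input_rep.size(2)))  # BSxWxE
print(emb.shape)  # torch.Size([6, 4, 5])

emb = emb.view(*input_rep.size(), -1)                   # BxSxWxE
print(emb.shape)  # torch.Size([2, 3, 4, 5])

# Recent PyTorch also accepts the 3D index tensor as-is:
direct = embedding(input_rep)                           # BxSxWxE
assert direct.shape == emb.shape
```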
What you are looking for is already implemented in the allennlp TimeDistributed layer.

Here is a demonstration:

from allennlp.modules.time_distributed import TimeDistributed

batch_size = 16
sent_len = 30
word_len = 5

Consider an input sentence of character indices:

sentence = torch.randint(0, char_vocab_size, (batch_size, sent_len, word_len))  # suppose this is your data

Define a character embedding layer (assuming you have also padded the input):

char_embedding = torch.nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=char_pad_idx)

Wrap it up!

embedding_sentence = TimeDistributed(char_embedding)(sentence)  # shape: (batch_size, sent_len, word_len, char_emb_dim)

embedding_sentence has shape (batch_size, sent_len, word_len, char_emb_dim).

In fact, you can easily define such a module yourself in PyTorch.
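A minimal sketch of such a module (a simplified stand-in written for this answer, not allennlp's actual implementation): fold every leading dimension into the batch dimension, apply the wrapped module, then restore the leading dimensions.

```python
import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    """Applies `module` along the last input dimension by folding
    all leading dimensions into one batch dimension first."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        lead_shape = x.shape[:-1]                     # e.g. (batch, sent_len)
        out = self.module(x.reshape(-1, x.size(-1)))  # (batch*sent_len, word_len, ...)
        return out.reshape(*lead_shape, *out.shape[1:])

char_vocab_size, char_emb_dim = 50, 16
batch_size, sent_len, word_len = 16, 30, 5

char_embedding = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
sentence = torch.randint(0, char_vocab_size, (batch_size, sent_len, word_len))

embedded = TimeDistributed(char_embedding)(sentence)
print(embedded.shape)  # torch.Size([16, 30, 5, 16])
```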