TensorFlow LSTM example input format batches2string

Question

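I am working through Udacity's LSTM tutorial, but I am having trouble understanding the input data format for the LSTM. https://github.com/rndbrtrnd/udacity-deep-learning/blob/master/6_lstm.ipynb

Can someone explain what num_unrollings is in the code below? Or how the training batches for the LSTM model are generated?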

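import numpy as np

# vocabulary_size, char2id, id2char, train_text and valid_text are defined
# earlier in the notebook.
batch_size=64
num_unrollings=10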

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()

  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch

  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))

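I understand that there is a cursor. But why do we discard the rest of the text, apart from the first 10 characters (num_unrollings), in each of the 64 batches?

Feel free to point me to any resources or examples that would help me understand the input format. Thanks!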

Answer 1

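Remember that the purpose of the RNN you are training is to predict the next character in a string, and that you have one LSTM for each character position. In the notebook code leading up to the part you quote, they also map characters to numbers: ' ' is 0, 'a' is 1, 'b' is 2, and so on. This is then translated into a 1-hot encoding: ' ', which is 0, is encoded as [1 0 0 0 ... 0]; 'a', which is 1, as [0 1 0 0 ... 0]; and 'b' as [0 0 1 0 0 ... 0]. In my explanation below I skip this mapping for clarity, so all my characters should really be numbers, or actually 1-hot encodings.

Let me start with the simpler case where batch_size = 1 and num_unrollings = 1. Let us also say your training data is "anarchists advocate social relations based upon voluntary association of autonomous individuals mutu"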
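The mapping described above can be sketched as follows. This is only an illustration under an assumed vocabulary of a space plus the 26 lowercase letters; char_to_id and one_hot are hypothetical names for this example and are not the notebook's own helpers.

```python
import string

import numpy as np

# Assumed vocabulary: a space plus the 26 lowercase letters (27 symbols).
VOCABULARY_SIZE = len(string.ascii_lowercase) + 1


def char_to_id(char):
    """Map ' ' to 0 and 'a'..'z' to 1..26 (hypothetical helper)."""
    if char == ' ':
        return 0
    if char in string.ascii_lowercase:
        return ord(char) - ord('a') + 1
    raise ValueError('character outside the assumed vocabulary: %r' % char)


def one_hot(char):
    """Return a length-27 vector with a single 1.0 at the character's id."""
    vector = np.zeros(VOCABULARY_SIZE, dtype=float)
    vector[char_to_id(char)] = 1.0
    return vector


# Show the id and the first four entries of the encoding for ' ', 'a', 'b'.
for c in ' ab':
    print(repr(c), char_to_id(c), one_hot(c)[:4])
```

Each row of a batch in the question's code is one such vector, which is why characters() can recover the character with np.argmax.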

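In this case your first character is the 'a' in "anarchists", and the expected output (label) is 'n'. In the code this is represented by the return value of next(): batches = [ ['a'], ['n'] ], where the first element of the list is the input and the last element is the label. This is then fed into the RNN at step 0. In the next step the input is 'n' and the label is 'a' (the third letter of 'anarchi...'), so the next batches = [ ['n'], ['a'] ]. In the third step batches = [ ['a'], ['r'] ], and so on. Note how the last element of the list (self._last_batch) becomes the first element of the list in the next time step (batches = [self._last_batch]).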

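That is the case for num_unrollings = 1. If num_unrollings = 5, then each time step advances num_unrollings = 5 LSTM cells at once instead of just one. The next function therefore has to provide the inputs for 5 LSTM cells, i.e. the 5 characters 'a', 'n', 'a', 'r', 'c', together with the corresponding labels 'n', 'a', 'r', 'c', 'h'. Note that the last four input characters are the same as the first four labels, so for memory efficiency this is encoded as a list of the first 6 characters, i.e.

batches = [ ['a'], ['n'], ['a'], ['r'], ['c'], ['h'] ]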

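The first 5 characters are the inputs and the last 5 characters are the labels. The next call to next returns the inputs and labels for the next 5 LSTM cells:

batches = [ ['h'], ['i'], ['s'], ['t'], ['s'], [' '] ]

Note that 'h' appears in this list as well, because previously it was used only as a label and now it is used only as an input.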
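To make the overlap between inputs and labels concrete, here is a minimal sketch for a single sequence, using plain characters instead of 1-hot vectors. unrolled_windows is a hypothetical helper written for this illustration; unlike the notebook's generator it does not wrap around at the end of the text.

```python
def unrolled_windows(text, num_unrollings):
    """Yield successive windows of num_unrollings + 1 characters.

    Each window starts with the last character of the previous one, so that
    character is used once as a label and then once as an input.
    """
    start = 0
    while start + num_unrollings < len(text):
        yield list(text[start:start + num_unrollings + 1])
        start += num_unrollings


text = "anarchists advocate"
for window in list(unrolled_windows(text, 5))[:2]:
    # Inputs are all but the last character; labels are shifted by one.
    inputs, labels = window[:-1], window[1:]
    print(inputs, '->', labels)
```

Storing num_unrollings + 1 characters and slicing twice avoids keeping separate input and label lists that would share all but one element.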

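If batch_size > 1, you feed several sequences into the RNN update step at the same time. Note that the cursor here is not a single cursor but a list of cursors, one per sequence. Now consider batch_size = 2. In the example above the training data is the 100 characters "anarchists advocate social relations based upon voluntary association of autonomous individuals mutu", and the second text sequence simply starts in the middle, at "luntary association of autonomous individuals mutu". The batches in the first step therefore contain the information ['a', 'n', 'a', 'r', 'c', 'h'] and ['l', 'u', 'n', 't', 'a', 'r'], i.e. the first 6 characters from the start and the first 6 characters from the middle. But they are organized as follows (transposed):

batches = [ ['a', 'l'], ['n', 'u'], ['a', 'n'], ['r', 't'], ['c', 'a'], ['h', 'r'] ]

The batches for the second time step contain the information ['h', 'i', 's', 't', 's', ' '] and ['r', 'y', ' ', 'a', 's', 's'], again transposed:

batches = [ ['h', 'r'], ['i', 'y'], ['s', ' '], ['t', 'a'], ['s', 's'], [' ', 's'] ]

and so on.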
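The batch_size = 2 example can be reproduced with a stripped-down version of the question's BatchGenerator. This is a sketch for illustration only: CharBatchGenerator is a hypothetical class that keeps the cursor logic but emits characters instead of 1-hot rows.

```python
class CharBatchGenerator(object):
    """Simplified BatchGenerator that emits characters instead of 1-hot rows."""

    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._num_unrollings = num_unrollings
        segment = len(text) // batch_size
        # One cursor per sequence, spaced evenly through the text.
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        # Read one character per cursor, then advance every cursor by one.
        batch = [self._text[c] for c in self._cursor]
        self._cursor = [(c + 1) % len(self._text) for c in self._cursor]
        return batch

    def next(self):
        # Start from the previous call's last batch, then add new ones.
        batches = [self._last_batch]
        for _ in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches


text = ("anarchists advocate social relations based upon voluntary "
        "association of autonomous individuals mutu")
generator = CharBatchGenerator(text, batch_size=2, num_unrollings=5)
first = generator.next()
second = generator.next()
print(first)
# Transposing with zip recovers one string per sequence, as batches2string does.
print([''.join(chars) for chars in zip(*first)])
print([''.join(chars) for chars in zip(*second)])
```

Because every cursor keeps advancing from call to call, no text is thrown away: each call to next() only returns the next num_unrollings characters of each of the batch_size sequences.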

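The above is the technical answer to what num_unrollings means for batch generation. However, num_unrollings is also the number of characters you go back in the backpropagation part of updating the weights of the RNN. This is because at each time step of the RNN learning algorithm you feed in num_unrollings input characters and only compute the corresponding LSTM cells, while the (hidden) input from the earlier part of the sequence is stored in variables that are not trainable. You can try setting num_unrollings to 1 and see whether long-range correlations become harder to learn. (You may need a lot of time steps.)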
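The role of the stored hidden state can be illustrated with a toy forward pass. This sketch assumes a plain tanh recurrent cell with random weights, not the notebook's LSTM, and it does not compute any gradients. It only shows that processing a sequence in windows of num_unrollings steps gives the same final state as one long pass, provided the state is carried from one window to the next.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, num_unrollings = 4, 3, 5

# Hypothetical toy recurrent cell (a plain tanh RNN, not the notebook's LSTM).
W_x = rng.normal(size=(input_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))


def run_window(inputs, state):
    """Advance the cell over one window and return the final state."""
    for x in inputs:
        state = np.tanh(x @ W_x + state @ W_h)
    return state


sequence = rng.normal(size=(20, input_size))

# Process the whole sequence in one pass.
full_state = run_window(sequence, np.zeros(hidden_size))

# Process it in windows of num_unrollings steps, carrying the state across.
saved_state = np.zeros(hidden_size)
for start in range(0, len(sequence), num_unrollings):
    window = sequence[start:start + num_unrollings]
    # In training, gradients would be computed inside this window only;
    # saved_state enters as a constant from the previous window.
    saved_state = run_window(window, saved_state)

print(np.allclose(full_state, saved_state))
```

The forward computation is unaffected by the window size. What num_unrollings limits is how far back the weight update can assign credit, which is why a very small value may make long-range structure harder to learn.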

deep-learning

lstm

tensorflow