LSTM Autoencoder problems

Question

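TLDR:

The autoencoder underfits the time series reconstruction and just predicts the average value.

Question set-up:

Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf

Encoder: Standard LSTM layer. The input sequence is encoded in the final hidden state.

Decoder: LSTM Cell (I think!). Reconstructs the sequence one element at a time, starting with the last element x[N].

The decoding algorithm for a sequence of length N is as follows:

Get the decoder's initial hidden state hs[N]: just use the encoder's final hidden state.
Reconstruct the last element in the sequence: x[N] = w.dot(hs[N]) + b.
Same pattern for the other elements: x[i] = w.dot(hs[i]) + b
Use x[i] and hs[i] as inputs to the LSTMCell to get x[i-1] and hs[i-1]

Minimum working example:

Here is my implementation, starting with the encoder: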

import torch
import torch.nn as nn


class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super().__init__()

        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True)

    def forward(self, x):
        # Discard the per-step outputs; keep only the final (hidden, cell) states
        _, hs = self.lstm(x)
        return hs

Decoder class:

class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super().__init__()

        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        # Collect outputs in a list so they stay on the model's device
        # (concatenating onto a CPU-created torch.tensor([]) may fail on CUDA)
        outputs = []

        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0

        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        outputs.append(x_i)

        # reconstruct remaining elements, feeding each output back in
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            outputs.append(x_i)

        # Shape (seq_len * batch, n_features); only meaningful for batch size 1
        return torch.cat(outputs)

Bringing the two together:

class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size

        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)
    
    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)

And here's my training function:

import torch.optim as optim
from tqdm import tqdm

def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
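                for x in testload:
                    x = x.to(device).float()
                    x_hat = model(x)
                    if reverse:
                        # keep the validation target consistent with training
                        x = torch.flip(x, [1])
                    loss = criterion(x_hat, x)
                    running_vl += loss.item()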
                valid_loss.append(running_vl / len(testload))
            model.train()
            
        train_loss.append(running_tl / len(trainload))
    
    return train_loss, valid_loss

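Data:

A large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and contains only 5 variables. Here is the sequence I'm trying to overfit: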

tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
        [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
        [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
        [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
        [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
        [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
        [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
        [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
        [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
        [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
        [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
        [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
        [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
        [0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)

And here is the custom Dataset class:

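import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        # NOTE: overlap is accepted but currently unused
        super().__init__()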
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else: 
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        
        self.n_features = n_features
        self.seqs = torch.split(data, window)
        
    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, idx):
        try:    
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
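
One detail of the Dataset above that may be worth checking is how torch.split behaves when the number of timesteps is not a multiple of window. The short sketch below uses made-up data (10 timesteps, 2 features) and a hypothetical filter for full windows; it is an illustration only, not part of the original code.

```python
import torch

# Toy data: 10 timesteps, 2 features (values are arbitrary)
data = torch.arange(20, dtype=torch.float32).reshape(10, 2)
window = 4

# torch.split cuts along dim 0; the last chunk holds whatever is left over
seqs = torch.split(data, window)
lengths = [len(s) for s in seqs]
print(lengths)  # [4, 4, 2]

# Hypothetical guard: keep only full windows so every item has the same shape
full_seqs = [s for s in seqs if len(s) == window]
print(len(full_seqs))  # 2
```

A shorter final window is harmless with a batch size of 1, but it would likely break default batching in a DataLoader with a larger batch size, since the sequences could no longer be stacked.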

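Problem:

The model only learns the average, no matter how complex I make the model or how long I train it.

Predicted/Reconstruction:

Actual:

My research:

This problem appears identical to the one discussed in a related question.

The problem in that case ended up being that the objective function was averaging the target time series before calculating the loss. This was due to some broadcasting errors, because the author didn't pass correctly sized inputs to the objective function.

In my case, I do not see this being the issue. I have checked and double-checked that all of my dimensions/sizes line up. I am at a loss.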
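
For readers who want to rule out the broadcasting issue themselves, here is a minimal sketch of how a shape mismatch can silently reward predicting the mean. The tensors are toy values chosen for illustration and are not taken from the dataset above.

```python
import warnings

import torch
import torch.nn.functional as F

target = torch.tensor([1.0, 2.0, 3.0, 4.0])   # shape (4,)
perfect = target.clone().unsqueeze(1)         # shape (4, 1): a perfect prediction
mean_pred = torch.full((4, 1), 2.5)           # shape (4, 1): always predict the mean

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # PyTorch warns about the size mismatch
    # (4, 1) against (4,) broadcasts to (4, 4): every prediction is
    # compared with every target
    bad_perfect = F.mse_loss(perfect, target).item()
    bad_mean = F.mse_loss(mean_pred, target).item()

good_perfect = F.mse_loss(perfect.squeeze(1), target).item()

print(bad_perfect)   # 2.5
print(bad_mean)      # 1.25
print(good_perfect)  # 0.0
```

With mismatched shapes the constant mean prediction scores better than the perfect one, so asserting x_hat.shape == x.shape right before the loss call is a cheap way to exclude this failure mode.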

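Other things I've tried

I've tried this with varied sequence lengths, from 7 timesteps to 100 timesteps.
I've tried with a varied number of variables in the time series, from univariate all the way to all 274 variables that the data contains.
I've tried various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using flipud on the original input (after the forward pass but before calculating the loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space, from 50% to 150% of the number of input features.
I've tried overfitting a single sequence (provided in the Data section above).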
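
A plausible reason the reduction setting made no difference: for tensors of fixed shape, sum and mean differ only by a constant factor (the number of elements), so the gradients point in the same direction, and an adaptive optimizer such as Adam largely cancels the change in scale. The sketch below checks the factor on random tensors shaped like the debugging sequence; the random values are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 14, 5)      # placeholder target, same shape as the debug sequence
x_hat = torch.rand(1, 14, 5)  # placeholder reconstruction

loss_mean = nn.MSELoss(reduction="mean")(x_hat, x)
loss_sum = nn.MSELoss(reduction="sum")(x_hat, x)

# The ratio equals the number of elements: 1 * 14 * 5 = 70
ratio = (loss_sum / loss_mean).item()
print(ratio)
```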

Question:

What is causing my model to predict the average, and how do I fix it?

Answer 1

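Okay, after some debugging I think I know the reasons.

TLDR

You try to predict the next timestep value instead of the difference between the current timestep and the previous one
Your hidden_features number is too small, making the model unable to fit even a single sample

Analysis

Code used

Let's start with the code (the model is the same):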

import torch
import seaborn as sns
import matplotlib.pyplot as plt

def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)

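    if subtract:
        # clone() is needed: indexing returns a view, which the in-place
        # subtraction below would otherwise overwrite
        initial_values = input_tensor[:, 0, :].clone()
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values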
    return input_tensor


if __name__ == "__main__":
    torch.manual_seed(0)

    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break

    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()

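What it does:

get_data either returns the data you provided unchanged (subtract=False) or, if subtract=True, subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity and its increase help, and what happens when we use the difference of timesteps instead of the timesteps themselves)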
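
The differencing step can also be written without in-place operations, and it is invertible, so a model trained on differences can still produce reconstructions in the original scale. The sketch below is one way to do this; the helper names to_differences and from_differences are hypothetical and do not appear in the code above.

```python
import torch


def to_differences(x):
    """x: (batch, seq_len, n_features). Keeps step 0 and replaces every
    later step t with x[t] - x[t-1]."""
    first = x[:, :1, :]
    deltas = x[:, 1:, :] - x[:, :-1, :]
    return torch.cat([first, deltas], dim=1)


def from_differences(d):
    """Inverse of to_differences: a cumulative sum along the time axis."""
    return torch.cumsum(d, dim=1)


# First three timesteps and two features of the debug sequence
x = torch.tensor([[[0.5122, 0.0360],
                   [0.5177, 0.0833],
                   [0.4643, 0.0364]]])

d = to_differences(x)
x_back = from_differences(d)
print(torch.allclose(x_back, x, atol=1e-6))  # True
```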

We will only vary the HIDDEN_SIZE and SUBTRACT parameters!

No subtraction, small model

HIDDEN_SIZE=5
SUBTRACT=False

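In this case we get a straight line. The model is unable to fit and grasp the phenomena present in the data (hence the flat lines you mentioned).

The 1000-iteration limit was reached.

Subtraction, small model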

HIDDEN_SIZE=5
SUBTRACT=True

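The targets are now far from flat lines, but the model is unable to fit them because its capacity is too small.

The 1000-iteration limit was reached.

No subtraction, larger model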

HIDDEN_SIZE=100
SUBTRACT=False

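It got a lot better, and our target was hit after 942 steps. No more flat lines; the model capacity seems quite fine (for this single example!)

Subtraction, larger model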

HIDDEN_SIZE=100
SUBTRACT=True

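Although the graph does not look that pretty, we reached the desired loss after only 215 iterations.

Finally

Usually, use the difference of timesteps instead of the timesteps themselves (or some other transformation). Otherwise the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies
Use a larger model (for the whole dataset you should try something like 300, I think), but you can simply tune that value.
Don't use flipud. Use bidirectional LSTMs; this way you get information from both the forward and backward passes of the LSTM (not to be confused with backprop!). This should also boost your score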
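
As a rough illustration of the bidirectional suggestion, here is a minimal sketch of an encoder variant. The class name BiSeqEncoderLSTM and the choice to concatenate the two directions' final states are assumptions made for this example, not something specified in the answer.

```python
import torch
import torch.nn as nn


class BiSeqEncoderLSTM(nn.Module):
    """Hypothetical bidirectional variant of the question's encoder."""

    def __init__(self, n_features, latent_size):
        super().__init__()
        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True,
            bidirectional=True)

    def forward(self, x):
        # h and c have shape (2, batch, latent_size): one slice per direction
        _, (h, c) = self.lstm(x)
        # Concatenate forward and backward states: (batch, 2 * latent_size)
        h = torch.cat([h[0], h[1]], dim=1)
        c = torch.cat([c[0], c[1]], dim=1)
        return h, c


encoder = BiSeqEncoderLSTM(n_features=5, latent_size=8)
h, c = encoder(torch.randn(1, 14, 5))
print(h.shape, c.shape)  # torch.Size([1, 16]) torch.Size([1, 16])
```

Because the concatenated state is twice as wide, a decoder built on nn.LSTMCell would need its hidden size set to 2 * latent_size under this design.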

Questions

Okay, question 1: You are saying that for variable x in the time series, I should train the model to learn x[i] - x[i-1] rather than the value of x[i]? Am I correctly interpreting?

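Yes, exactly. Taking the difference removes the neural network's urge to base its predictions too heavily on the past timestep (by simply taking the last value and maybe changing it a little)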

Question 2: You said my calculations for zero bottleneck were incorrect. But, for example, let's say I'm using a simple dense network as an auto encoder. Getting the right bottleneck indeed depends on the data. But if you make the bottleneck the same size as the input, you get the identity function.

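Yes, assuming that no non-linearity is involved; non-linearities make this harder. LSTMs do contain non-linearities, so that's one point.

Another is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden state and cell state, which is highly unlikely.

One last point: depending on the length of the sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), which makes it even more unlikely.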

Is num_features * num_timesteps not a bottle neck of the same size as the input, and therefore shouldn't it facilitate the model learning the identity?

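It is, but that assumes every data point has num_timesteps timesteps, which is rarely the case (though it might be here). As for the identity and why it is hard for a network with non-linearities to learn, that was answered above.

One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed. A network could converge to the identity and make "small fixes" to the output without them, which is not the case.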

I'm curious about the statement: "always use difference of timesteps instead of timesteps". It seems to have some normalizing effect by bringing all the features closer together, but I don't understand why this is key. Having a larger model seemed to be the solution, and the subtraction is just helping.

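The key here was, indeed, increasing the model capacity. The subtraction trick really depends on the data. Let's imagine an extreme situation:

We have 100 timesteps and a single feature
The initial timestep value is 1000
The other timestep values vary by 1 at most

What would the neural network do (what is the easiest thing here)? It would probably discard this change of 1 or less as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.

What if we subtract? The whole neural network loss is in the [0, 1] range for each timestep instead of [0, 1001], hence being wrong is more severe.

And yes, it is connected to normalization in some sense.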
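
The extreme case can be put into numbers with a small sketch. It assumes a synthetic series of values in [1000, 1001) and a hypothetical relative_error helper (squared error divided by the mean squared target); neither comes from the original answer.

```python
import torch


def relative_error(pred, target):
    """Mean squared error divided by the mean squared magnitude of the target."""
    return (((pred - target) ** 2).mean() / (target ** 2).mean()).item()


torch.manual_seed(0)
# Synthetic series: 100 timesteps, values in [1000, 1001)
series = 1000.0 + torch.rand(100, dtype=torch.float64)

# Raw values: the lazy baseline always predicts 1000
rel_raw = relative_error(torch.full_like(series, 1000.0), series)

# Differences: the equivalent lazy baseline always predicts "no change"
diffs = series[1:] - series[:-1]
rel_diff = relative_error(torch.zeros_like(diffs), diffs)

print(rel_raw)   # tiny, well below 1e-6
print(rel_diff)  # 1.0: the baseline explains none of the signal
```

On the raw scale the constant prediction looks almost perfect, while on the differenced scale the same laziness misses the entire signal, which is one way to read the normalization remark above.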

python

neural-network

autoencoder

lstm

pytorch