What is a memory-efficient method to spread then gather columns? (see example)

Question

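I am trying to rearrange my data for downstream processing. I found a way to do what I want, but it is memory-intensive, and I believe there is a more efficient way.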

Here is a sample of the data:

   X.1 Label       X
81  81    21 367.138
82  82    21 384.295
83  83    21 159.496
84  84    21 269.927
85  85    22 364.118
86  86    22 154.475
87  87    22 265.861

I want to rearrange the data to create a table of X values with one row per individual object (one row per Label), as follows:

        1       2       3       4
1 367.138 384.295 159.496 269.927
2 364.118 154.475 265.861      NA

I can do this just fine using the spread, apply, and ldply functions, as shown below:

X <- apply(tidyr::spread(X, Label, X), 2, function(x) na.omit(x))
X <- X[-1]
X <- plyr::ldply(X, rbind)
X <- as.data.frame(X[-1])

Here is the problem: the spread function generates the following table as an intermediate step:

  X.1       1       2
1  81 367.138      NA
2  82 384.295      NA
3  83 159.496      NA
4  84 269.927      NA
5  85      NA 364.118
6  86      NA 154.475
7  87      NA 265.861

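This is fine for small datasets, but for large datasets the resulting table is very large and I run out of memory, which produces the following error: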

Error: cannot allocate vector of size 8.4 Gb
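
A rough way to see why the intermediate table grows so quickly is to count its cells. The sketch below is only an estimate, assuming each value cell is stored as an 8-byte double and ignoring the X.1 column and any other overhead. The names spread_cells, used_cells and spread_bytes are illustrative.

```r
# Sample data from the question (X.1 omitted, it does not affect the count)
df <- data.frame(
  Label = c(21, 21, 21, 21, 22, 22, 22),
  X = c(367.138, 384.295, 159.496, 269.927, 364.118, 154.475, 265.861)
)

n_rows <- nrow(df)
n_labels <- length(unique(df$Label))

# spread() keeps one row per observation and adds one column per Label,
# so the value part of the intermediate table has n_rows * n_labels cells
spread_cells <- n_rows * n_labels

# Only one cell per original row actually holds a value; the rest are NA
used_cells <- sum(!is.na(df$X))

# Assumed cost of 8 bytes per double cell
spread_bytes <- spread_cells * 8

cat("cells allocated:", spread_cells, " cells used:", used_cells, "\n")
```

With many rows and many distinct labels, the allocated cell count grows as the product of the two while the used cell count stays equal to the number of rows.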

I am sure there must be a more efficient way to do this that does not generate a huge intermediate table. Any ideas?

Answer 1

One option is to use data.table:

dcast(DT, rleid(Label) ~ rowid(Label), value.var = "X")
#   Label       1       2       3       4
#1:     1 367.138 384.295 159.496 269.927
#2:     2 364.118 154.475 265.861      NA
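
To make the formula easier to follow, the sketch below prints the two helper values next to the sample data. rleid() numbers consecutive runs of equal Label values, and rowid() numbers the rows within each Label value. The column names run and pos are illustrative only and are not part of the answer.

```r
library(data.table)

DT <- data.table(
  Label = c(21, 21, 21, 21, 22, 22, 22),
  X = c(367.138, 384.295, 159.496, 269.927, 364.118, 154.475, 265.861)
)

# run: one id per consecutive block of equal Label values (the output row)
# pos: position of each row within its Label value (the output column)
helpers <- DT[, .(Label, X, run = rleid(Label), pos = rowid(Label))]
print(helpers)
```

The reshaped result has one row per run and one column per position, so its size is bounded by the number of runs times the length of the longest group. Note that rowid(Label) counts within each Label value, not within each run, so if the same Label reappears later in the data its positions continue from where they left off.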

Data

library(data.table)
DT <- fread(text = "   X.1 Label       X
  81    21 367.138
  82    21 384.295
  83    21 159.496
  84    21 269.927
  85    22 364.118
  86    22 154.475
  87    22 265.861")
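
If adding data.table is not an option, a similar result can probably be reached in base R by splitting X by Label and padding each group with NA. This is a minimal sketch and not part of the answer above. It assumes the values should be grouped by Label value (not by consecutive run, as rleid does), and the names groups, width, padded and wide are illustrative.

```r
df <- data.frame(
  Label = c(21, 21, 21, 21, 22, 22, 22),
  X = c(367.138, 384.295, 159.496, 269.927, 364.118, 154.475, 265.861)
)

# One numeric vector of X values per distinct Label
groups <- split(df$X, df$Label)

# Pad every vector with NA up to the length of the longest group
width <- max(lengths(groups))
padded <- lapply(groups, `length<-`, width)

# Stack the padded vectors as the rows of a matrix (row names are the Labels)
wide <- do.call(rbind, padded)
colnames(wide) <- seq_len(width)
print(wide)
```

This builds only the final matrix (number of labels by longest group) and never creates a table with one row per observation and one column per label.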
