Data In Long and Wide Format, Need to Convert to Just Long in R

I'm working with a dataset that is partly in wide format and partly in long format. It looks like:

ID week1 week2 week3 ... week12  
1   2     NA     NA  ...  NA  
1   NA    3      NA  ...  NA
1   NA    NA     3   ...  NA
...
1   NA    NA     NA  ...  4
2   4     NA     NA  ...  NA
2   NA    5      NA  ...  NA
2   NA    NA     3   ...  NA

I'm now struggling to convert it to purely long format for analysis. I'd like it to be set up as:

ID week value
1   1    2
1   2    3
1   3    3
...
1   12   4
2   1    4
2   2    5
2   3    3

Can anyone suggest how to do this in R? I've tried reshape2 and dplyr/tidyr, but when I select the ID variable, I always end up with too many observations.

How about this:

library(dplyr)

# small data sample
df <- read.table(text = 'ID week1 week2 week3 week4  
1   2     NA     NA    NA  
1   NA    3      NA    NA
1   NA    NA     3     NA
1   NA    NA     NA    4
2   4     NA     NA    NA
2   NA    5      NA    NA
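2   NA    NA     3     NA', header = TRUE)

# data.table::melt is written for data.tables, so convert the data frame first
df %>% 
   data.table::as.data.table() %>% 
   data.table::melt(id.vars = 'ID') %>% 
   na.omit()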

1) gather: Using wide, shown reproducibly in Note 1 at the end, use gather to convert to long form, then remove the NA rows and sort.

library(dplyr)
library(tidyr)

wide %>%
  gather("week", "value", -ID) %>%
  drop_na %>%
  arrange(ID, week)

giving:

  ID  week value
1  1 week1     2
2  1 week2     3
3  1 week3     3
4  1 week4     4
5  2 week1     4
6  2 week2     5
7  2 week3     3
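
As a side note, gather is superseded in current tidyr releases. Below is a minimal sketch of the same step with pivot_longer, assuming tidyr 1.1.0 or later (for names_transform). It also strips the "week" prefix so that week becomes an integer, as in the desired output; with all 12 weeks this matters, because sorting the character labels would put week10 before week2. The name long_tidy is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same sample input as Note 1, repeated so the block runs on its own
wide <- read.table(text = "
ID week1 week2 week3 week4
1   2     NA     NA   NA
1   NA    3      NA   NA
1   NA    NA     3    NA
1   NA    NA     NA   4
2   4     NA     NA   NA
2   NA    5      NA   NA
2   NA    NA     3    NA", header = TRUE)

long_tidy <- wide %>%
  pivot_longer(
    cols = -ID,                                 # every column except ID is a week
    names_to = "week",
    names_prefix = "week",                      # "week3" -> "3"
    names_transform = list(week = as.integer),  # "3" -> 3L
    values_to = "value",
    values_drop_na = TRUE                       # drop the NA cells
  ) %>%
  arrange(ID, week)

long_tidy
```

Because week is numeric here, arrange(ID, week) sorts weeks in calendar order rather than alphabetically.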

2) reshape: Using only base R:

varying <- list(value = 2:5)
long <- na.omit(reshape(wide, dir = "long", timevar = "week", 
  varying = varying, v.names = names(varying)))[1:3]
long[order(long$ID, long$week), ]

giving:

    ID week value
1.1  1    1     2
2.2  1    2     3
3.3  1    3     3
4.4  1    4     4
5.1  2    1     4
6.2  2    2     5
7.3  2    3     3

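3) data.table: Using varying from (2), we can use melt from data.table. Note that we could specify either id.vars or measure.vars, but a comment pointed out that we may want to generalize this to multiple variables, and the measure.vars approach generalizes.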

library(data.table)
longDT <- na.omit(melt(as.data.table(wide), measure.vars = varying, 
  variable.name = "week"))
setkey(longDT, ID, week)
longDT

giving:

   ID  week value
1:  1 week1     2
2:  1 week2     3
3:  1 week3     3
4:  1 week4     4
5:  2 week1     4
6:  2 week2     5
7:  2 week3     3
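
If week should be a plain number, as in the question's desired output, one possible follow-up is to strip the "week" prefix after melting. This is only a sketch and not part of the approach above; it uses id.vars for brevity and assumes every measured column is named week followed by a number.

```r
library(data.table)

# Same sample input as Note 1, repeated so the block runs on its own
wide <- read.table(text = "
ID week1 week2 week3 week4
1   2     NA     NA   NA
1   NA    3      NA   NA
1   NA    NA     3    NA
1   NA    NA     NA   4
2   4     NA     NA   NA
2   NA    5      NA   NA
2   NA    NA     3    NA", header = TRUE)

longDT <- na.omit(melt(as.data.table(wide), id.vars = "ID",
  variable.name = "week"))

# melt returns week as a factor of column names; turn "week3" into 3L
longDT[, week := as.integer(sub("^week", "", as.character(week)))]

setkey(longDT, ID, week)
longDT
```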

Note 1

The input used, in reproducible form, is:

Lines <- "
ID week1 week2 week3 week4
1   2     NA     NA   NA  
1   NA    3      NA   NA
1   NA    NA     3    NA
1   NA    NA     NA   4
2   4     NA     NA   NA
2   NA    5      NA   NA
2   NA    NA     3    NA"
wide <- read.table(text = Lines, header = TRUE)

Note 2

Regarding multiple variables, melt from data.table supports this. Suppose we have the following:

Lines2 <- "
ID week1var1 week1var2 week2var1 week2var2 week3var1 week3var2 week4var1 week4var2
1 1 2 20 NA NA NA NA NA NA
2 1 NA NA 3 30 NA NA NA NA
3 1 NA NA NA NA 3 30 NA NA
4 1 NA NA NA NA NA NA 4 40
5 2 4 40 NA NA NA NA NA NA
6 2 NA NA 5 50 NA NA NA NA
7 2 NA NA NA NA 3 30 NA NA"
wide2 <- read.table(text = Lines2, header = TRUE)

library(data.table)

varying2 <- split(names(wide2)[-1], 
  sub("(.*\\d)(\\D.*)", "\\2", names(wide2)[-1]))

longDT2 <- na.omit(melt(as.data.table(wide2), measure.vars = varying2, 
  variable.name = "week"))
setkey(longDT2, ID, week)
longDT2

giving:

   ID week var1 var2
1:  1    1    2   20
2:  1    2    3   30
3:  1    3    3   30
4:  1    4    4   40
5:  2    1    4   40
6:  2    2    5   50
7:  2    3    3   30
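
For comparison, tidyr can handle the same multi-variable case. The following is a sketch assuming tidyr 1.1.0 or later; the regular expression passed to names_pattern is an assumption that every measured column is named like week1var1, and long2 is only an illustrative name. The special ".value" entry in names_to tells pivot_longer to use that part of each column name (var1, var2) as an output column.

```r
library(tidyr)

# Same sample input as Lines2 above; the unnamed first field becomes row names
wide2 <- read.table(text = "
ID week1var1 week1var2 week2var1 week2var2 week3var1 week3var2 week4var1 week4var2
1 1 2 20 NA NA NA NA NA NA
2 1 NA NA 3 30 NA NA NA NA
3 1 NA NA NA NA 3 30 NA NA
4 1 NA NA NA NA NA NA 4 40
5 2 4 40 NA NA NA NA NA NA
6 2 NA NA 5 50 NA NA NA NA
7 2 NA NA NA NA 3 30 NA NA", header = TRUE)

long2 <- pivot_longer(
  wide2,
  cols = -ID,
  names_to = c("week", ".value"),             # week number, then variable name
  names_pattern = "week(\\d+)(var\\d+)",      # split "week1var2" into "1" and "var2"
  names_transform = list(week = as.integer),
  values_drop_na = TRUE                       # drop rows where var1 and var2 are both NA
)

long2
```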