Long to wide with automatic dummy creation and multiple value columns

I am working with a data frame that looks like this:

      country year Indicator         a         b        c
48996      US 2003      var1        NA        NA       NA
16953      FR 1988      var2        NA 10664.920       NA
22973      FR 1943      var3        NA  5774.334       NA
8760       CN 1995      var4  8804.565        NA 12750.31
47795      US 2012      var5        NA        NA       NA
30033      GB 1969      var6        NA 29631.362       NA
25796      FR 1921      var7        NA 14004.520       NA
39534      NL 1941      var8        NA        NA       NA
42255      NZ 1969      var8        NA        NA       NA
7249       CN 1995      var9 50635.862        NA 75260.56

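What I want to do is basically a long-to-wide transformation with Indicator as the key variable. I would normally use spread() from the tidyr package. Unfortunately, spread() does not accept multiple value columns (a, b, and c in this case), and it does not quite do what I want, which is something that: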

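  1. makes the entries of Indicator the new columns
  2. keeps the country / year combinations as rows
  3. creates a unique row for each of the old values from a, b, and c
  4. creates a dummy variable for each of the "old" value column names (i.e. a, b, c)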

So in the end, the Chinese (CN) observations in my example should become:

country year var1 [...] var4       [...]   var9       dummy.a dummy.b dummy.c 
CN      1995 NA         8804.565           50635.862        1       0       0
CN      1995 NA         12750.31           75260.56         0       0       1

Since my original data frame is 58,162 x 119, I am hoping for something that does not involve a lot of manual work :-)

I hope I have made clear what I want to achieve. Thanks for your help!


The data frame above can be reproduced with the following code:

structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
"var11", "var12", "var13", "var14", "var15", "var16", "var17", 
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
7249L), class = "data.frame")

Here is my solution:

library(tidyr)
library(dplyr)  # filter() and mutate() come from dplyr
mydf <- structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
    "FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
    2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
    "var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
    "var11", "var12", "var13", "var14", "var15", "var16", "var17", 
    "var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
    NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
    5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
    NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
    )), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
    16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
    7249L), class = "data.frame")

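mydf %>%
  # stack a, b, c into a single value column, remembering the source column
  gather(key = newIndicator, value = values, a, b, c) %>%
  # drop the empty cells
  filter(!is.na(values)) %>%
  # one column per Indicator entry
  spread(key = Indicator, value = values) %>%
  # turn the source column into 0/1 dummy columns
  mutate(indicatorValues = 1) %>%
  spread(newIndicator, indicatorValues, fill = 0)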

Output:

# country year     var2     var3      var4     var6     var7     var9 a b c
# 1      CN 1995       NA       NA  8804.565       NA       NA 50635.86 1 0 0
# 2      CN 1995       NA       NA 12750.306       NA       NA 75260.56 0 0 1
# 3      FR 1921       NA       NA        NA       NA 14004.52       NA 0 1 0
# 4      FR 1943       NA 5774.334        NA       NA       NA       NA 0 1 0
# 5      FR 1988 10664.92       NA        NA       NA       NA       NA 0 1 0
# 6      GB 1969       NA       NA        NA 29631.36       NA       NA 0 1 0

dt is your original data; dt2 is the final output.

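library(dplyr)
library(tidyr)

dt2 <- dt %>%
  gather(Parameter, Value, a:c) %>%
  spread(Indicator, Value) %>%
  # Data = 1 if at least one of var1..var9 is non-missing
  # (9 is the number of Indicator columns in this sample; adjust for the full data)
  mutate(Data = ifelse(rowSums(is.na(.[, paste0("var", 1:9)])) != 9, 1, 0)) %>%
  filter(Data != 0) %>%
  spread(Parameter, Data, fill = 0) %>%
  rename(dummy.a = a, dummy.b = b, dummy.c = c)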
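
For readers on tidyr 1.0.0 or later, where gather() and spread() are superseded by pivot_longer() and pivot_wider(), the same idea might be sketched as below. This is only an illustration under a few assumptions: the small dt here is a three-row stand-in for the question's data, and the helper column names source and flag are made up for the example.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the question's data (one FR row and the two CN rows).
dt <- data.frame(
  country   = c("FR", "CN", "CN"),
  year      = c(1988L, 1995L, 1995L),
  Indicator = c("var2", "var4", "var9"),
  a = c(NA, 8804.565, 50635.862),
  b = c(10664.920, NA, NA),
  c = c(NA, 12750.31, 75260.56)
)

wide <- dt %>%
  # Stack a, b, c into one value column and drop the missing cells.
  # "source" (hypothetical name) records which column each value came from.
  pivot_longer(cols = c(a, b, c), names_to = "source", values_to = "value",
               values_drop_na = TRUE) %>%
  # One column per Indicator entry; rows are country / year / source.
  pivot_wider(names_from = Indicator, values_from = value) %>%
  # "flag" (hypothetical name) is spread into one 0/1 dummy per source column.
  mutate(flag = 1L) %>%
  pivot_wider(names_from = source, values_from = flag,
              names_prefix = "dummy.", values_fill = list(flag = 0L))

print(wide)
```

Note that a dummy column only appears for a source column that has at least one non-missing value, so on the full data it is worth checking that dummy.a, dummy.b and dummy.c are all present.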