Long to wide with automatic dummy creation and multiple value columns
I have a data frame that looks like this:
      country year Indicator         a         b        c
48996      US 2003      var1        NA        NA       NA
16953      FR 1988      var2        NA 10664.920       NA
22973      FR 1943      var3        NA  5774.334       NA
8760       CN 1995      var4  8804.565        NA 12750.31
47795      US 2012      var5        NA        NA       NA
30033      GB 1969      var6        NA 29631.362       NA
25796      FR 1921      var7        NA 14004.520       NA
39534      NL 1941      var8        NA        NA       NA
42255      NZ 1969      var8        NA        NA       NA
7249       CN 1995      var9 50635.862        NA 75260.56
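What I want to do is basically a long-to-wide transformation with Indicator as the key variable. I would normally use spread() from the tidyr package. Unfortunately, spread() does not accept multiple value columns (here a, b, and c), and it does not do exactly what I want to achieve:
- Make the entries of Indicator the new columns
- Keep the country/year combinations as rows
- Create a unique row for each old value from a, b, and c
- Create a dummy variable for each "old" value column name (i.e. a, b, c)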
So in the end, the observations for China (CN) in my example should become:
country year var1 [...]     var4 [...]      var9 dummy.a dummy.b dummy.c
CN      1995   NA       8804.565       50635.862       1       0       0
CN      1995   NA       12750.31        75260.56       0       0       1
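Since my original data frame is 58.162x119, I am hoping for something that does not involve a lot of manual work :-)
I hope I have made clear what I want to achieve. Thanks for your help!
The data frame above can be reproduced with the following code: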
structure(list(country = c("US", "FR", "FR", "CN", "US", "GB",
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L,
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10",
"var11", "var12", "var13", "var14", "var15", "var16", "var17",
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733,
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219,
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA,
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L,
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L,
7249L), class = "data.frame")
Here is my solution:
library(dplyr)  # filter() and mutate() come from dplyr
library(tidyr)  # gather() and spread()
mydf <- structure(list(country = c("US", "FR", "FR", "CN", "US", "GB",
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L,
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10",
"var11", "var12", "var13", "var14", "var15", "var16", "var17",
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733,
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219,
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA,
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L,
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L,
7249L), class = "data.frame")
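mydf %>%
  # stack a, b, c into one value column, remembering the source column
  gather(key = newIndicator, value = values, a, b, c) %>%
  # drop the empty cells so they do not create all-NA rows
  filter(!is.na(values)) %>%
  # one column per Indicator, one row per country/year/source column
  spread(key = Indicator, values) %>%
  # turn the source column into 0/1 dummy columns
  mutate(indicatorValues = 1) %>%
  spread(newIndicator, indicatorValues, fill = 0)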
Output:
# country year var2 var3 var4 var6 var7 var9 a b c
# 1 CN 1995 NA NA 8804.565 NA NA 50635.86 1 0 0
# 2 CN 1995 NA NA 12750.306 NA NA 75260.56 0 0 1
# 3 FR 1921 NA NA NA NA 14004.52 NA 0 1 0
# 4 FR 1943 NA 5774.334 NA NA NA NA 0 1 0
# 5 FR 1988 10664.92 NA NA NA NA NA 0 1 0
# 6 GB 1969 NA NA NA 29631.36 NA NA 0 1 0
Here dt is your original data and dt2 is the final output.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(Parameter, Value, a:c) %>%
  spread(Indicator, Value) %>%
  # Data = 1 if at least one of var1..var9 is non-NA (9 indicator columns in this example)
  mutate(Data = ifelse(rowSums(is.na(.[, paste0("var", 1:9)])) != 9, 1, 0)) %>%
  filter(Data != 0) %>%
  spread(Parameter, Data, fill = 0) %>%
  rename(dummy.a = a, dummy.b = b, dummy.c = c)
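For readers on a recent tidyr: gather() and spread() have been superseded by pivot_longer() and pivot_wider() since tidyr 1.0.0, so the same idea can probably be written as below. This is only a sketch, not either answerer's code. It assumes tidyr 1.1.0 or later (for a scalar values_fill), uses a cut-down stand-in for the question's data, and the helper column names source and flag are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Reduced stand-in for the question's data: one all-NA row, one row with
# a single value, and the two CN 1995 rows (assumption: not the full data).
mydf <- data.frame(
  country   = c("US", "FR", "CN", "CN"),
  year      = c(2003L, 1988L, 1995L, 1995L),
  Indicator = c("var1", "var2", "var4", "var9"),
  a         = c(NA, NA, 8804.565, 50635.862),
  b         = c(NA, 10664.920, NA, NA),
  c         = c(NA, NA, 12750.31, 75260.56)
)

wide <- mydf %>%
  # stack a, b, c; "source" (hypothetical name) records the original column
  pivot_longer(all_of(c("a", "b", "c")),
               names_to = "source", values_to = "value",
               values_drop_na = TRUE) %>%
  # one column per Indicator, one row per country/year/source
  pivot_wider(names_from = Indicator, values_from = value) %>%
  # turn "source" into 0/1 columns named dummy.a, dummy.b, dummy.c
  mutate(flag = 1L) %>%
  pivot_wider(names_from = source, values_from = flag,
              names_prefix = "dummy.", values_fill = 0L)

print(wide)
```

Because names_prefix builds the dummy column names from whatever value columns were stacked, nothing has to be renamed by hand and no column count is hard-coded, which should matter for a frame with 119 columns. Indicators that have no non-NA value at all (var1 here) do not get a column, just as in the gather()/spread() output above.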