How can we apply tidyr::spread() to all categorical variables at once, creating new columns for each level of each categorical variable?

Question

I have a data frame with 3 categorical variables (x, y, z) and an ID column:

library(tidyverse)

# tribble() is the current name for the deprecated frame_data()
df <- tribble(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

I want to apply spread() to each categorical variable, grouped by ID.

The output should look like this:

id  a  b  c  d  v  f
1  1  1  1  1  1  1
2  1  1  0  2  2  0

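I tried this, but I can only do it for one variable at a time, not for all of them together.

For example, applying spread to the y column only (it can be applied to x and z separately in the same way, but not to all three in a single line):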

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

Explaining my code in three steps:

Step 1: Count the frequencies

df %>% count(id,y)    
id     y         n
<dbl> <chr> <int>
1.00   c     1
1.00   d     1
2.00   d     2

Step 2: Apply spread()

df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1  1.00     1     1
2  2.00    NA     2

Step 3: Add fill = 0 to replace the NA, which means that c occurs zero times in the y column for id 2 (as you can see in df)

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

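Problem: In my actual dataset I have 20 such categorical variables, and I cannot handle them one by one. I am looking to do all of this at once. Is it possible to apply spread() from tidyr to all categorical variables at once? If not, could you please suggest an alternative?

Note: I also tried these answers, but they did not help in this particular case: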

Is it possible to use spread on multiple columns in tidyr similar to dcast?
Can spread() in tidyr spread across multiple value?

Another related, useful question:

Two categorical columns (for example, in a survey dataset) may share the same values, as shown below.

df <- tribble(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive, 
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

# A tibble: 4 x 3
     id Do_you_Watch_TV Do_you_Drive
  <dbl> <chr>           <chr>
1  1.00 yes             yes
2  1.00 yes             no
3  2.00 yes             no
4  2.00 no              yes

Running the code below does not differentiate between the yes and no counts of 'Do_you_Watch_TV' and 'Do_you_Drive':

df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
id no yes
1  1   3
2  2   2

Whereas the expected output should be:
id Do_you_Watch_TV_no   Do_you_Watch_TV_yes  Do_you_Drive_no   Do_you_Drive_yes
1         0               2                    1                 1
2         1               1                    1                 1

So we need to treat the no and yes values from Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.

How can we achieve this?

Thanks

Answer 1

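You need to convert your data frame to long format before you can actually transform it to wide format. So start by using tidyr::gather to convert the data frame to long format. After that, you have a couple of choices:

Option #1: Using tidyr::spread: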

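library(tidyverse)

# data (tribble() replaces the deprecated frame_data())
df <- tribble(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

# stack x, y, z into one column, count each level per id, then spread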
df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2
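
To see why the gather step matters, it can help to inspect the intermediate tables that the pipeline above passes along. The following is a minimal sketch on the same example data; the object names long and counts are only illustrative, and count(id, value) is used as shorthand for the group_by() plus summarise(n()) pair.

```r
library(dplyr)
library(tidyr)
library(tibble)

# same example data as in the question
df <- tribble(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

# long format: one row per (id, original column, level)
long <- df %>% gather(Key, value, -id)

# one row per (id, level) with its frequency; this is what spread() receives
counts <- long %>% count(id, value)

print(long)
print(counts)
```

Because all three columns end up stacked in the single value column, one spread() call is enough, no matter how many categorical columns there are.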

Option #2: Another option is to use reshape2::dcast:

library(tidyverse)
library(reshape2)

df %>% gather(Key, value, -id) %>% 
  dcast(id~value, fun.aggregate = length)

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2
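
In newer tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a hedged sketch, assuming tidyr 1.0.0 or later is installed, the same counts could be produced as follows; the object name wide is only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

wide <- df %>%
  # stack every column except id; default output columns are "name" and "value"
  pivot_longer(-id) %>%
  # frequency of each level within each id
  count(id, value) %>%
  # one column per level; levels missing for an id are filled with 0
  pivot_wider(names_from = value, values_from = n,
              values_fill = list(n = 0L))

print(wide)
```

The named-list form of values_fill is used here because it is accepted by tidyr 1.0.0 as well as by later versions.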

Edited: to include a solution for the second data frame.

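library(tidyverse)

# data (tribble() replaces the deprecated frame_data())
df1 <- tribble(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

# unite() pastes the column name onto each level, e.g. Do_you_Drive_yes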
df1 %>% gather(Key, value, -id) %>% unite("value", c(Key, value)) %>%
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()

#   id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1  1               1                1                  0                   2
# 2  2               1                1                  1                   1
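
The prefixed variant can also be written with pivot_longer() and pivot_wider(), where passing two columns to names_from takes the place of the unite() step. This is a sketch under the assumption that tidyr 1.0.0 or later is available; the object name wide1 is only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df1 <- tribble(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

wide1 <- df1 %>%
  # "name" holds the original column name, "value" holds yes/no
  pivot_longer(-id) %>%
  # count each (column, level) pair within each id
  count(id, name, value) %>%
  # names_from with two columns joins them with "_" by default,
  # giving names such as Do_you_Drive_yes
  pivot_wider(names_from = c(name, value), values_from = n,
              values_fill = list(n = 0L))

print(wide1)
```

The column order may differ from the spread() output above, so it is safer to refer to the resulting columns by name rather than by position.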
