Compare two groups based on a categorical variable in R

Question

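I have created a data frame df that contains more than 8,000 firm-years.

gvkey = firm identifier

fam = dummy variable (equals 1 if the firm is a family firm)

industry = categorical variable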

   gvkey   fam  industry
1   1004    0     6
2   1004    0     6
3   1004    0     6
4   1004    0     6
5   1004    0     6
6   1013    0     4
7   1013    0     4
8   1013    0     4
9   1013    0     4
10  1013    0     4
11  1013    0     4
12  1045    0     5
13  1045    0     5
14  1045    0     5
15  1045    0     5
16  1045    0     5
17  1045    0     5
18  1072    0     4
19  1072    0     4
20  1072    0     4
21  1072    0     4
22  1072    0     4
23  1076    1     9
24  1076    1     9
25  1076    1     9
26  1076    1     9
27  1076    1     9
28  1076    1     9
29  1078    0     4
30  1078    0     4
31  1078    0     4
32  1078    0     4
33  1078    0     4
34  1078    0     4
35  1121    1     6
36  1121    1     6
37  1121    1     6
38  1121    1     6
39  1121    1     6
40  1121    1     6
41  1161    0     4
42  1161    0     4
43  1161    0     4
44  1161    0     4
45  1161    0     4
46  1161    0     4
47  1209    0     4
48  1209    0     4
49  1209    0     4
50  1209    0     4
...

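The output should look like this, where "industry description" = industry.

The logic in words:

1) Across all unique gvkey, create a column that counts the firms with fam = 0 in each industry.

2) Across all unique gvkey, create a column that counts the firms with fam = 1 in each industry.

3) Create an output that shows the frequency of family and non-family firms for each industry.

Maybe this can even be done in a single piece of code?!

Many thanks!!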

Answer 1

One dplyr option could be:

library(dplyr)

# Count distinct firms (gvkey) per industry, split by family status
df %>%
 group_by(industry) %>%
 summarise(n_family = n_distinct(gvkey[fam == 1]),
           n_no_family = n_distinct(gvkey[fam == 0]),
           perc_family = n_family/n_distinct(gvkey)*100)

  industry n_family n_no_family perc_family
     <int>    <int>       <int>       <dbl>
1        4        0           5           0
2        5        0           1           0
3        6        1           1          50
4        9        1           0         100

Answer 2

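Your logic in words is not entirely clear to me (especially the statement about unique gvkey in the final output), but here are two results so you can see which one you want:

Result 1: counting with unique(df), i.e. one row per firm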

# FUN returns, per industry: non-family count, family count, family share in %
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     unique(df),
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt FamPerc
1        4        5      0       0
2        5        1      0       0
3        6        1      1      50
4        9        0      1     100

Result 2: counting with df, i.e. one row per firm-year

# FUN returns, per industry: non-family count, family count, family share in %
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     df,
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt   FamPerc
1        4       27      0   0.00000
2        5        6      0   0.00000
3        6        5      6  54.54545
4        9        0      6 100.00000
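
The two results differ only in the unit being counted: unique(df) leaves one row per firm, while df has one row per firm-year. As a rough cross-check, a plain table() cross-tabulation should show the same counts. The sketch below rebuilds the sample data with rep() so that it runs on its own. The names n_years, firms, by_firm and by_firm_year are introduced here for illustration only, and it assumes fam and industry never change within a gvkey.

```r
# Rebuild the 50-row sample: one entry per firm, repeated once per firm-year
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004, 1013, 1045, 1072, 1076, 1078, 1121, 1161, 1209), times = n_years),
  fam      = rep(c(0, 0, 0, 0, 1, 0, 1, 0, 0), times = n_years),
  industry = rep(c(6, 4, 5, 4, 9, 4, 6, 4, 4), times = n_years)
)

# One row per firm (assumes fam and industry are constant within a gvkey)
firms <- unique(df)

# Rows = industry, columns = fam (0 = non-family, 1 = family)
by_firm      <- table(industry = firms$industry, fam = firms$fam)
by_firm_year <- table(industry = df$industry, fam = df$fam)

by_firm
by_firm_year

# Share of family firms per industry, in percent
round(prop.table(by_firm, margin = 1)[, "1"] * 100, 2)
```

If a firm could switch family status or industry over time, unique(df) would keep more than one row for it, and the per-firm counts would need a different definition.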

Answer 3

Base R solution (note: keeping spaces in vector names is generally not good practice)

# Reshape / rename the input data:
ir_df <- setNames(reshape(setNames(aggregate(. ~ fam + industry, df, length),
                                   c("fam", "industry", "count")),
                          direction = "wide",
                          idvar = "industry",
                          timevar = "fam"),
                  c("Industry", "Nonfamily Firms", "Family Firms"))

# Transform the data frame to contain the final equation:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
                      `Percent Family Firms In Industry` =
                        round(`Family Firms` /
                                rowSums(ir_df[, grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
                              * 100, 2))
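
The caveat about spaces in names shows up as soon as a result passes through data.frame() with its default check.names = TRUE, which rewrites non-syntactic names. A small sketch, using a made-up two-row data frame (a and b are illustrative names, not part of the answer):

```r
# Default: data.frame() runs names through make.names(), so spaces become dots
a <- data.frame(`Family Firms` = c(1, 0))
names(a)

# check.names = FALSE keeps the name, but every later reference needs backticks
b <- data.frame(`Family Firms` = c(1, 0), check.names = FALSE)
names(b)
b$`Family Firms`
```

Because transform() appears to rebuild its result through data.frame(), the columns of final_df will likely come back with dots in place of the spaces.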

Data:

df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L, 
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L, 
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L, 
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L, 
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L, 
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, 
-50L))

r

frequency

conditional-statements