dplyr: generate a new column that takes a percentage of boolean rows

I have a very large dataset with many columns, but I only select two of them: the parents' education level and gender.

       parent_edu         gender     n
       <chr>              <chr>  <int>
     1 associate's degree female   116
     2 associate's degree male     106
     3 bachelor's degree  female    63
     4 bachelor's degree  male      55
     5 high school        female    94
     6 high school        male     102
     7 master's degree    female    36
     8 master's degree    male      23
     9 some college       female   118
    10 some college       male     108
    11 some high school   female    91
    12 some high school   male      88

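From here, I need to use the count function to generate a new column n that counts how many female parents reached each education level and how many male parents have that education level.

    student1 %>%
      count(parent_edu, gender)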

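The last step is to add a final column containing the share of each gender within each education level. For example, "some college" might have 52% female and 48% male, and "high school" perhaps 47% female and 53% male. So far I have tried, without success, to use the mutate function like this: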

    student1 %>%
    count(parent_edu, gender) %>%
    mutate(percentage = 

Can anyone tell me what expression I should put there? Or should I add another function to the pipe? The final result should look like this:

    parent_edu         gender      n      percentage
    <chr>              <chr>      <int>    <dbl>
    associate's degree  female    116      0.52
    associate's degree  male      106      0.48
    bachelor's degree   female    63       0.53
    bachelor's degree   male      55       0.47
    high school         female    94       0.48
    high school         male      102      0.52
    master's degree     female    36       0.61
    master's degree     male      23       0.39
    some college        female    118      0.52
    some college        male      108      0.48

Including the dput output:

    df <- structure(list(parent_edu = c("associate's degree", "associate's degree", 
    "bachelor's degree", "bachelor's degree", "high school", "high school", 
    "master's degree", "master's degree", "some college", "some college"
    ), gender = c("female", "male", "female", "male", "female", "male", 
    "female", "male", "female", "male"), n = c(116, 106, 63, 55, 
    94, 102, 36, 23, 118, 108)), row.names = c(NA, -10L), class = c("tbl_df", 
    "tbl", "data.frame"))

Updated version, using the df from the dput output above.

Solution:

    library(dplyr)

    df <- df %>%
      group_by(parent_edu) %>% # grouping by parent education
      mutate(total = sum(n)) %>% # total within each group
      mutate(percentage = (n/total)) %>% # share of each gender within the group
      mutate(percentage = round(percentage, 2)) %>% # rounding to match your example
      select(-total) %>% # dropping the helper total column
      ungroup() # removing the grouping so later steps are not affected
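As a sanity check on the steps above, the same within-group share can be computed in base R with ave(), which applies a function within groups and returns a vector of the original length. This is only a minimal sketch on two of the education levels from the dput data; the name check is hypothetical and not part of the original question.

```r
# Hypothetical check data: a subset of the counts from the dput output
check <- data.frame(
  parent_edu = c("associate's degree", "associate's degree",
                 "master's degree", "master's degree"),
  gender = c("female", "male", "female", "male"),
  n = c(116, 106, 36, 23)
)

# ave() returns the group total repeated on every row of that group
check$total <- ave(check$n, check$parent_edu, FUN = sum)

# Share of each gender within its education level, rounded as in the example
check$percentage <- round(check$n / check$total, 2)

# The shares within each education level should sum to approximately 1
tapply(check$percentage, check$parent_edu, sum)
```

For "associate's degree" the total is 116 + 106 = 222, so the female share is 116 / 222, which rounds to 0.52 as in the expected output.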

The final answer looks like this:

    student1 %>%
    count(parent_edu, gender) %>%
    group_by(parent_edu) %>% # grouping by parent education 
    mutate(total = sum(n)) %>% # total within groups
    mutate(percentage = (n/total)) %>% # calculating percentage
    mutate(percentage = round(percentage, 2)) %>% # rounding to match your example
    select(-total) # dropping the total column
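The intermediate total column is not strictly needed, because sum(n) inside a grouped mutate is already evaluated per group. A more compact sketch of the same idea, assuming the counts from the dput output (rebuilt here with tibble() so the block runs on its own; result is a hypothetical name):

```r
library(dplyr)

# Counts copied from the dput output in the question
df <- tibble(
  parent_edu = rep(c("associate's degree", "bachelor's degree", "high school",
                     "master's degree", "some college"), each = 2),
  gender = rep(c("female", "male"), times = 5),
  n = c(116, 106, 63, 55, 94, 102, 36, 23, 118, 108)
)

result <- df %>%
  group_by(parent_edu) %>%
  mutate(percentage = round(n / sum(n), 2)) %>% # share within each education level
  ungroup()

result
```

With dplyr 1.1.0 or later, mutate(percentage = round(n / sum(n), 2), .by = parent_edu) should give the same column without the explicit group_by() and ungroup() calls.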