dplyr: generate a new column that takes a percentage of boolean rows
I have a very large dataset with many columns, but I only select 2 of them: parent education level and gender.
   parent_edu         gender     n
   <chr>              <chr>  <int>
 1 associate's degree female   116
 2 associate's degree male     106
 3 bachelor's degree  female    63
 4 bachelor's degree  male      55
 5 high school        female    94
 6 high school        male     102
 7 master's degree    female    36
 8 master's degree    male      23
 9 some college       female   118
10 some college       male     108
11 some high school   female    91
12 some high school   male      88
From here, I need to use the count function to generate a new column n that counts how many female parents and how many male parents reached each education level.
student1 %>%
  count(parent_edu, gender)
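The last step is to get a final column holding the share of each gender within each education level. So, for example, for "some college" we have 52% female and 48% male, and then for "high school" perhaps 47% female and 53% male.
So far I have been using the mutate function like this, without success: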
student1 %>%
count(parent_edu, gender) %>%
mutate(percentage =
Can anyone point me to the expression I should put in there? Or should I add another function with the pipe?
The final result should look like this:
parent_edu         gender     n percentage
<chr>              <chr>  <int>      <dbl>
associate's degree female   116       0.52
associate's degree male     106       0.48
bachelor's degree  female    63       0.53
bachelor's degree  male      55       0.47
high school        female    94       0.48
high school        male     102       0.52
master's degree    female    36       0.61
master's degree    male      23       0.39
some college       female   118       0.52
some college       male     108       0.48
Including the dput output:
df <- structure(list(parent_edu = c("associate's degree", "associate's degree",
"bachelor's degree", "bachelor's degree", "high school", "high school",
"master's degree", "master's degree", "some college", "some college"
), gender = c("female", "male", "female", "male", "female", "male",
"female", "male", "female", "male"), n = c(116, 106, 63, 55,
94, 102, 36, 23, 118, 108)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Updated version, using the dput data above:
Solution:
df <- df %>%
  group_by(parent_edu) %>%                      # grouping by parent education
  mutate(total = sum(n)) %>%                    # total within groups
  mutate(percentage = (n/total)) %>%            # calculating percentage
  mutate(percentage = round(percentage, 2)) %>% # rounding to match your example
  select(-total)                                # dropping the total column
The final answer looks like this:
student1 %>%
  count(parent_edu, gender) %>%
  group_by(parent_edu) %>%                      # grouping by parent education
  mutate(total = sum(n)) %>%                    # total within groups
  mutate(percentage = (n/total)) %>%            # calculating percentage
  mutate(percentage = round(percentage, 2)) %>% # rounding to match your example
  select(-total)                                # dropping the total column
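For what it's worth, the same steps can probably be collapsed into a single mutate, since sum(n) is evaluated within each group after group_by. The following is only a sketch of that idea, not the answer's exact code: it rebuilds the ten counts from the dput output as a plain data.frame, and the name result and the trailing ungroup() are my own additions.

```r
library(dplyr)

# Counts copied from the dput output in the question
df <- data.frame(
  parent_edu = rep(c("associate's degree", "bachelor's degree", "high school",
                     "master's degree", "some college"), each = 2),
  gender = rep(c("female", "male"), times = 5),
  n = c(116, 106, 63, 55, 94, 102, 36, 23, 118, 108)
)

result <- df %>%
  group_by(parent_edu) %>%                       # one group per education level
  mutate(percentage = round(n / sum(n), 2)) %>%  # share of each gender within the group
  ungroup()                                      # assumption: drop grouping so later steps see all rows

print(result)
```

Skipping the temporary total column means there is nothing to drop afterwards. The ungroup() at the end is optional, but it avoids surprises if more verbs are piped on later. With dplyr 1.1.0 or newer, mutate(..., .by = parent_edu) should give the same result without a separate group_by.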