Counting number of occurrences in multiple factor variables
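I have several columns containing information about the gender composition of households (10 variables in total). I want to count the number of males in each household.
Head of the dataset: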
    gndr  gndr2  gndr3  gndr4  gndr5 gndr6 gndr7 gndr8 gndr9 gndr10
1   Male Female   <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>
2 Female   Male Female   Male   Male  Male  Male  <NA>  <NA>   <NA>
3 Female   Male Female   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>
4   Male   <NA>   <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>
5   Male Female   Male Female Female  Male  Male  <NA>  <NA>   <NA>
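I would like to create a table containing the number of households with no males, one male, two males, and so on. Is there any code in the dplyr and tidyr packages that can do this?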
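This is a very standard use case in which data are collected in "wide" form but are best worked with in "long" form. The difference is that in long form you have just one column for gender and another column identifying the individual that gender belongs to. We'll use tidyr::gather to reshape the data into long form, then use dplyr to summarize the number of households with 1, 2, 3, ... males.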
library(dplyr)
library(tidyr)

# Sample data: one row per household, one column per household member
wide.df <- tribble(
  ~gndr,    ~gndr2,   ~gndr3,   ~gndr4,   ~gndr5,   ~gndr6, ~gndr7, ~gndr8, ~gndr9, ~gndr10,
  "Male",   "Female", NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Female", "Male",   "Female", "Male",   "Male",   "Male", "Male", NA,     NA,     NA,
  "Female", "Male",   "Female", NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   NA,       NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   "Female", "Male",   "Female", "Female", "Male", "Male", NA,     NA,     NA
)

wide.df %>%
  mutate(household = row_number()) %>%                          # household ID
  gather(key = "individual", value = "gender", -household) %>%  # wide -> long
  mutate(individual = factor(individual),
         gender = factor(gender)) %>%
  filter(gender == "Male") %>%     # keep males only (NA rows are dropped too)
  group_by(household) %>%
  summarize(males = n()) %>%       # number of males per household
  arrange(desc(males)) %>%
  group_by(males) %>%
  summarize(households = n())      # number of households per male count
# # A tibble: 3 x 2
#   males households
#   <int>      <int>
# 1     1          3
# 2     4          1
# 3     5          1
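The question also asks about households with no males, which the `filter(gender == "Male")` step drops before counting. Below is a minimal sketch of one way to keep them, assuming it is acceptable to count per row with base R's `rowSums()`. The names `with.zero`, `males.per.household` and `zero.counts` are introduced here for illustration, and the sixth, all-female household is hypothetical, added only so that a zero count shows up.

```r
library(dplyr)

# Same sample data as above, rebuilt so this block runs on its own.
wide.df <- tribble(
  ~gndr,    ~gndr2,   ~gndr3,   ~gndr4,   ~gndr5,   ~gndr6, ~gndr7, ~gndr8, ~gndr9, ~gndr10,
  "Male",   "Female", NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Female", "Male",   "Female", "Male",   "Male",   "Male", "Male", NA,     NA,     NA,
  "Female", "Male",   "Female", NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   NA,       NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   "Female", "Male",   "Female", "Female", "Male", "Male", NA,     NA,     NA
)

# Hypothetical extra household with no males (not in the original data);
# bind_rows() fills the columns it lacks with NA.
with.zero <- bind_rows(wide.df, tibble(gndr = "Female"))

# Comparing the whole table to "Male" gives a logical matrix. NA cells are
# skipped by na.rm = TRUE, so each row sum is that household's male count.
males.per.household <- unname(rowSums(with.zero == "Male", na.rm = TRUE))

# Tally how many households have each male count, zero included.
zero.counts <- tibble(males = males.per.household) %>%
  count(males, name = "households")

print(zero.counts)
```

With the hypothetical sixth household included, `zero.counts` should contain a row for 0 males (one household) in addition to the rows for 1, 4 and 5 males shown above.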
Or, if you want to count households by number of males and by number of females, just add one more grouping variable.
wide.df %>%
  mutate(household = row_number()) %>%                          # household ID
  gather(key = "individual", value = "gender", -household) %>%  # wide -> long
  mutate(individual = factor(individual),
         gender = factor(gender)) %>%
  filter(!is.na(gender)) %>%         # drop empty member slots
  group_by(household, gender) %>%
  summarize(count = n()) %>%         # members of each gender per household
  group_by(gender, count) %>%
  summarize(households = n()) %>%    # households per (gender, count) pair
  arrange(count)
# # A tibble: 6 x 3
# # Groups:   gender [2]
#   gender count households
#   <fct>  <int>      <int>
# 1 Female     1          1
# 2 Male       1          3
# 3 Female     2          2
# 4 Female     3          1
# 5 Male       4          1
# 6 Male       5          1
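For reference, `gather()` has been superseded by `pivot_longer()` since tidyr 1.0.0. The sketch below is one possible way to get the same two-way tally with `pivot_longer()` and `dplyr::count()`, not the answer's own method. The name `gender.counts` is introduced here for illustration, and the `as.character` cast is a precaution so that the all-NA columns share one type with the others before reshaping.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, rebuilt so this block runs on its own.
wide.df <- tribble(
  ~gndr,    ~gndr2,   ~gndr3,   ~gndr4,   ~gndr5,   ~gndr6, ~gndr7, ~gndr8, ~gndr9, ~gndr10,
  "Male",   "Female", NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Female", "Male",   "Female", "Male",   "Male",   "Male", "Male", NA,     NA,     NA,
  "Female", "Male",   "Female", NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   NA,       NA,       NA,       NA,       NA,     NA,     NA,     NA,     NA,
  "Male",   "Female", "Male",   "Female", "Female", "Male", "Male", NA,     NA,     NA
)

gender.counts <- wide.df %>%
  mutate(across(everything(), as.character)) %>%   # all-NA columns become character
  mutate(household = row_number()) %>%             # household ID
  pivot_longer(-household,
               names_to = "individual",
               values_to = "gender",
               values_drop_na = TRUE) %>%          # drop empty member slots
  count(household, gender, name = "count") %>%     # members of each gender per household
  count(gender, count, name = "households") %>%    # households per (gender, count) pair
  arrange(count, gender)

print(gender.counts)
```

Unlike the `group_by()`/`summarize()` version, `count()` returns an ungrouped tibble, and `gender` stays a character column here rather than a factor.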