dcast summarise on single column
I want to pivot my data so that I can get the average survival rate using dcast, but it doesn't seem to be possible:
Data
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 S
Sample data code:
df <- structure(list(PassengerId = 1:6, Survived = structure(c(1L,
2L, 2L, 2L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
Pclass = c(3L, 1L, 3L, 1L, 3L, 3L), Name = c("Braund, Mr. Owen Harris",
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Heikkinen, Miss. Laina",
"Futrelle, Mrs. Jacques Heath (Lily May Peel)", "Allen, Mr. William Henry",
"Moran, Mr. James"), Sex = c("male", "female", "female",
"female", "male", "male"), Age = c(22, 38, 26, 35, 35, NA
), SibSp = c(1L, 1L, 0L, 1L, 0L, 0L), Parch = c(0L, 0L, 0L,
0L, 0L, 0L), Ticket = c("A/5 21171", "PC 17599", "STON/O2. 3101282",
"113803", "373450", "330877"), Fare = c(7.25, 71.2833, 7.925,
53.1, 8.05, 8.4583), Cabin = c("", "C85", "", "C123", "",
""), Embarked = c("S", "C", "S", "S", "S", "Q")), .Names = c("PassengerId",
"Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch",
"Ticket", "Fare", "Cabin", "Embarked"), row.names = c(NA, 6L), class = "data.frame")
Function so far:
reshape2::dcast(titanic, Sex ~ ., mean)
Desired output:
Row Label Average of Survived
Male 3.14156
Female 3.14156
Currently, it returns NA for both groups, along with this warning:
Sex .
1 female NA
2 male NA
Warning messages:
1: In mean.default(.value[0], ...) :
argument is not numeric or logical: returning NA
I think this may be related to the cast function in reshape, but could it be related to reshape2?
How about trying it with dplyr?
library(dplyr)
output <- df %>%
  dplyr::mutate(Survived = as.numeric(as.character(Survived))) %>%
  dplyr::select(Sex, Survived) %>%
  dplyr::group_by(Sex) %>%
  dplyr::summarise(average_of_survived = mean(Survived))
output
## A tibble: 2 × 2
# Sex average_of_survived
# <chr> <dbl>
#1 female 1
#2 male 0
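So you can indeed use dcast for this, but two things get in the way. Survived is a factor, and mean() cannot average a factor, so it has to be converted to numeric first. You also need to pass value.var to say which column holds the values to aggregate; without it, dcast guesses a value column for you. Once value.var is given, the column order does not matter either, which surprised me.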
df$Survived <- as.numeric(as.character(df$Survived))
reshape2::dcast(df, Sex~., mean, value.var = "Survived")
# Sex .
#1 female 1
#2 male 0
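As a side note on why the conversion goes through as.character(): a minimal sketch, using a stand-alone vector (the name survived is just for illustration) with the same values as the Survived column in the question. Calling as.numeric() directly on a factor returns its internal level codes rather than the labels, so the mean would be off.

```r
# Same values as Survived in the question: a factor with labels "0" and "1"
survived <- factor(c("0", "1", "1", "1", "0", "0"), levels = c("0", "1"))

# Direct conversion yields the internal level codes: 1 2 2 2 1 1
codes <- as.numeric(survived)

# Converting via character yields the labels as numbers: 0 1 1 1 0 0
values <- as.numeric(as.character(survived))

mean(codes)   # 1.5, which is not a survival rate
mean(values)  # 0.5
```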
This can be done with dcast() from the reshape2 (or data.table) package.
Without dcast(), you can also aggregate directly with data.table:
library(data.table)
setDT(df)[, Survived := as.numeric(as.character(Survived))][, mean(Survived), by = Sex]
# Sex V1
#1: male 0
#2: female 1
df is the data given by dput() in the question. Chaining is used to form a "one-liner".
A more concise version of the above is
setDT(df)[, mean(as.numeric(as.character(Survived))), by = Sex]
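If the default V1 column name is a problem, one possible variation is to name the aggregate inside .(). This is only a sketch: the two-column df below is a cut-down stand-in for the question's data, and average_of_survived is simply the name borrowed from the dplyr answer.

```r
library(data.table)

# Cut-down stand-in for the question's df: only the two columns used here
df <- data.frame(
  Sex = c("male", "female", "female", "female", "male", "male"),
  Survived = factor(c("0", "1", "1", "1", "0", "0"), levels = c("0", "1"))
)

# .() is data.table's alias for list(); naming the element names the result column
result <- setDT(df)[
  , .(average_of_survived = mean(as.numeric(as.character(Survived)))),
  by = Sex
]
result
```

Note that setDT() converts df to a data.table by reference, so df itself is changed after this call.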