Is there a tidyverse equivalent to SELECT...COUNT(*)...GROUP BY...?
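I would like to understand how to do "group by" and "count" operations in the tidyverse. I have looked through a lot of posts without finding what I need; if this has already been answered, I would appreciate a link.
For example, I am looking for outliers in my data; I want to know which places received the most "bad" measures: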
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=TRUE)
df = data.frame(place, measure, rating)
> df
place measure rating
1 AL meas1 good
2 AK meas1 good
3 AZ meas1 good
4 AR meas1 bad
5 CA meas1 bad
6 CO meas1 bad
7 CT meas1 bad
8 DE meas1 good
9 FL meas1 good
10 GA meas1 good
....(etc).....
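I would like to understand how to do this with the tidyverse. The sqldf approach below gives me what I want: it tells me which places have the most "bad" ratings and ranks the places by their "bad-ness":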
library(sqldf)
sqldf("SELECT place, rating, COUNT(*) AS Count FROM df GROUP BY place, rating ORDER BY rating, Count DESC")
place rating Count
1 CA bad 3
2 AK bad 2
3 AR bad 1
4 CO bad 1
5 CT bad 1
6 DE bad 1
7 FL bad 1
8 GA bad 1
9 AL good 4
10 AZ good 4
11 HI good 4
....(etc)....
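Is there a way to get a similar result in the tidyverse?
For an introduction to these basic operations in the tidyverse, I suggest starting with Wickham and Grolemund's excellent R for Data Science: http://r4ds.had.co.nz/
You can do the following in a simple, readable way using the dplyr and magrittr packages: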
# Load the tidyverse
library(tidyverse)
# Create data
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=TRUE)
df = data.frame(place, measure, rating)
# Do some analysis
df %>%
  group_by(place) %>%
  summarise(mean_score = mean(rating == "good"), n = n()) %>%
  arrange(desc(mean_score))
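Here we "group by" place, "then" "summarise" each group by the proportion of 'good' ratings it received (creating a new variable in the process), "then" "arrange" the output in descending order of this 'mean_score'.
We also create a new variable 'n' inside summarise(), which counts the number of ratings each mean is based on (i.e., if we see that a place has only 2 ratings, we know the mean may not be representative; see http://www.evanmiller.org/how-not-to-sort-by-average-rating.html for a thorough example of this).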
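The pipeline above ranks places by their share of 'good' ratings rather than reproducing the SQL counts directly. A closer match to the SELECT...COUNT(*)...GROUP BY query is probably dplyr's count(), which groups and tallies in one step. The following is a minimal sketch, assuming the same df as in the question; the names counts and bad_counts are made up for illustration, and the exact numbers depend on your R version's random number generator, so no output is shown.

```r
library(dplyr)

# Same example data as in the question
place   <- rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times = 4)
measure <- rep(c('meas1','meas2','meas3','meas4'), each = 11)
set.seed(200)
rating  <- sample(c('good','bad'), size = 44, prob = c(2, 1), replace = TRUE)
df <- data.frame(place, measure, rating)

# count() is shorthand for group_by() + summarise(n = n());
# the new column is named "n" by default
counts <- df %>%
  count(place, rating) %>%
  arrange(rating, desc(n))   # like ORDER BY rating, Count DESC

# Hypothetical follow-up: keep only the "bad" ratings, worst places first
bad_counts <- counts %>%
  filter(rating == "bad")
```

Printing counts should give one row per place/rating combination, in the same spirit as the sqldf result; bad_counts narrows that to the "bad-ness" ranking the question asks about.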