Get Standard Deviation From Groups in Data Frame
I have a data frame in the following format:
user <- c(1,1,2,2,2,2,3,3,3)
answer_num <- c(1,2,3,3,4,4,5,5,6)
df <- data.frame(user,answer_num)
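I am trying to collect statistics on how many times each user gives each answer. For example, I can get the mean number of instances per answer: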
library(dplyr)
df %>% group_by(user) %>% summarise(inst_per_answer = n()/length(unique(answer_num)))
This gives me:
user inst_per_answer
1 1 1.0
2 2 2.0
3 3 1.5
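How can I get the standard deviation of the number of instances per answer?
Clarification:
I am looking for the standard deviation of the number of instances of each answer, computed per user. For example, user 1 has 1 instance of answer 1 and 1 instance of answer 2, so the standard deviation is 0, i.e. sd(c(1,1)). User 3 has 2 instances of answer 5 and 1 instance of answer 6, so the sd is about 0.7, i.e. sd(c(2,1)).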
Maybe try this
df %>%
group_by(user, answer_num) %>%
summarise(n = n()) %>%
summarise(sd_per_user = sd(n))
# Source: local data frame [3 x 2]
#
# user sd_per_user
# 1 1 0.0000000
# 2 2 0.0000000
# 3 3 0.7071068
Or a shorter version
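df %>%
count(user, answer_num) %>%
# recent dplyr versions return an ungrouped result from count(), so regroup by user
group_by(user) %>%
summarise(sd_per_user = sd(n))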
# Source: local data frame [3 x 2]
#
# user sd_per_user
# 1 1 0.0000000
# 2 2 0.0000000
# 3 3 0.7071068
Or a data.table version (using @Thelas base R idea)
library(data.table)
setDT(df)[, .(sd_per_user = sd(table(answer_num))), by = user]
# user sd_per_user
# 1: 1 0.0000000
# 2: 2 0.0000000
# 3: 3 0.7071068
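The base R idea mentioned above is not shown in this thread. A minimal sketch of what it could look like, assuming tapply() is used to split answer_num by user (the name sd_per_user is only illustrative):

```r
# Rebuild the example data so the snippet is self-contained
user <- c(1,1,2,2,2,2,3,3,3)
answer_num <- c(1,2,3,3,4,4,5,5,6)
df <- data.frame(user, answer_num)

# For each user, count how often each answer occurs with table(),
# then take the standard deviation of those counts
sd_per_user <- tapply(df$answer_num, df$user, function(x) sd(table(x)))
sd_per_user
```

The result is a named vector with one entry per user, which matches the sd_per_user column in the outputs above.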
For those interested in sqldf, there are two options:
RSQLite, using STDEV:
library(sqldf)
sqldf("SELECT user, STDEV(n) AS sd
FROM (SELECT user, answer_num, count(answer_num) AS n
FROM df GROUP BY user,answer_num)
GROUP BY user")
RH2, using STDDEV or STDDEV_SAMP:
library(RH2)
sqldf("SELECT user, STDDEV(n) AS sd
FROM (SELECT user, answer_num, COUNT(answer_num) AS n
FROM df GROUP BY user,answer_num)
GROUP BY user")
Output:
user sd
1 1 0.0000000
2 2 0.0000000
3 3 0.7071068
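One edge case may be worth checking for the R approaches above. sd() is the sample standard deviation (dividing by n - 1), so a user with only one distinct answer has a single count and sd() returns NA rather than 0. A small sketch, where the single-answer user is hypothetical and not part of the original data:

```r
# Counts for user 3 from the example: answer 5 twice, answer 6 once
counts_user3 <- table(c(5, 5, 6))
# Sample sd written out by hand: sqrt(sum((x - mean(x))^2) / (n - 1))
x <- as.numeric(counts_user3)
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
manual_sd
sd(counts_user3)

# Hypothetical user who only ever gives answer 7 (three times):
# there is just one count, so the sample sd is undefined
counts_single <- table(c(7, 7, 7))
sd(counts_single)
```

If a 0 is preferred for such users, the NA values can be replaced after the summary step.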