Calculating group standard deviation in R when groups contain multiple data points
I am using R and I am trying to calculate my standard deviation correctly.
My data looks like this:
Target    category  wordproduced  wordValue
wall      A         home          .003
wall      A         table         .005
window    A         cow           .015
window    B         backyard      .012
friend    B         dog           .018
friend    B         chance        .088
friend    B         spoon         .002
big       C         country       .009
big       C         pen           .015
big       C         pub           .012
money     C         palace        .078
rail      C         wood          .026
rail      C         ferrari       .030
rail      C         car           .062
science   D         phone         .007
science   D         laboratory    .009
science   D         side          .019
water     D         ocean         .013
water     D         river         .020
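So I have four categories (A, B, C, D) and 8 target words in total. Each word belongs to a category.
If I want to calculate the mean number of words produced per target word, I would write code like this: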
library(dplyr)

mydata %>%
  group_by(category) %>%
  summarise(TargetN = length(unique(Target)),
            wProducedN = length(wordproduced),
            meanW = wProducedN / TargetN)
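If I compute the mean with the mean() function, I get the wrong value, because it counts every row for a target rather than each unique target. For example, category A has only 2 unique target words but 3 rows in total, so I need to compute my mean by dividing by 2. The code above solves that problem. But when I calculate the SD, I get a lot of wrong answers or NA.
For example, I tried this: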
mydata %>%
  group_by(category) %>%
  summarise(TargetN = length(unique(Target)),
            wProducedN = length(wordproduced),
            meanW = wProducedN / TargetN,
            SD = sd(length(wordproduced)))
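Here I get NA, and with other attempts I get 0, or the exact number of unique targets, and so on.
How should I calculate my SD?
Adding reproducible data. Note that the categories have been changed to numbers: instead of A, B, C, D they are 1, 2, 3 (only three).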
newDat <- structure(list(Target = c(
"permit",
"confusion",
"presion",
"transanction",
"sorprise",
"same",
"agony",
"prime",
"suffer",
"affect",
"car",
"neglect",
"intern",
"explore",
"image",
"pension",
"amature",
"terrified",
"importance",
"deal",
"replace",
"euforic",
"optimist",
"return",
"inmerse",
"doll",
"actor",
"singular",
"desctruction",
"dispute",
"tremor",
"profesional",
"redem",
"euforic",
"pen",
"pause",
"cultive",
"center",
"cheer",
"slace",
"recess",
"apple",
"introduction",
"despicable",
"offense",
"inteligent",
"hope",
"contender",
"stress",
"disgust"
), Category = c(
"3",
"1",
"1",
"1",
"1",
"1",
"1",
"2",
"2",
"2",
"2",
"2",
"1",
"1",
"2",
"2",
"1",
"1",
"2",
"1",
"1",
"1",
"1",
"2",
"1",
"1",
"3",
"1",
"1",
"1",
"1",
"1",
"1",
"1",
"2",
"3",
"1",
"3",
"1",
"2",
"2",
"1",
"1",
"1",
"1",
"2",
"1",
"3",
"1",
"1"
), wordproduced = c(
"liberty",
"intense",
"sad",
"serenity",
"afraid",
"sadness",
"hurt",
"freedom",
"depress",
"feeling",
"love",
"positive",
"river",
"palace",
"ilusion",
"stress",
"aliviated",
"violence",
"presion",
"damage",
"hate",
"happy",
"dwindle",
"spoon",
"kitchen",
"dog",
"backyard",
"alone",
"cat",
"confidence",
"fear",
"moving",
"house",
"ocean",
"territory",
"continent",
"sky",
"rainbow",
"approach",
"law",
"good",
"school",
"science",
"land",
"laboratory",
"engage",
"destiny",
"voice",
"arange",
"infertile"
), wordValue = c(
0.10,
0.09,
0.01,
0.1,
0.046,
0.316,
0.12,
0.03,
0.03,
0.02,
0.46,
0.19,
0.26,
0.070,
0.040,
0.01,
0.025,
0.03,
0.05,
0.089,
0.075,
0.03,
0.067,
0.04,
0.04,
0.1,
0.068,
0.055,
0.17,
0.075,
0.535,
0.06,
0.1,
0.12,
0.04,
0.08,
0.036,
0.1,
0.05,
0.050,
0.07,
0.05,
0.8,
0.05,
0.06,
0.08,
0.055,
0.04,
0.12,
0.049
)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
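Edit, using the added sample data:
Although I'm not sure exactly what you are trying to do, I can tell you why you get NAs there: you are asking for the SD of a single number, so NA makes sense. That is, length(wordproduced) gives you one length value per category, and sd() of a single value is NA.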
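To make the NA concrete, here is a minimal base R sketch. The vectors `words_in_category` and `counts_per_target` are hypothetical stand-ins that I filled in from category A of the question's small table; they are not part of the original code.

```r
# Hypothetical stand-in for the wordproduced values of one category (A)
words_in_category <- c("home", "table", "cow")

n_words <- length(words_in_category)  # a single number
sd_of_count <- sd(n_words)            # sd() of a single value is NA

print(n_words)
print(sd_of_count)

# With one count per target there is a spread to measure
# (assumed counts for category A: wall = 2 words, window = 1 word)
counts_per_target <- c(wall = 2, window = 1)
sd_of_counts <- sd(counts_per_target)
print(sd_of_counts)
```

The point is only that sd() needs at least two values, so the counts have to be computed per target before taking the SD per category.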
I assume you want the SD of the number of words produced per Target, within each Category.
You have already calculated the mean number of words produced per Target for each Category, like this:
newDat_summary <- newDat %>%
  group_by(Category) %>%
  summarise(TargetN = length(unique(Target)),
            wProducedN = length(wordproduced),
            meanW = wProducedN / TargetN)
> newDat_summary
# A tibble: 3 x 4
  Category TargetN wProducedN meanW
  <chr>      <int>      <int> <dbl>
1 1             31         32  1.03
2 2             13         13  1
3 3              5          5  1
For the SD, we first need to find the number of words produced for each Target within each Category separately:
newDat_summary2 <- newDat %>%
  group_by(Category, Target) %>%
  summarise(TargetN = length(unique(Target)),
            wProducedN = length(wordproduced))
> newDat_summary2
# A tibble: 49 x 4
# Groups:   Category [3]
   Category Target       TargetN wProducedN
   <chr>    <chr>          <int>      <int>
 1 1        agony              1          1
 2 1        amature            1          1
 3 1        apple              1          1
 4 1        cheer              1          1
 5 1        confusion          1          1
 6 1        cultive            1          1
 7 1        deal               1          1
 8 1        desctruction       1          1
 9 1        despicable         1          1
10 1        disgust            1          1
# ... with 39 more rows
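As a side note, the per-target counting step can probably be written more compactly with dplyr's count(). The sketch below is my own illustration rather than part of the original answer. It rebuilds only the Target and category columns of the question's small table, reading "widnow" as a typo for "window", and the column name wProducedN is passed through count()'s name argument.

```r
library(dplyr)

# Target and category columns of the question's small example table
# (assumption: "widnow" in the question is a typo for "window")
mydata <- data.frame(
  Target = rep(
    c("wall", "window", "window", "friend", "big",
      "money", "rail", "science", "water"),
    times = c(2, 1, 1, 3, 3, 1, 3, 3, 2)
  ),
  category = rep(c("A", "B", "C", "D"), times = c(3, 4, 7, 5)),
  stringsAsFactors = FALSE
)

# count() groups by the given columns and counts the rows in each group,
# giving one row per (category, Target) pair
per_target <- count(mydata, category, Target, name = "wProducedN")
print(per_target)
```

This gives one row per Target within each category, which is the shape needed before taking the SD per category.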
Now that we have multiple values per Category, we can find the standard deviation among them:
newDat_summary3 <- newDat_summary2 %>%
  group_by(Category) %>%
  summarise(SD = sd(wProducedN))
> newDat_summary3
# A tibble: 3 x 2
  Category    SD
  <chr>    <dbl>
1 1        0.180
2 2        0
3 3        0
Then we combine it with the mean per Target for each Category:
newDat_summary <- merge(newDat_summary,newDat_summary3,by = "Category")
> newDat_summary
  Category TargetN wProducedN    meanW        SD
1        1      31         32 1.032258 0.1796053
2        2      13         13 1.000000 0.0000000
3        3       5          5 1.000000 0.0000000
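If the intermediate tables are not needed, the count, mean and SD steps can likely be folded into a single pipeline. The following is a sketch under my own assumptions, not the original answer's code. It runs on the Target and category columns of the question's small table (again reading "widnow" as "window"), and the column name wordsPerTarget is a hypothetical name chosen for illustration.

```r
library(dplyr)

# Target and category columns of the question's small example table
# (assumption: "widnow" in the question is a typo for "window")
mydata <- data.frame(
  Target = rep(
    c("wall", "window", "window", "friend", "big",
      "money", "rail", "science", "water"),
    times = c(2, 1, 1, 3, 3, 1, 3, 3, 2)
  ),
  category = rep(c("A", "B", "C", "D"), times = c(3, 4, 7, 5)),
  stringsAsFactors = FALSE
)

summary_by_category <- mydata %>%
  # Step 1: one row per target within each category, with its word count
  count(category, Target, name = "wordsPerTarget") %>%
  # Step 2: summarise those per-target counts within each category
  group_by(category) %>%
  summarise(
    TargetN    = n(),                   # number of unique targets
    wProducedN = sum(wordsPerTarget),   # total words produced
    meanW      = mean(wordsPerTarget),  # same as wProducedN / TargetN
    SD         = sd(wordsPerTarget)     # spread of the per-target counts
  )

print(summary_by_category)
```

Because the mean and the SD are both taken over the same per-target counts, this avoids the merge() step. A category with only one target would still give NA for the SD.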
Hope this is what you were looking for.