Dplyr summarize "sum" 函数仅适用于子集而不适用于 R 中的较大数据集
Dplyr summarize "sum" function works correctly only for subset not the larger dataset in R
我有一个数据集,其中我在 12 个月内对 6 个站点的 4 个物种的丰度进行了采样(一个站点内有 5 个重复)。我正在尝试计算站点内复制的各种摘要统计数据(平均值和总和)。
例如,在数据集显示 (df) 中,4 月 -- Site #1,物种“A”显示,我想要 sp1.abundance
在 site/month/year 的总和。总和明显接近~6
在测试#1 中,当我使用所有 12 个月、站点等的完整数据集时;该值 (sum.N) 变为
结果是 26.
在测试#2 中,当我使用子集时,即只有月份“Apr”和物种“A”时,总和是正确的。
我运行遇到了sum.nimi
的类似问题
我无法理解为什么会出现差异。此处可重现的示例并不是最有帮助的。
df <- read.csv(file = "test1.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.68 186. 26.7
2 2 Apr 2 A 1.84 3.93 153. 36.7
3 2 Apr 3 A 2.81 4.49 298. 56.3
4 2 Apr 4 A 2.43 3.77 261. 48.5
5 2 Apr 5 A 0 2.60 0 0
6 2 Apr 6 A 1.70 4.65 193. 34.1
7 2 Aug 1 A 0 0.791 0 0
8 2 Aug 2 A 0 0.590 0 0
9 2 Aug 3 A 0.717 0.795 38.6 14.3
10 2 Aug 4 A 0 0.366 0 0
11 2 Aug 5 A 0 0 0 0
12 2 Aug 6 A 0 0 0 0
13 1 Dec 1 A 0 1.29 0 0
14 1 Dec 2 A 1.50 1.12 88.9 30.0
15 1 Dec 3 A 1.12 2.41 57.0 22.4
16 1 Dec 4 A 0.660 0.495 32.7 13.2
17 1 Dec 5 A 0 1.54 0 0
18 1 Dec 6 A 2.28 3.44 291. 45.6
19 2 Feb 1 A 1.94 4.85 230. 38.9
20 2 Feb 2 A 1.62 3.06 180. 32.3
#子集(test2)
df <- read.csv(file = "test2.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
# A tibble: 6 × 8
# Groups: year, month, site [6]
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.57 39.5 6.67
2 2 Apr 2 A 1.84 3.41 14.7 9.18
3 2 Apr 3 A 2.81 3.17 32.2 14.1
4 2 Apr 4 A 2.43 2.61 29.7 12.1
5 2 Apr 5 A 0 3.47 0 0
6 2 Apr 6 A 1.70 4.49 37.5 8.52
[1]: https://i.stack.imgur.com/eCqBK.png
[2]: https://i.stack.imgur.com/lSyeW.png
[3]: https://i.stack.imgur.com/cJzAW.png
我不知道你是如何处理你的数据的,但看起来你可能以某种方式制作了 5 个 species-1 丰度数据的副本。如果我查看 April/site 1/物种 A 中的所有条目,物种 1 丰度列为 {3.00, 0, 0, 3.67, 0} 重复 4 次 (4*6.67 = 26.8;这些值与您的两个差异值相匹配)。数据中还有其他一些重复模式(例如,所有列 除了 species2
在行 {3, 13} 和 {9, 19} 中都是相同的。
我怀疑 messed-up 在上游某个点合并。
library(tidyverse)
df1 <- read_csv(file = "test1.csv")
filter(df1, month == "Apr", site ==1, species == "A")
# A tibble: 20 × 8
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 2 Apr 1 A 3.00 sp2.3.4 9.03 27.1
2 2 Apr 1 A 0 sp2.3.4 0 0
3 2 Apr 1 A 0 sp2.3.4 3.80 0
4 2 Apr 1 A 3.67 sp2.3.4 3.39 12.4
5 2 Apr 1 A 0 sp2.3.4 1.63 0
6 2 Apr 1 A 3.00 sp3.4.1 12.0 36.1
7 2 Apr 1 A 0 sp3.4.1 0 0
8 2 Apr 1 A 0 sp3.4.1 2.41 0
9 2 Apr 1 A 3.67 sp3.4.1 7.06 25.9
10 2 Apr 1 A 0 sp3.4.1 0 0
11 2 Apr 1 A 3.00 sp1.2.4 7.64 22.9
12 2 Apr 1 A 0 sp1.2.4 0 0
13 2 Apr 1 A 0 sp1.2.4 3.80 0
14 2 Apr 1 A 3.67 sp1.2.4 3.67 13.5
15 2 Apr 1 A 0 sp1.2.4 1.63 0
16 2 Apr 1 A 3.00 sp1.2.3 7.39 22.2
17 2 Apr 1 A 0 sp1.2.3 0 0
18 2 Apr 1 A 0 sp1.2.3 1.39 0
19 2 Apr 1 A 3.67 sp1.2.3 7.06 25.9
20 2 Apr 1 A 0 sp1.2.3 1.63 0
我有一个数据集,其中我在 12 个月内对 6 个站点的 4 个物种的丰度进行了采样(一个站点内有 5 个重复)。我正在尝试计算站点内复制的各种摘要统计数据(平均值和总和)。
例如,在数据集显示 (df) 中,4 月 -- Site #1,物种“A”显示,我想要 sp1.abundance
在 site/month/year 的总和。总和明显接近~6
在测试#1 中,当我使用所有 12 个月、站点等的完整数据集时;该值 (sum.N) 变为 结果是 26.
在测试#2 中,当我使用子集时,即只有月份“Apr”和物种“A”时,总和是正确的。
我运行遇到了sum.nimi
的类似问题我无法理解为什么会出现差异。此处可重现的示例并不是最有帮助的。
df <- read.csv(file = "test1.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.68 186. 26.7
2 2 Apr 2 A 1.84 3.93 153. 36.7
3 2 Apr 3 A 2.81 4.49 298. 56.3
4 2 Apr 4 A 2.43 3.77 261. 48.5
5 2 Apr 5 A 0 2.60 0 0
6 2 Apr 6 A 1.70 4.65 193. 34.1
7 2 Aug 1 A 0 0.791 0 0
8 2 Aug 2 A 0 0.590 0 0
9 2 Aug 3 A 0.717 0.795 38.6 14.3
10 2 Aug 4 A 0 0.366 0 0
11 2 Aug 5 A 0 0 0 0
12 2 Aug 6 A 0 0 0 0
13 1 Dec 1 A 0 1.29 0 0
14 1 Dec 2 A 1.50 1.12 88.9 30.0
15 1 Dec 3 A 1.12 2.41 57.0 22.4
16 1 Dec 4 A 0.660 0.495 32.7 13.2
17 1 Dec 5 A 0 1.54 0 0
18 1 Dec 6 A 2.28 3.44 291. 45.6
19 2 Feb 1 A 1.94 4.85 230. 38.9
20 2 Feb 2 A 1.62 3.06 180. 32.3
#子集(test2)
df <- read.csv(file = "test2.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
# A tibble: 6 × 8
# Groups: year, month, site [6]
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.57 39.5 6.67
2 2 Apr 2 A 1.84 3.41 14.7 9.18
3 2 Apr 3 A 2.81 3.17 32.2 14.1
4 2 Apr 4 A 2.43 2.61 29.7 12.1
5 2 Apr 5 A 0 3.47 0 0
6 2 Apr 6 A 1.70 4.49 37.5 8.52
[1]: https://i.stack.imgur.com/eCqBK.png
[2]: https://i.stack.imgur.com/lSyeW.png
[3]: https://i.stack.imgur.com/cJzAW.png
我不知道你是如何处理你的数据的,但看起来你可能以某种方式制作了 5 个 species-1 丰度数据的副本。如果我查看 April/site 1/物种 A 中的所有条目,物种 1 丰度列为 {3.00, 0, 0, 3.67, 0} 重复 4 次 (4*6.67 = 26.8;这些值与您的两个差异值相匹配)。数据中还有其他一些重复模式(例如,所有列 除了 species2
在行 {3, 13} 和 {9, 19} 中都是相同的。
我怀疑 messed-up 在上游某个点合并。
library(tidyverse)
df1 <- read_csv(file = "test1.csv")
filter(df1, month == "Apr", site ==1, species == "A")
# A tibble: 20 × 8
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 2 Apr 1 A 3.00 sp2.3.4 9.03 27.1
2 2 Apr 1 A 0 sp2.3.4 0 0
3 2 Apr 1 A 0 sp2.3.4 3.80 0
4 2 Apr 1 A 3.67 sp2.3.4 3.39 12.4
5 2 Apr 1 A 0 sp2.3.4 1.63 0
6 2 Apr 1 A 3.00 sp3.4.1 12.0 36.1
7 2 Apr 1 A 0 sp3.4.1 0 0
8 2 Apr 1 A 0 sp3.4.1 2.41 0
9 2 Apr 1 A 3.67 sp3.4.1 7.06 25.9
10 2 Apr 1 A 0 sp3.4.1 0 0
11 2 Apr 1 A 3.00 sp1.2.4 7.64 22.9
12 2 Apr 1 A 0 sp1.2.4 0 0
13 2 Apr 1 A 0 sp1.2.4 3.80 0
14 2 Apr 1 A 3.67 sp1.2.4 3.67 13.5
15 2 Apr 1 A 0 sp1.2.4 1.63 0
16 2 Apr 1 A 3.00 sp1.2.3 7.39 22.2
17 2 Apr 1 A 0 sp1.2.3 0 0
18 2 Apr 1 A 0 sp1.2.3 1.39 0
19 2 Apr 1 A 3.67 sp1.2.3 7.06 25.9
20 2 Apr 1 A 0 sp1.2.3 1.63 0