我如何计算正确的平均值?

How do I calculate the right mean?

我有一个显示多个国家双边出口的数据集。由于数据波动,我需要计算年份组的平均值。并非所有国家/地区都包含准确的年份。有些开始较晚,有些之间有差距 - 这意味着,有些年份缺失(但没有 NA 条目)。在一位了不起的社区成员的帮助下,我已经设法将数据分成几部分:year_group.

下面我列出了另外两个问题以及我的代码,错误的输出和底部的数据集的一些示例输入数据 total_trade

问题 1

我面临的问题是代码没有计算出正确的方法。当我手动计算结果时,我得到的结果与我的代码不同。 (见下文)

这是我的代码

# create vectors for coding 4 years average
year_group_break <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")
years <- c(1999, 2000, 2001, 2002,2003, 2004,   2005,   2006,   2007,   2008,   2009,   2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)


FourY_av <- total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts, 
  # chose right = FALSE to take cut before year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels  = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "avg_year_group_total"
  group_by(ReporterName, year_group) %>%
  mutate(total_year_group = mean(Total_Year)) %>%
  arrange(ReporterName,PartnerName, desc(Year))
View(FourY_av)

下面是错误的输出 此输出是 错误的 因为 total_year_group(安哥拉年份组“2015-2018”的平均值)应该是 34746013.5(手动计算时)而不是 34907582。(如在输出中)我的错误在哪里?

> head(FourY_av)
# A tibble: 6 x 9
# Groups:   ReporterName, year_group [1]
   Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group total_year_group
  <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>                 <dbl>
1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018         34907582.
2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018         34907582.
3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018         34907582.
4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018         34907582.
5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018         34907582.
6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018         34907582.

问题2

另一个问题是并非所有国家/地区都显示所有年份的数据。一些起步较晚,一些有差距。 我仍然需要同年组的方法以确保可比性。数据集没有 NA。只是缺少数据。

例如安哥拉不涵盖 2008 年。数据集不包括 NA,但不包括安哥拉 2008 年的行和值。其他国家/地区正在显示 2008 年的数据。我仍然需要在 total_year_group 列中安哥拉可用年份的平均值(通过取 2007、2009 和 2010 年的平均值)。这不应该是 mean 函数的问题,对吧?或者在这种情况下我需要考虑一些特别的事情吗?

这是 total_trade

的一些 样本输入数据
dput(head(total_trade, n = 100))
structure(list(Year = c(2015L, 2018L, 2017L, 2016L, 2017L, 2015L, 
2018L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L, 
2015L, 2016L, 2017L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 
2016L, 2015L, 2018L, 2014L, 2015L, 2016L, 2017L, 2017L, 2018L, 
2016L, 2015L, 2016L, 2018L, 2017L, 2015L, 2010L, 2009L, 2016L, 
2013L, 2014L, 2018L, 2017L, 2015L, 2016L, 2017L, 2018L, 2017L, 
2018L, 2016L, 2016L, 2018L, 2007L, 2013L, 2009L, 2018L, 2015L, 
2016L, 2014L, 2010L, 2017L, 2012L, 2011L, 2018L, 2016L, 2015L, 
2016L, 2011L, 2018L, 2017L, 2015L, 2015L, 2016L, 2018L, 2017L, 
2015L, 2015L, 2016L, 2018L, 2017L, 2007L, 2014L, 2010L, 2013L, 
2011L, 2009L, 2012L, 2017L, 2018L, 2016L, 2015L, 2015L, 2015L, 
2017L, 2016L, 2018L, 2015L), ReporterName = c("Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola"
), PartnerName = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Albania", "Albania", "Albania", "Algeria", "Algeria", 
"Algeria", "Algeria", "American Samoa", "Andorra", "Andorra", 
"Antigua and Barbuda", "Antigua and Barbuda", "Antigua and Barbuda", 
"Antigua and Barbuda", "Argentina", "Argentina", "Argentina", 
"Argentina", "Armenia", "Armenia", "Armenia", "Armenia", "Australia", 
"Australia", "Australia", "Australia", "Australia", "Austria", 
"Austria", "Austria", "Austria", "Azerbaijan", "Azerbaijan", 
"Azerbaijan", "Azerbaijan", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahrain", "Bahrain", "Bahrain", "Bangladesh", 
"Bangladesh", "Bangladesh", "Barbados", "Belarus", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belize", "Belize", 
"Belize", "Benin", "Benin", "Benin", "Benin", "Benin", "Bhutan", 
"Bolivia", "Bolivia", "Bolivia", "Bolivia", "Botswana", "Botswana", 
"Botswana", "Botswana", "Brazil", "Brazil", "Brazil", "Brazil", 
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", 
"British Virgin Islands", "Brunei", "Bulgaria", "Bulgaria", "Bulgaria", 
"Bulgaria"), PartnerISO3 = c("AFG", "AFG", "AFG", "AFG", "ALB", 
"ALB", "ALB", "DZA", "DZA", "DZA", "DZA", "ASM", "AND", "AND", 
"ATG", "ATG", "ATG", "ATG", "ARG", "ARG", "ARG", "ARG", "ARM", 
"ARM", "ARM", "ARM", "AUS", "AUS", "AUS", "AUS", "AUS", "AUT", 
"AUT", "AUT", "AUT", "AZE", "AZE", "AZE", "AZE", "BHS", "BHS", 
"BHS", "BHS", "BHS", "BHS", "BHS", "BHS", "BHR", "BHR", "BHR", 
"BGD", "BGD", "BGD", "BRB", "BLR", "BEL", "BEL", "BEL", "BEL", 
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BLZ", "BLZ", 
"BLZ", "BEN", "BEN", "BEN", "BEN", "BEN", "BTN", "BOL", "BOL", 
"BOL", "BOL", "BWA", "BWA", "BWA", "BWA", "BRA", "BRA", "BRA", 
"BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "VGB", 
"BRN", "BGR", "BGR", "BGR", "BGR"), `TradeValue in 1000 USD` = c(39.586, 
19.353, 2.248, 0.775, 39.723, 2.259, 2.38, 2169.123, 2322.463, 
2241.599, 245.226, 12.007, 5.975, 0.326, 422.006, 155.467, 47.018, 
54.774, 483.147, 142.23, 98.7, 61.362, 60.105, 30.494, 0.99, 
0.731, 40220.092, 45435.804, 16096.404, 8546.882, 1904.301, 627.179, 
433.699, 23.118, 5.124, 985.67, 600.371, 143.356, 9.926, 140139.415, 
108214.936, 64444.203, 100210.999, 52974.059, 7322.893, 145.791, 
26.995, 4.847, 5.187, 1.958, 125.722, 55.22, 2.75, 3.366, 54.31, 
107976.895, 123610.469, 66757.2, 67763.201, 50046.64, 40199.706, 
52383.95, 45614.873, 28690.458, 52907.343, 39328.574, 452.078, 
5.82, 0.32, 970.324, 1700.981, 804.478, 332.216, 69.342, 1.632, 
1530.58, 308.752, 62.569, 19.822, 55.241, 37.029, 16.917, 0.198, 
874217.786, 1032751.313, 509259.955, 428750.075, 333280.441, 
192964.08, 315316.932, 119947.132, 141486.749, 66556.728, 1273.093, 
5.064, 22.324, 158.252, 33.583, 8.435, 0.077), Total_Year = c(33924937.48, 
42096736.31, 34904881.111, 28057499.527, 34904881.111, 33924937.48, 
42096736.31, 28057499.527, 33924937.48, 34904881.111, 42096736.31, 
42096736.31, 34904881.111, 42096736.31, 42096736.31, 33924937.48, 
28057499.527, 34904881.111, 28057499.527, 33924937.48, 34904881.111, 
42096736.31, 42096736.31, 34904881.111, 28057499.527, 33924937.48, 
42096736.31, 58672369.19, 33924937.48, 28057499.527, 34904881.111, 
34904881.111, 42096736.31, 28057499.527, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 52612114.76, 40639411.73, 
28057499.527, 67712526.544, 58672369.19, 42096736.31, 34904881.111, 
33924937.48, 28057499.527, 34904881.111, 42096736.31, 34904881.111, 
42096736.31, 28057499.527, 28057499.527, 42096736.31, 44177783.072, 
67712526.544, 40639411.73, 42096736.31, 33924937.48, 28057499.527, 
58672369.19, 52612114.76, 34904881.111, 70863076.416, 66427390.221, 
42096736.31, 28057499.527, 33924937.48, 28057499.527, 66427390.221, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 44177783.072, 58672369.19, 52612114.76, 
67712526.544, 66427390.221, 40639411.73, 70863076.416, 34904881.111, 
42096736.31, 28057499.527, 33924937.48, 33924937.48, 33924937.48, 
34904881.111, 28057499.527, 42096736.31, 33924937.48), pct_by_partner_year = c(0.000116687024179005, 
4.59726850497024e-05, 6.44035999679013e-06, 2.7621848456389e-06, 
0.000113803567683494, 6.65881846158674e-06, 5.65364493454718e-06, 
0.0077309918437765, 0.00684588733986371, 0.00642202158738646, 
0.000582529719629944, 2.8522401146684e-05, 1.71179497245645e-05, 
7.74406827169068e-07, 0.00100246726228929, 0.000458267609458834, 
0.000167577299448064, 0.000156923611416451, 0.00172198880208503, 
0.000419249114560196, 0.000282768474948037, 0.00014576426910659, 
0.000142778289407966, 8.73631395649993e-05, 3.5284683834613e-06, 
2.15475710288619e-06, 0.0955420669759755, 0.0774398658640565, 
0.0474471147057807, 0.0304620231456308, 0.00545568682484317, 
0.00179682319502973, 0.00103024376238159, 8.23950829180388e-05, 
1.51039335091503e-05, 0.00351303578942051, 0.00142616994243657, 
0.000410704736521284, 2.92587127267419e-05, 0.2663633948935, 
0.266280763902189, 0.229686194730164, 0.147994771595005, 0.0902879153020956, 
0.0173953936620509, 0.000417680838208199, 7.95727332317548e-05, 
1.72752386410474e-05, 1.48603858110989e-05, 4.65119192514428e-06, 
0.000360184581635431, 0.000131174064405754, 9.80130106517029e-06, 
1.19967925037684e-05, 0.000129012376636663, 0.244414471464133, 
0.18255184868885, 0.164267141570654, 0.160970200874938, 0.147521686751831, 
0.143276153177212, 0.0892821454514031, 0.0867003221749986, 0.0821961201035531, 
0.0746613690455784, 0.0592053577133712, 0.00107390272887404, 
2.07431171633786e-05, 9.43258923288073e-07, 0.00345834096536738, 
0.0025606620918584, 0.00191102225615742, 0.000951775194258733, 
0.000204398313308255, 4.81062050876917e-06, 0.00545515468521031, 
0.000733434529761055, 0.000179255731601051, 5.84289949294256e-05, 
0.000162833019316739, 0.000131975409869888, 4.01860131755188e-05, 
5.6725590719059e-07, 1.97886296054109, 1.76020046106476, 0.967951882039117, 
0.633191666126054, 0.50172141324715, 0.474820062066878, 0.444966473299775, 
0.34363999584631, 0.336099093188823, 0.237215465105691, 0.00375267603882995, 
1.49270724610338e-05, 6.58041006358842e-05, 0.000453380716286491, 
0.00011969348860786, 2.00371827827334e-05, 2.26971678416193e-07
)), row.names = c(NA, -100L), groups = structure(list(Year = c(2007L, 
2007L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L, 
2011L, 2012L, 2012L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 
2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L
), ReporterName = c("Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola"), PartnerName = c("Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Bahamas, The", 
"Belgium", "Brazil", "Belgium", "Benin", "Brazil", "Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Australia", "Bahamas, The", 
"Belgium", "Brazil", "Afghanistan", "Albania", "Algeria", "Antigua and Barbuda", 
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", 
"Bahamas, The", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", 
"Botswana", "Brazil", "British Virgin Islands", "Brunei", "Bulgaria", 
"Afghanistan", "Algeria", "Antigua and Barbuda", "Argentina", 
"Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas, The", 
"Bahrain", "Bangladesh", "Barbados", "Belgium", "Belize", "Benin", 
"Bolivia", "Botswana", "Brazil", "Bulgaria", "Afghanistan", "Albania", 
"Algeria", "Andorra", "Antigua and Barbuda", "Argentina", "Armenia", 
"Australia", "Austria", "Azerbaijan", "Bahamas, The", "Bahrain", 
"Bangladesh", "Belgium", "Benin", "Bolivia", "Botswana", "Brazil", 
"Bulgaria", "Afghanistan", "Albania", "Algeria", "American Samoa", 
"Andorra", "Antigua and Barbuda", "Argentina", "Armenia", "Australia", 
"Austria", "Azerbaijan", "Bahamas, The", "Bahrain", "Bangladesh", 
"Belarus", "Belgium", "Belize", "Benin", "Bolivia", "Botswana", 
"Brazil", "Bulgaria"), .rows = structure(list(56L, 84L, 41L, 
    58L, 89L, 40L, 63L, 86L, 66L, 71L, 88L, 65L, 90L, 43L, 57L, 
    87L, 28L, 44L, 62L, 85L, 1L, 6L, 9L, 16L, 20L, 26L, 29L, 
    35L, 39L, 47L, 60L, 69L, 74L, 75L, 79L, 80L, 94L, 95L, 96L, 
    100L, 4L, 8L, 17L, 19L, 25L, 30L, 34L, 36L, 42L, 48L, 53L, 
    54L, 61L, 68L, 70L, 76L, 81L, 93L, 98L, 3L, 5L, 10L, 13L, 
    18L, 21L, 24L, 31L, 32L, 38L, 46L, 49L, 51L, 64L, 73L, 78L, 
    83L, 91L, 97L, 2L, 7L, 11L, 12L, 14L, 15L, 22L, 23L, 27L, 
    33L, 37L, 45L, 50L, 52L, 55L, 59L, 67L, 72L, 77L, 82L, 92L, 
    99L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), row.names = c(NA, 100L), class = c("tbl_df", "tbl", 
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"))

mean 的问题是数据中任何 ReporterName 的重复行。

问题一

total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts, 
  # chose right = FALSE to take cut before year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels  = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "avg_year_group_total"
  group_by(ReporterName, year_group) %>%
  mutate(dup = !duplicated(paste0(ReporterName, year_group, Total_Year)),
         total_year_group = sum(Total_Year * dup)/sum(dup)) %>%
  arrange(ReporterName,PartnerName, desc(Year))

# A tibble: 100 x 10
# Groups:   ReporterName, year_group [3]
    Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group dup   total_year_group
   <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>      <lgl>            <dbl>
 1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018  TRUE         34746014.
 2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018  TRUE         34746014.
 3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018  TRUE         34746014.
 4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018  TRUE         34746014.
 5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018  FALSE        34746014.
 6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018  FALSE        34746014.
 7  2015 Angola       Albania     ALB                            2.26   33924937.          0.00000666 2015-2018  FALSE        34746014.
 8  2018 Angola       Algeria     DZA                          245.     42096736.          0.000583   2015-2018  FALSE        34746014.
 9  2017 Angola       Algeria     DZA                         2242.     34904881.          0.00642    2015-2018  FALSE        34746014.
10  2016 Angola       Algeria     DZA                         2169.     28057500.          0.00773    2015-2018  FALSE        34746014.
# ... with 90 more rows

问题2

使用 tidyr 中的 complete。如果您可以显示所需的输出,我可以告诉您如何操作。