R: construct data table by transposing and using two group columns

I have a data.table named dt that looks (roughly) like this.

dt:

| Gender | Year | Length | Weight | Athlete_flag | Age |
|:-------|:-----|:-------|:-------|:-------------|:----|
| M      | 2009 | 188    | 89     | 1            | 17  |
| F      | 2007 | 170    | 65     | 1            | 19  |
| M      | 2007 | 172    | 90     | 0            | 45  |
| M      | 2017 | 160    | 70     | 0            | 34  |
| F      | 2017 | 160    | 70     | 0            | 24  |
| M      | 2018 | 160    | 70     | 0            | 58  |
| F      | 2016 | 160    | 70     | 0            | 49  |
| F      | 2017 | 160    | 70     | 0            | 37  |

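I want to create a data.table named dt_new that contains descriptive statistics, by year, for specific columns of the previous table. I would have two grouping columns, "Variable" and "Year". In other words, the table would look like this: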

dt_new:

Variable|Year |Number of observations| Min|Mean    |Max|
:------ |:----|:----                 |:---|:------ |:--|
Length  | 2007|2                     |170 |171     |172|
Length  | 2009|1                     |188 |188     |188|
Length  | 2016|1                     |160 |160     |160|
Length  | 2017|3                     |160 |160     |160|
Length  | 2018|1                     |160 |160     |160|
Weight  | 2007|2                     |65  |77.5    |90 |
Weight  | 2009|1                     |89  |89      |89 |
Weight  | 2016|1                     |70  |70      |70 |
Weight  | 2017|3                     |70  |70      |70 |
Weight  | 2018|1                     |70  |70      |70 |
Age     | 2007|2                     |19  |32      |45 |
Age     | 2009|1                     |17  |17      |17 |
Age     | 2016|1                     |49  |49      |49 |
Age     | 2017|3                     |24  |31.66   |37 |
Age     | 2018|1                     |58  |58      |58 |

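I plan to add more columns with descriptive statistics, for example percentiles such as P99. I have been using data.table and would prefer a solution based on it. I can create a table named dt_incorrect along these lines, but not broken down by year; see below. My current code is: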

dt_incorrect <- dt[,.("Variable" = colnames(dt[,c("Length","Weight","Age")]),
"Number of observations" = nrow(dt[,c("Length","Weight","Age")]),
"Min" = lapply(dt[,c("Length","Weight","Age")], function(x) min(x, na.rm = TRUE)),
"Mean" = lapply(dt[,c("Length","Weight","Age")], function(x) mean(x, na.rm = TRUE)),
"Max" = lapply(dt[,c("Length","Weight","Age")], function(x) max(x, na.rm = TRUE)))]

dt_incorrect:

Variable |Number of observations| Min    |Mean    |Max |
:------  |:----                 |:------ |:------ |:---|
Length   |8                     |160     |166.25  |188 |
Weight   |8                     |65      |74.25   |90  |
Age      |8                     |17      |35.375  |58  |

Thanks in advance for any suggestions on how to solve this!

Something like this (assuming srctable is your starting table above):

# get the metrics for each of the columns of interest, by Year
srctable <- srctable[, 
                     lapply(.SD, function(x) c(.N, min(x, na.rm=TRUE), mean(x, na.rm=TRUE), max(x, na.rm=TRUE))),
                     .SDcols=c("Length","Weight", "Age"),
                     by="Year"
                     ]

# Add a column that "labels" the metrics created
srctable[, metric:=c("N", "Min", "Mean", "Max"), by=Year]

# Use a combination of dcast and melt to rearrange
dcast(
  melt(srctable, id.vars = c("Year", "metric"), measure.vars = c("Length", "Age", "Weight")),
  Year+variable~metric,value.var = "value"
)

Output:

    Year variable Max      Mean Min N
 1: 2007   Length 172 171.00000 170 2
 2: 2007      Age  45  32.00000  19 2
 3: 2007   Weight  90  77.50000  65 2
 4: 2009   Length 188 188.00000 188 1
 5: 2009      Age  17  17.00000  17 1
 6: 2009   Weight  89  89.00000  89 1
 7: 2016   Length 160 160.00000 160 1
 8: 2016      Age  49  49.00000  49 1
 9: 2016   Weight  70  70.00000  70 1
10: 2017   Length 160 160.00000 160 3
11: 2017      Age  37  31.66667  24 3
12: 2017   Weight  70  70.00000  70 3
13: 2018   Length 160 160.00000 160 1
14: 2018      Age  58  58.00000  58 1
15: 2018   Weight  70  70.00000  70 1

A simple data.table approach:

library(data.table)
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)), 
     by = .(Variable, Year)]
#         Variable  Year   Num   Min      Mean   Max
#           <fctr> <int> <int> <int>     <num> <int>
#  1:       Length  2009     1   188 188.00000   188
#  2:       Length  2007     2   170 171.00000   172
#  3:       Length  2017     3   160 160.00000   160
#  4:       Length  2018     1   160 160.00000   160
#  5:       Length  2016     1   160 160.00000   160
#  6:       Weight  2009     1    89  89.00000    89
#  7:       Weight  2007     2    65  77.50000    90
#  8:       Weight  2017     3    70  70.00000    70
#  9:       Weight  2018     1    70  70.00000    70
# 10:       Weight  2016     1    70  70.00000    70
# 11: Athlete_flag  2009     1     1   1.00000     1
# 12: Athlete_flag  2007     2     0   0.50000     1
# 13: Athlete_flag  2017     3     0   0.00000     0
# 14: Athlete_flag  2018     1     0   0.00000     0
# 15: Athlete_flag  2016     1     0   0.00000     0
# 16:          Age  2009     1    17  17.00000    17
# 17:          Age  2007     2    19  32.00000    45
# 18:          Age  2017     3    24  31.66667    37
# 19:          Age  2018     1    58  58.00000    58
# 20:          Age  2016     1    49  49.00000    49
#         Variable  Year   Num   Min      Mean   Max
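
The output above also summarises Athlete_flag, because every column that is not an id variable gets melted. As a minimal sketch of one way to get closer to the layout asked for in the question, the melt can be restricted with `measure.vars` and the result sorted with `setorder`. The names `long` and `dt_new` are just illustrative, and the sample data is rebuilt inline so the block runs on its own.

```r
library(data.table)

# Sample data from the question, rebuilt here so the example is self-contained
dt <- data.table(
  Gender       = c("M", "F", "M", "M", "F", "M", "F", "F"),
  Year         = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L),
  Length       = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L),
  Weight       = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L),
  Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  Age          = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)
)

# Melt only the columns of interest; Gender and Athlete_flag are dropped
long <- melt(dt, id.vars = "Year",
             measure.vars = c("Length", "Weight", "Age"),
             variable.name = "Variable")

# One row of statistics per (Variable, Year) group
dt_new <- long[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)),
               by = .(Variable, Year)]

# Variable is a factor whose levels follow measure.vars, so this sorts
# Length, Weight, Age and then by Year within each variable
setorder(dt_new, Variable, Year)
print(dt_new)
```

Using `keyby = .(Variable, Year)` instead of `by` plus `setorder` should give the same ordering, at the cost of also setting a key on the result.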

If the set of statistics varies somewhat, you can do effectively the same thing, but using lapply over a list of functions:

melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, lapply(list(Num=length, Min=min, Mean=mean, Max=max),
             function(f) f(value)),
    by = .(Variable, Year)]

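This approach works well as long as all of the functions take the same arguments; for instance, the last three accept na.rm=TRUE, whereas length does not. In that case, it can be shortened a little with
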
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, c(Num = .N, lapply(list(Min=min, Mean=mean, Max=max), 
                         function(f) f(value, na.rm=TRUE))),
     by = .(Variable, Year)]

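(if na.rm= is needed). This supports arbitrary use of most basic statistical functions, including median and var/sd, though quantile additionally requires a probs= argument. Even that is not hard to accommodate if needed.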
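
As a rough sketch of how the P99 mentioned in the question could fit into the function-list approach: wrap `quantile` in a small helper that fixes `probs`, so it takes the same arguments as `min`, `mean` and `max`. The helper `p99` and the names `stats`, `long` and `dt_new` are hypothetical and not part of data.table. The `as.numeric()` call is a precaution, since data.table expects each result column to have the same type in every group.

```r
library(data.table)

# Sample data from the question, rebuilt here so the example is self-contained
dt <- data.table(
  Gender       = c("M", "F", "M", "M", "F", "M", "F", "F"),
  Year         = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L),
  Length       = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L),
  Weight       = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L),
  Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  Age          = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)
)

# Hypothetical helper: 99th percentile with the same (x, na.rm) interface as
# min/mean/max; as.numeric() forces a double result in every group
p99 <- function(x, na.rm = FALSE) {
  as.numeric(quantile(x, probs = 0.99, na.rm = na.rm, names = FALSE))
}

# Every function in this list is called as f(value, na.rm = TRUE)
stats <- list(Min = min, Mean = mean, Max = max, P99 = p99)

long <- melt(dt, id.vars = "Year",
             measure.vars = c("Length", "Weight", "Age"),
             variable.name = "Variable")

# .N is added separately because it takes no na.rm argument
dt_new <- long[, c(list(Num = .N),
                   lapply(stats, function(f) f(value, na.rm = TRUE))),
               by = .(Variable, Year)]
print(dt_new)
```

Other percentiles could be added the same way, for example with one small wrapper per `probs` value in the `stats` list.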

There are many other ways to adapt this approach (applying over a list of functions rather than a list of data).


Data:

dt <- setDT(structure(list(Gender = c("M", "F", "M", "M", "F", "M", "F", "F"), Year = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L), Length = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L), Weight = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L), Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Age = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)), row.names = c(NA, -8L), class = c("data.table", "data.frame")))