R:通过转置和使用两组列来构造数据table
R: construct data table by transposing and using two group columns
我有一个名为dt的数据table,如下所示(有点类似)。
dt:
| Gender| Year | Length| Weight | Athlete_flag | Age|
|:-- |:---- |:------|:------ |:------------ |:---|
| M | 2009 | 188 | 89 | 1 |17 |
| F | 2007 | 170 | 65 | 1 |19 |
| M |2007 |172 |90 |0 |45 |
|M |2017 |160 |70 |0 |34 |
|F |2017 |160 |70 |0 |24 |
|M |2018 |160 |70 |0 |58 |
|F |2016 |160 |70 |0 |49 |
|F |2017 |160 |70 |0 |37 |
我想创建一个名为 dt_new 的数据 table,其中包含前一个 table 中特定列按年份的描述性统计信息。我会有两个组列,“变量”和“年”。换句话说,table 如下所示:
dt_new:
Variable|Year |Number of observations| Min|Mean |Max|
:------ |:----|:---- |:---|:------ |:--|
Length | 2007|2 |170 |171 |172|
Length | 2009|1 |188 |188 |188|
Length | 2016|1 |160 |160 |162|
Length | 2017|3 |160 |160 |160|
Length | 2018|1 |160 |160 |160|
Weight | 2007|2 |65 |77.5 |90 |
Weight | 2009|1 |89 |89 |89 |
Weight | 2016|1 |70 |70 |70 |
Weight | 2017|3 |70 |70 |70 |
Weight | 2018|1 |70 |70 |70 |
Age | 2007|2 |19 |32 |45 |
Age | 2009|1 |17 |17 |17 |
Age | 2016|1 |49 |49 |49 |
Age | 2017|3 |24 |31.66 |37 |
Age | 2018|1 |58 |58 |58 |
我的计划是添加更多具有描述性统计信息的列,例如百分位数,例如P99。我用过 data.table 并且更愿意用它找到解决方案。我可以按计划创建一个名为 dt_incorrect 的 table,但不能按年份创建,请参见下文。我当前的代码是:
dt_incorrect <- dt[,.("Variable" = colnames(dt[,c("Length","Width","Age")]),
"Number of observations" = nrow(dt[,c("Length","Width","Age")]),
"Min" = lapply(dt[,c("Length","Width","Age")], function(x) min(x, na.rm = T)),
"Mean" = lapply(dt[,c("Length","Width","Age")], function(x) mean(x, na.rm = T)),
"Max" = lapply(dt[,c("Length","Width","Age")], function(x) max(x, na.rm = T)))]
dt_incorrect:
Variable |Number of observations| Min |Mean |Max |
:------ |:---- |:------ |:------ |:---|
Length |8 |160 |166.25 |188 |
Weight |8 |65 |74.25 |90 |
Age |8 |17 |35.375 |58 |
预先感谢您提供有关如何解决此问题的所有建议!
类似这样的事情(假设 srctable
是你上面的起始 table):
# get the metrics for each of the columns of interest, by Year
srctable <- srctable[,
lapply(.SD, function(x) c(.N, min(x, na.rm=T), mean(x, na.rm=T), max(x, na.rm=T))),
.SDcols=c("Length","Weight", "Age"),
by="Year"
]
# Add a column that "labels" the metrics created
srctable[, metric:=c("N", "Min", "Mean", "Max"), by=Year]
# Use a combination of dcast and melt to rearrange
dcast(
melt(srctable, id.vars = c("Year", "metric"), measure.vars = c("Length", "Age", "Weight")),
Year+variable~metric,value.var = "value"
)
输出:
Year variable Max Mean Min N
1: 2007 Length 172 171.00000 170 2
2: 2007 Age 45 32.00000 19 2
3: 2007 Weight 90 77.50000 65 2
4: 2009 Length 188 188.00000 188 1
5: 2009 Age 17 17.00000 17 1
6: 2009 Weight 89 89.00000 89 1
7: 2016 Length 160 160.00000 160 1
8: 2016 Age 49 49.00000 49 1
9: 2016 Weight 70 70.00000 70 1
10: 2017 Length 160 160.00000 160 3
11: 2017 Age 37 31.66667 24 3
12: 2017 Weight 70 70.00000 70 3
13: 2018 Length 160 160.00000 160 1
14: 2018 Age 58 58.00000 58 1
15: 2018 Weight 70 70.00000 70 1
简单-data.table
:
library(data.table)
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)),
by = .(Variable, Year)]
# Variable Year Num Min Mean Max
# <fctr> <int> <int> <int> <num> <int>
# 1: Length 2009 1 188 188.00000 188
# 2: Length 2007 2 170 171.00000 172
# 3: Length 2017 3 160 160.00000 160
# 4: Length 2018 1 160 160.00000 160
# 5: Length 2016 1 160 160.00000 160
# 6: Weight 2009 1 89 89.00000 89
# 7: Weight 2007 2 65 77.50000 90
# 8: Weight 2017 3 70 70.00000 70
# 9: Weight 2018 1 70 70.00000 70
# 10: Weight 2016 1 70 70.00000 70
# 11: Athlete_flag 2009 1 1 1.00000 1
# 12: Athlete_flag 2007 2 0 0.50000 1
# 13: Athlete_flag 2017 3 0 0.00000 0
# 14: Athlete_flag 2018 1 0 0.00000 0
# 15: Athlete_flag 2016 1 0 0.00000 0
# 16: Age 2009 1 17 17.00000 17
# 17: Age 2007 2 19 32.00000 45
# 18: Age 2017 3 24 31.66667 37
# 19: Age 2018 1 58 58.00000 58
# 20: Age 2016 1 49 49.00000 49
# Variable Year Num Min Mean Max
如果统计数据的数量有些变化,您可以有效地做同样的事情,但在函数列表上使用 lapply
:
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, lapply(list(Num=length, Min=min, Mean=mean, Max=max),
function(f) f(value)),
by = .(Variable, Year)]
只要所有参数都采用相同的参数,这种方法就可以正常工作;例如,最后三个接受 na.rm=TRUE
但 length
不接受。在这种情况下,可以用
缩短一点
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, c(Num = .N, lapply(list(Min=min, Mean=mean, Max=max),
function(f) f(value, na.rm=TRUE))),
by = .(Variable, Year)]
(如果需要 na.rm=
)。例如,这支持任意使用大多数基本统计函数,包括 median
和 var
/sd
,尽管 quantile
仍然需要另一个参数 probs=
。如果需要,即使这也不难做到。
还有许多其他方法可以调整此方法(应用于函数列表而不是数据列表)。
数据:
dt <- setDT(structure(list(Gender = c("M", "F", "M", "M", "F", "M", "F", "F"), Year = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L), Length = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L), Weight = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L), Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Age = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)), row.names = c(NA, -8L), class = c("data.table", "data.frame")))
我有一个名为dt的数据table,如下所示(有点类似)。
dt:
| Gender| Year | Length| Weight | Athlete_flag | Age|
|:-- |:---- |:------|:------ |:------------ |:---|
| M | 2009 | 188 | 89 | 1 |17 |
| F | 2007 | 170 | 65 | 1 |19 |
| M |2007 |172 |90 |0 |45 |
|M |2017 |160 |70 |0 |34 |
|F |2017 |160 |70 |0 |24 |
|M |2018 |160 |70 |0 |58 |
|F |2016 |160 |70 |0 |49 |
|F |2017 |160 |70 |0 |37 |
我想创建一个名为 dt_new 的数据 table,其中包含前一个 table 中特定列按年份的描述性统计信息。我会有两个组列,“变量”和“年”。换句话说,table 如下所示:
dt_new:
Variable|Year |Number of observations| Min|Mean |Max|
:------ |:----|:---- |:---|:------ |:--|
Length | 2007|2 |170 |171 |172|
Length | 2009|1 |188 |188 |188|
Length | 2016|1 |160 |160 |162|
Length | 2017|3 |160 |160 |160|
Length | 2018|1 |160 |160 |160|
Weight | 2007|2 |65 |77.5 |90 |
Weight | 2009|1 |89 |89 |89 |
Weight | 2016|1 |70 |70 |70 |
Weight | 2017|3 |70 |70 |70 |
Weight | 2018|1 |70 |70 |70 |
Age | 2007|2 |19 |32 |45 |
Age | 2009|1 |17 |17 |17 |
Age | 2016|1 |49 |49 |49 |
Age | 2017|3 |24 |31.66 |37 |
Age | 2018|1 |58 |58 |58 |
我的计划是添加更多具有描述性统计信息的列,例如百分位数,例如P99。我用过 data.table 并且更愿意用它找到解决方案。我可以按计划创建一个名为 dt_incorrect 的 table,但不能按年份创建,请参见下文。我当前的代码是:
dt_incorrect <- dt[,.("Variable" = colnames(dt[,c("Length","Width","Age")]),
"Number of observations" = nrow(dt[,c("Length","Width","Age")]),
"Min" = lapply(dt[,c("Length","Width","Age")], function(x) min(x, na.rm = T)),
"Mean" = lapply(dt[,c("Length","Width","Age")], function(x) mean(x, na.rm = T)),
"Max" = lapply(dt[,c("Length","Width","Age")], function(x) max(x, na.rm = T)))]
dt_incorrect:
Variable |Number of observations| Min |Mean |Max |
:------ |:---- |:------ |:------ |:---|
Length |8 |160 |166.25 |188 |
Weight |8 |65 |74.25 |90 |
Age |8 |17 |35.375 |58 |
预先感谢您提供有关如何解决此问题的所有建议!
类似这样的事情(假设 srctable
是你上面的起始 table):
# get the metrics for each of the columns of interest, by Year
srctable <- srctable[,
lapply(.SD, function(x) c(.N, min(x, na.rm=T), mean(x, na.rm=T), max(x, na.rm=T))),
.SDcols=c("Length","Weight", "Age"),
by="Year"
]
# Add a column that "labels" the metrics created
srctable[, metric:=c("N", "Min", "Mean", "Max"), by=Year]
# Use a combination of dcast and melt to rearrange
dcast(
melt(srctable, id.vars = c("Year", "metric"), measure.vars = c("Length", "Age", "Weight")),
Year+variable~metric,value.var = "value"
)
输出:
Year variable Max Mean Min N
1: 2007 Length 172 171.00000 170 2
2: 2007 Age 45 32.00000 19 2
3: 2007 Weight 90 77.50000 65 2
4: 2009 Length 188 188.00000 188 1
5: 2009 Age 17 17.00000 17 1
6: 2009 Weight 89 89.00000 89 1
7: 2016 Length 160 160.00000 160 1
8: 2016 Age 49 49.00000 49 1
9: 2016 Weight 70 70.00000 70 1
10: 2017 Length 160 160.00000 160 3
11: 2017 Age 37 31.66667 24 3
12: 2017 Weight 70 70.00000 70 3
13: 2018 Length 160 160.00000 160 1
14: 2018 Age 58 58.00000 58 1
15: 2018 Weight 70 70.00000 70 1
简单-data.table
:
library(data.table)
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)),
by = .(Variable, Year)]
# Variable Year Num Min Mean Max
# <fctr> <int> <int> <int> <num> <int>
# 1: Length 2009 1 188 188.00000 188
# 2: Length 2007 2 170 171.00000 172
# 3: Length 2017 3 160 160.00000 160
# 4: Length 2018 1 160 160.00000 160
# 5: Length 2016 1 160 160.00000 160
# 6: Weight 2009 1 89 89.00000 89
# 7: Weight 2007 2 65 77.50000 90
# 8: Weight 2017 3 70 70.00000 70
# 9: Weight 2018 1 70 70.00000 70
# 10: Weight 2016 1 70 70.00000 70
# 11: Athlete_flag 2009 1 1 1.00000 1
# 12: Athlete_flag 2007 2 0 0.50000 1
# 13: Athlete_flag 2017 3 0 0.00000 0
# 14: Athlete_flag 2018 1 0 0.00000 0
# 15: Athlete_flag 2016 1 0 0.00000 0
# 16: Age 2009 1 17 17.00000 17
# 17: Age 2007 2 19 32.00000 45
# 18: Age 2017 3 24 31.66667 37
# 19: Age 2018 1 58 58.00000 58
# 20: Age 2016 1 49 49.00000 49
# Variable Year Num Min Mean Max
如果统计数据的数量有些变化,您可以有效地做同样的事情,但在函数列表上使用 lapply
:
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, lapply(list(Num=length, Min=min, Mean=mean, Max=max),
function(f) f(value)),
by = .(Variable, Year)]
只要所有参数都采用相同的参数,这种方法就可以正常工作;例如,最后三个接受 na.rm=TRUE
但 length
不接受。在这种情况下,可以用
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, c(Num = .N, lapply(list(Min=min, Mean=mean, Max=max),
function(f) f(value, na.rm=TRUE))),
by = .(Variable, Year)]
(如果需要 na.rm=
)。例如,这支持任意使用大多数基本统计函数,包括 median
和 var
/sd
,尽管 quantile
仍然需要另一个参数 probs=
。如果需要,即使这也不难做到。
还有许多其他方法可以调整此方法(应用于函数列表而不是数据列表)。
数据:
dt <- setDT(structure(list(Gender = c("M", "F", "M", "M", "F", "M", "F", "F"), Year = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L), Length = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L), Weight = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L), Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Age = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)), row.names = c(NA, -8L), class = c("data.table", "data.frame")))