How to expand/aggregate a data.table while including the existing row values?
I have the following R data.table:
library(data.table)
dt =
unique_point biased data_points team groupID
1: up1 FALSE 3 1 xy28352
2: up1 TRUE 4 22 xy28352
3: up2 FALSE 1 4 xy28352
4: up2 TRUE 0 3 xy28352
5: up3 FALSE 12 5 xy28352
6: up3 TRUE 35 7 xy28352
....
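I have formatted the data.table so that for each unique_point I am measuring data points for both the unbiased and the biased experiment. Each unique_point therefore has two rows, one with biased FALSE and one with biased TRUE. If there are no measurements, this is recorded as 0.
For example, for up1 the unbiased experiment has 3 data points and the biased experiment has 4 data points.
Each groupID has 25 teams, and each team has potential measurements for biased and unbiased. I would like to reformat the data.table so that it also lists the number of data points per team for each unique point (given the data, this will make data_points 0 for most rows):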
unique_point biased data_points team groupID
1: up1 FALSE 3 1 xy28352
2: up1 TRUE 0 1 xy28352
3: up1 FALSE 0 2 xy28352
4: up1 TRUE 0 2 xy28352
5: up1 FALSE 0 3 xy28352
6: up1 TRUE 0 3 xy28352
....
44: up1 TRUE 4 22 xy28352
....
49: up1 FALSE 0 25 xy28352
50: up1 TRUE 0 25 xy28352
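This task is in some ways very close to "unfolding" the data.table. For each unique_point, I would create 50 rows: 25 teams, each with biased TRUE and FALSE. The added complication is that I need to use the counts above to fill in data_points for the rows that already exist.
Shouldn't there be a way to use unique() to count the number of times a row could exist?
If I try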
setkey(dt, team, unique_point)[CJ(unique(unique_point), unique(team)), .N, by=.EACHI]
I am counting the number of rows in which each unique_point and team combination occurs. But this does not retain data_points.
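For reference, a minimal sketch of how the by = .EACHI idea could be adapted so that data_points is kept rather than replaced by a row count. It assumes the sample data from this thread, uses on = instead of setkey(), and the names grid and counts are made up for illustration; it is one possible variant, not necessarily the best one.

```r
library(data.table)

# Sample data from the question
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 1 xy28352
up1 TRUE 4 22 xy28352
up2 FALSE 1 4 xy28352
up2 TRUE 0 3 xy28352
up3 FALSE 12 5 xy28352
up3 TRUE 35 7 xy28352')

# Every unique_point / team / biased combination (3 * 25 * 2 rows)
grid <- CJ(unique_point = unique(DT$unique_point),
           team = 1:25,
           biased = c(FALSE, TRUE))

# For each row of grid, sum the matching data_points.
# Rows of grid without a match see data_points as NA, so na.rm = TRUE turns them into 0.
counts <- DT[grid, on = .(unique_point, team, biased),
             .(data_points = sum(data_points, na.rm = TRUE)),
             by = .EACHI]

nrow(counts)
counts[data_points > 0]
```

Summing in j instead of returning .N is what preserves the measured values; groupID is not carried along in this sketch and would have to be added to the grid if it is needed.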
Using:
DT2 <- DT[, .SD[CJ(team = 1:25, biased = biased, unique = TRUE), on = .(biased, team)], by = .(unique_point, groupID)
][is.na(data_points), data_points := 0][]
setcolorder(DT2, c(1,3:5,2))
gives:
> DT2
unique_point biased data_points team groupID
1: up1 FALSE 3 1 xy28352
2: up1 TRUE 0 1 xy28352
3: up1 FALSE 0 2 xy28352
4: up1 TRUE 0 2 xy28352
5: up1 FALSE 0 3 xy28352
---
146: up3 TRUE 0 23 xy28352
147: up3 FALSE 0 24 xy28352
148: up3 TRUE 0 24 xy28352
149: up3 FALSE 0 25 xy28352
150: up3 TRUE 0 25 xy28352
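What this does:
- You group DT by unique_point and groupID with by = .(unique_point, groupID).
- Within each group, the remaining columns (.SD) are joined with the complete reference table of biased and team (CJ(team = 1:25, biased = biased)).
- The expanded dataset will have NA values for the rows that were not present in DT, so you fill these with zero using the [is.na(data_points), data_points := 0] part.
- The last pair of square brackets ([]) is not necessary, but it saves the extra step otherwise needed to print the result to the console. For more info, see here.
Using setcolorder(DT2, c(1,3:5,2)) is not required; it is only needed if you want exactly the same column order as described in the question.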
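To make the steps above more concrete, here is a rough sketch that performs them by hand for a single group (unique_point == "up1"). The intermediate names sd_up1, ref and expanded are made up for illustration, and the biased values are written out explicitly as c(FALSE, TRUE), which is an assumption that both values are wanted for every group.

```r
library(data.table)

# Sample data from the question
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 1 xy28352
up1 TRUE 4 22 xy28352
up2 FALSE 1 4 xy28352
up2 TRUE 0 3 xy28352
up3 FALSE 12 5 xy28352
up3 TRUE 35 7 xy28352')

# Step 1: roughly what .SD holds for the group unique_point == "up1"
sd_up1 <- DT[unique_point == "up1", .(biased, data_points, team)]

# Step 2: reference table with every team / biased combination (25 * 2 rows)
ref <- CJ(team = 1:25, biased = c(FALSE, TRUE))

# Step 3: X[i] join; every row of ref is kept, unmatched rows get NA in data_points
expanded <- sd_up1[ref, on = .(biased, team)]

# Step 4: replace the NA values with zero (0L because data_points is an integer column)
expanded[is.na(data_points), data_points := 0L]

nrow(expanded)
expanded[data_points > 0]
```

The by = .(unique_point, groupID) in the one-liner simply repeats steps 1 to 3 for each group and stacks the results.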
As an alternative, you could also use:
DT2 <- DT[CJ(unique_point = unique_point, biased = biased, team = 1:25, groupID = groupID, unique = TRUE),
on = .(unique_point, biased, team, groupID)
][is.na(data_points), data_points := 0][]
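A quick way to check that the two approaches produce the same set of rows on the sample data is sketched below. The names by_group, one_join and cols are made up for illustration. Two caveats, as far as I can tell: CJ() sorts its result by its columns in the order they are given, so the single join returns rows with biased varying more slowly than team unless team is listed before biased; and because the single join also crosses unique_point with groupID, it may create combinations that do not exist once there is more than one groupID.

```r
library(data.table)

# Sample data from the question
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 1 xy28352
up1 TRUE 4 22 xy28352
up2 FALSE 1 4 xy28352
up2 TRUE 0 3 xy28352
up3 FALSE 12 5 xy28352
up3 TRUE 35 7 xy28352')

# Approach 1: join inside each (unique_point, groupID) group
by_group <- DT[, .SD[CJ(team = 1:25, biased = biased, unique = TRUE),
                     on = .(biased, team)],
               by = .(unique_point, groupID)
               ][is.na(data_points), data_points := 0L][]

# Approach 2: one join against the full cross join of all identifying columns
one_join <- DT[CJ(unique_point = unique_point, biased = biased, team = 1:25,
                  groupID = groupID, unique = TRUE),
               on = .(unique_point, biased, team, groupID)
               ][is.na(data_points), data_points := 0L][]

# Put the columns in the same order; fsetequal() ignores row order
cols <- c("unique_point", "biased", "data_points", "team", "groupID")
setcolorder(by_group, cols)
setcolorder(one_join, cols)

fsetequal(by_group, one_join)  # should be TRUE if both approaches agree
```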
The first 60 rows in full:
> DT2[1:60]
unique_point biased data_points team groupID
1: up1 FALSE 3 1 xy28352
2: up1 TRUE 0 1 xy28352
3: up1 FALSE 0 2 xy28352
4: up1 TRUE 0 2 xy28352
5: up1 FALSE 0 3 xy28352
6: up1 TRUE 0 3 xy28352
7: up1 FALSE 0 4 xy28352
8: up1 TRUE 0 4 xy28352
9: up1 FALSE 0 5 xy28352
10: up1 TRUE 0 5 xy28352
11: up1 FALSE 0 6 xy28352
12: up1 TRUE 0 6 xy28352
13: up1 FALSE 0 7 xy28352
14: up1 TRUE 0 7 xy28352
15: up1 FALSE 0 8 xy28352
16: up1 TRUE 0 8 xy28352
17: up1 FALSE 0 9 xy28352
18: up1 TRUE 0 9 xy28352
19: up1 FALSE 0 10 xy28352
20: up1 TRUE 0 10 xy28352
21: up1 FALSE 0 11 xy28352
22: up1 TRUE 0 11 xy28352
23: up1 FALSE 0 12 xy28352
24: up1 TRUE 0 12 xy28352
25: up1 FALSE 0 13 xy28352
26: up1 TRUE 0 13 xy28352
27: up1 FALSE 0 14 xy28352
28: up1 TRUE 0 14 xy28352
29: up1 FALSE 0 15 xy28352
30: up1 TRUE 0 15 xy28352
31: up1 FALSE 0 16 xy28352
32: up1 TRUE 0 16 xy28352
33: up1 FALSE 0 17 xy28352
34: up1 TRUE 0 17 xy28352
35: up1 FALSE 0 18 xy28352
36: up1 TRUE 0 18 xy28352
37: up1 FALSE 0 19 xy28352
38: up1 TRUE 0 19 xy28352
39: up1 FALSE 0 20 xy28352
40: up1 TRUE 0 20 xy28352
41: up1 FALSE 0 21 xy28352
42: up1 TRUE 0 21 xy28352
43: up1 FALSE 0 22 xy28352
44: up1 TRUE 4 22 xy28352
45: up1 FALSE 0 23 xy28352
46: up1 TRUE 0 23 xy28352
47: up1 FALSE 0 24 xy28352
48: up1 TRUE 0 24 xy28352
49: up1 FALSE 0 25 xy28352
50: up1 TRUE 0 25 xy28352
51: up2 FALSE 0 1 xy28352
52: up2 TRUE 0 1 xy28352
53: up2 FALSE 0 2 xy28352
54: up2 TRUE 0 2 xy28352
55: up2 FALSE 0 3 xy28352
56: up2 TRUE 0 3 xy28352
57: up2 FALSE 1 4 xy28352
58: up2 TRUE 0 4 xy28352
59: up2 FALSE 0 5 xy28352
60: up2 TRUE 0 5 xy28352
Used data:
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 1 xy28352
up1 TRUE 4 22 xy28352
up2 FALSE 1 4 xy28352
up2 TRUE 0 3 xy28352
up3 FALSE 12 5 xy28352
up3 TRUE 35 7 xy28352')