How do I avoid a slow loop with a large data set?
Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
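As you can see, half of the rows carry redundant information. For a small data set like this one, removing the duplicates is easy: I can use the coalesce function (dplyr package), drop the "action" column, and then remove all the irrelevant rows. There are many other ways to do it, though. The end result should look like this: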
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
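The problem is that my real data set is much larger (102000 x 270) and has more variables. The real data is also more irregular, with more missing values. The coalesce function seems slow. The best loop I have managed so far still takes 5-10 minutes to run.
Is there a simple, faster way? I feel there must be some function in R for this kind of operation, but I can't find it.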
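I think you need dcast. The version in the data.table library calls itself "fast", and in my experience it is fast on large data sets.
First, let's create a single column that holds either signature_date or ratification_date, depending on the action: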
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the actions become columns and the values are the dates:
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
The result, wide, looks like this:
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
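The cast columns come out named ratification and signature, in alphabetical order, rather than in the layout the question asks for. A minimal sketch of how the names and order might be restored with data.table's setnames() and setcolorder() follows; it rebuilds the toy data from the question and is only an illustration, not part of the original answer.

```r
library(data.table)

# Same toy data as in the question
DATA <- data.frame(
  Agreement_number  = c(1, 1, 1, 1, 2, 2, 2, 2),
  country           = c("Canada", "Canada", "USA", "USA", "Canada", "Canada", "USA", "USA"),
  action            = rep(c("signature", "ratification"), times = 4),
  signature_date    = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002)
)

# Steps from the answer: build one date column, then cast to wide form
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = "date")

# Rename the cast columns back to the names used in the question ...
setnames(wide,
         old = c("signature", "ratification"),
         new = c("signature_date", "ratification_date"))
# ... and put signature_date before ratification_date
setcolorder(wide, c("Agreement_number", "country", "signature_date", "ratification_date"))
wide
```

Both functions modify wide by reference, so no extra copy of a large table should be needed.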
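The OP has said that his production data has 100k rows x 270 columns and that speed is a concern. Therefore, I suggest using data.table.
I am aware that another answer has also suggested using data.table and dcast(), but the solution below takes a different approach. It puts the rows in the proper order and copies ratification_date to the signature row. After some clean-up, we get the desired result.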
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
by = c("Agreement_number", "country")][
# keep only signature rows, remove action column
action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
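The shift() approach relies on every agreement and country having exactly one signature row followed by one ratification row. The question mentions that the real data is more irregular, so it may be worth checking that assumption first. The sketch below uses a small hypothetical table (irregular, with made-up rows) to show one way of listing the groups that break the pattern.

```r
library(data.table)

# Hypothetical irregular data: agreement 2 has a duplicated signature row,
# agreement 3 has a ratification row only
irregular <- data.table(
  Agreement_number = c(1, 1, 2, 2, 2, 3),
  country = c("Canada", "Canada", "USA", "USA", "USA", "Canada"),
  action  = c("signature", "ratification",
              "signature", "signature", "ratification",
              "ratification")
)

# Count the rows of each action type per agreement and country
counts <- irregular[, .(
  n_signature    = sum(action == "signature"),
  n_ratification = sum(action == "ratification")
), by = .(Agreement_number, country)]

# Groups that do not have exactly one row of each action
problems <- counts[n_signature != 1L | n_ratification != 1L]
problems
```

If problems is empty, the lead/filter steps above should behave as on the toy data; otherwise those groups would need separate handling, since filtering on action == "signature" silently drops groups without a signature row.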
Data
The OP mentioned that his production data has 270 columns. To simulate this, I have added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used so that the sampling gives reproducible results.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
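Addendum: dcast() with additional columns
Another answer suggested using data.table and dcast(). Apart from several other flaws in that answer, it does not handle the additional columns mentioned by the OP.
The dcast() approach below also returns the additional columns: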
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
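Note that this approach changes the order of the columns.
Here is another data.table solution, using uwe-block's data.frame. It is similar to uwe-block's approach, but uses max to collapse the data.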
# convert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
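lapply runs through the variables not included in the by statement and computes the maximum after removing NA values. The next link in the chain drops the unneeded action variable, and the final (unnecessary) link prints the output.
This returns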
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
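One caveat with collapsing by max: for a numeric column, max(x, na.rm = TRUE) returns -Inf with a warning when every value in a group is NA, which can happen if, say, an agreement has been signed but not yet ratified. The sketch below is one possible guard, not part of the original answer; safe_max is a hypothetical helper and DT is made-up data.

```r
library(data.table)

# Hypothetical data: agreement 2 / USA has been signed but not ratified
DT <- data.table(
  Agreement_number  = c(1, 1, 2),
  country           = c("Canada", "Canada", "USA"),
  action            = c("signature", "ratification", "signature"),
  signature_date    = c(2000, NA, 2002),
  ratification_date = c(NA, 2001, NA)
)

# Hypothetical helper: like max(x, na.rm = TRUE), but returns NA of the
# same type as x when the group has no non-missing value
safe_max <- function(x) if (all(is.na(x))) x[1L] else max(x, na.rm = TRUE)

# Collapse only the date columns; .SDcols leaves action out of .SD,
# so it does not have to be dropped afterwards
collapsed <- DT[, lapply(.SD, safe_max),
                by = .(Agreement_number, country),
                .SDcols = c("signature_date", "ratification_date")]
collapsed
```

With 270 columns, .SDcols could instead be given the names of all columns except the grouping columns and action.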