R: How to vectorize a function that depends on other observations
Hello, I have a dataset as follows:
set.seed(100)
library(microbenchmark)
City=c("City1","City2","City2","City1","City2","City1","City2","City1")
Business=c("B","A","B","A","C","A","E","F")
SomeNumber=c(35,20,15,19,12,40,36,28)
zz=data.frame(City,Business,SomeNumber)
zz_new=do.call("rbind", replicate(1000,zz, simplify = FALSE))
zz_new$BusinessMax=0 #Initializing final variable of interest at 0
I simply replicate the rows of the data frame zz 1000 times so that I can measure performance later.
I also have a custom function:
City1=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & !full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)
  return(NewSet2)
}
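What I want to do is apply the custom function only to those rows of zz_new where City == "City1".
I can create a logical vector i1 that records whether each row satisfies this condition: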
i1 <- zz_new[["City"]] == "City1"
Next, and this is where I need help, I wrote a for loop (which takes a very long time):
for (i in 1:nrow(zz_new[i1,])){
  zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
}
zz_new[i1,]
The code above gives the correct answer. However, it is extremely slow and inefficient. I ran microbenchmark and got:
microbenchmark(
  for (i in 1:nrow(zz_new[i1,])){
    zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
  },times = 5)
min lq mean median uq max neval
4.369269 4.400759 4.433388 4.401734 4.450246 4.54493 5
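How should I vectorize the function City1? In my actual code I need several condition checks inside City1 (here I only use two columns, City and Business, to subset the data, but I need to include several other variables). Much of the vectorized code on SO only uses information from the given row. Unfortunately, in my case I need to combine information from the given row with the rest of the dataset. Any help would be greatly appreciated. Thanks in advance.
Edit 1:
Explanation of the function City1:
First, it creates a subset that keeps those observations whose City matches the City of the supplied observation. From this subset, it removes those observations whose Business matches the Business of the supplied observation. For example, if the supplied observation has City and Business equal to City1 and A respectively, the subset will contain only those observations with City == City1 and Business not equal to A. The function then returns the maximum of SomeNumber over that subset.
I also need to create other, similar functions for the other cities. But if someone can help me vectorize City1, I can try to do the same for the other functions.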
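To make the description above concrete, here is a minimal sketch that runs the same logic on the small, unreplicated zz. It restates City1 with `!=` instead of `!(... == ...)`, which should be equivalent; the names `keep`, `city1_rows`, and `expected` are only illustrative and do not appear in the question.

```r
# Small, unreplicated version of the question's data
zz <- data.frame(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)

# Same logic as City1() in the question: same city, different business, then max
City1 <- function(full_data, observation) {
  keep <- full_data$City == observation$City &
    full_data$Business != observation$Business
  max(full_data$SomeNumber[keep])
}

# Apply it row by row to the City1 rows only (rows 1, 4, 6, 8)
city1_rows <- which(zz$City == "City1")
expected <- vapply(city1_rows, function(i) City1(zz, zz[i, ]), numeric(1))

# Business B and F see the A value of 40; business A sees the B value of 35
data.frame(zz[city1_rows, ], BusinessMax = expected)
```

So for City1 the result depends only on the (City, Business) pair of the row, which is what makes a grouped or join-based solution possible.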
Edit 2:
As an example, I wrote an alternative function for City == City2:
City2=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)-(10*rnorm(1))
  return(NewSet2)
}
Note that, compared with City1, the function above drops the "!" in NewSet and subtracts 10*rnorm(1) from the value of NewSet2.
Next, I run it only for the observations with City == City2.
i2 <- zz_new[["City"]] == "City2"
for (i in 1:nrow(zz_new[i2,])){
  zz_new[i2,][i,"BusinessMax"]=City2(full_data=zz_new, observation = zz_new[i2,][i,])
}
Here is a fast version that does what your for loop over City1() does. It looks like you will want to do this for every city, so that is what I did.
library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)
# calculate the max for each business, by city (all cities)
biz_max = zzdt[, .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max =
biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
Business != i.Business,
.(BusinessMax = max(i.BusinessMax)),
by = .(City, Business)
]
# join back to the original data
result = zzdt[other_biz_max]
If we only want to apply this to City == "City1", we can filter in the first step and make the final join a full join; the rest stays the same.
library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)
# calculate the max for each business in City1
biz_max = zzdt[City == "City1", .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max =
biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
Business != i.Business,
.(BusinessMax = max(i.BusinessMax)),
by = .(City, Business)
]
# join back to the original data
result = merge(zzdt, other_biz_max, by = c("City", "Business"), all = TRUE)
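Because zz_new already carries a BusinessMax column initialised to 0, the joins above will likely leave two BusinessMax-like columns in result. One possible alternative, sketched here as an assumption rather than as part of the original answer, is a data.table update join that writes the looked-up value straight into the existing column. The data are rebuilt inline so the block runs on its own.

```r
library(data.table)

# Rebuild the question's data directly as a data.table
zz <- data.table(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)
zzdt <- zz[rep(seq_len(nrow(zz)), times = 1000)]  # replicate the rows 1000 times
zzdt[, BusinessMax := 0]                           # same initial value as in the question

# Max per business within City1
biz_max <- zzdt[City == "City1", .(BusinessMax = max(SomeNumber)), by = .(City, Business)]

# For each business, the max over the *other* businesses in the same city
other_biz_max <- biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
  Business != i.Business,
  .(BusinessMax = max(i.BusinessMax)),
  by = .(City, Business)
]

# Update join: overwrite BusinessMax in place for matching rows only;
# City2 rows have no match and keep their initial 0
zzdt[other_biz_max, on = .(City, Business), BusinessMax := i.BusinessMax]
```

After this, zzdt[City == "City1"] should hold the same BusinessMax values as the loop in the question, and no extra columns are created.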
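On my computer the data.table approach takes 0.03 seconds, whereas the approach in your question takes 10.28 seconds, a speedup of roughly 300x. My timing includes the data.table conversion and the key setting, but if you use data.table with that key throughout, the rest of your code can be sped up as well.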
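The City2 variant from Edit 2 is not covered above. A minimal sketch of how it might be handled in the same style, assuming one independent rnorm draw per row is what the loop intends: group by City and Business, take the group maximum, and subtract a vector of .N draws. The random values will not match the loop draw for draw, since the draws are consumed in a different order.

```r
library(data.table)
set.seed(100)

# Rebuild the question's data directly as a data.table
zz <- data.table(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)
zzdt <- zz[rep(seq_len(nrow(zz)), times = 1000)]  # replicate the rows 1000 times
zzdt[, BusinessMax := 0]

# City2 logic: max over rows with the same City and Business,
# minus 10 * one standard-normal draw per row (.N draws per group)
zzdt[City == "City2",
     BusinessMax := max(SomeNumber) - 10 * rnorm(.N),
     by = .(City, Business)]
```

Rows for other cities are left untouched, so this can be combined with a separate step for City1.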