Omitting the rows of a data frame in which the elements are the same
Suppose we have a data frame like this:
DataFrame ref = DataFrame::create( Named("sender") = sender , Named("receiver") = receiver);
The corresponding R code is as follows:
edge <- as.data.frame(edge) %>%
  set_colnames(c("time", "sender", "receiver"))
edge <- rbind(c(0,0,0), edge)
ref <- data.frame(sender = rep(1:n, times = n),
                  receiver = rep(1:n, each = n)
) %>%
  filter(sender != receiver) %>%
  mutate(teller = 1:(n*(n-1)))
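Some rows in this data frame have identical elements, for example 2 2. I want to find those rows and remove them from the data frame. Then I want to add another column to this data frame holding the numbers from 1 to the number of rows of the new data frame.
Example: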
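I think this question could be construed as a duplicate of an existing question, but I'm answering it separately here to demonstrate the point I made in the comments: if you're doing this for performance, Rcpp may not be the way to go for this particular task. That is, there are plenty of tasks where Rcpp is my go-to for improving performance, but subsetting the rows of a data frame isn't one of them.
The code is easy enough to set up, following the approach in the answer I linked: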
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::DataFrame foo(Rcpp::DataFrame x) {
    Rcpp::NumericVector sender = x["sender"];
    Rcpp::NumericVector receiver = x["receiver"];
    Rcpp::LogicalVector indices = sender != receiver;
    return Rcpp::DataFrame::create(Rcpp::Named("sender") = sender[indices],
                                   Rcpp::Named("receiver") = receiver[indices]);
}
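The question also asks for a running row number after filtering, which `foo` above leaves out. Here is a rough sketch of that filter-and-number logic in plain standard C++, deliberately without Rcpp types so it compiles on its own. The `EdgeTable` struct and the `drop_self_pairs` name are made up for illustration; they are not part of Rcpp or of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain-C++ stand-in for a three-column data frame.
struct EdgeTable {
    std::vector<int> sender;
    std::vector<int> receiver;
    std::vector<int> teller;  // 1-based row counter, like mutate(teller = ...)
};

// Keep only the rows where sender != receiver and number the
// surviving rows 1..k in a single pass over the input.
EdgeTable drop_self_pairs(const std::vector<int>& sender,
                          const std::vector<int>& receiver) {
    assert(sender.size() == receiver.size());
    EdgeTable out;
    for (std::size_t i = 0; i < sender.size(); ++i) {
        if (sender[i] != receiver[i]) {
            out.sender.push_back(sender[i]);
            out.receiver.push_back(receiver[i]);
            // The next row number is the count of rows kept so far, plus one.
            out.teller.push_back(static_cast<int>(out.teller.size()) + 1);
        }
    }
    return out;
}
```

Numbering the rows inside the loop means the counter always matches the filtered row count, so there is no need to know n*(n-1) up front.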
However, we can see that this actually performs worse than base R (and data.table can slightly edge out base R's performance):
library(dplyr)
library(Rcpp)
library(microbenchmark)
library(data.table)
sourceCpp("so.cpp")
for ( n in 10^(1:3) ) {
    ref <- data.frame(sender = rep(1:n, times = n),   ## If you're using
                      receiver = rep(1:n, each = n))  ## data frames
    refDT <- setDT(ref)  ## If you're using data.table
    cat("For n =", n, "(a data frame with", nrow(ref), "rows)\n")
    print(microbenchmark(base = ref[ref$sender != ref$receiver, ],
                         dplyr = ref %>% filter(sender != receiver),
                         rcpp = foo(ref),
                         data.table = refDT[sender != receiver]))
    cat("\n")
}
For n = 10 (a data frame with 100 rows)
Unit: microseconds
expr min lq mean median uq max neval
base 123.917 140.0025 160.7615 155.1905 170.7825 302.520 100
dplyr 397.308 430.7595 478.0543 446.9185 492.5705 900.716 100
rcpp 189.473 212.9530 238.8270 223.3305 240.7950 461.452 100
data.table 122.436 135.9185 160.6607 154.0565 166.7825 460.739 100
For n = 100 (a data frame with 10000 rows)
Unit: microseconds
expr min lq mean median uq max neval
base 205.978 224.9760 250.7321 244.3315 265.5060 510.079 100
dplyr 519.276 581.4535 629.2837 615.7095 662.8060 989.698 100
rcpp 369.276 430.3510 463.1586 471.3195 486.4450 736.907 100
data.table 198.012 221.8445 248.9371 246.2385 267.5325 341.935 100
For n = 1000 (a data frame with 1000000 rows)
Unit: milliseconds
expr min lq mean median uq max neval
base 6.535990 6.892702 7.664697 7.203983 7.554144 11.42160 100
dplyr 8.795884 9.239173 10.024997 9.618395 9.992066 15.04914 100
rcpp 15.116928 15.598556 17.164895 16.216766 17.066418 30.45578 100
data.table 6.624728 6.905202 7.543284 7.137171 7.482922 11.67061 100