Counting number of combinations in a dataframe using rcpp

Question

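Given a data.frame with multiple columns, what is the fastest way to count the combinations of values across the columns using Rcpp, rather than R alone, to get better performance?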

For example, consider the following data.frame called df, with columns A, B, C, D, and E:

     A  B  C  D  E
  1  1  1  1  1  2 
  2  1  1  1  1  2
  3  2  2  2  2  3
  4  2  2  2  2  3 
  5  3  3  3  3  1

The expected output is as follows:

     A  B  C  D  E count
  1  1  1  1  1  2     2
  2  2  2  2  2  3     2
  3  3  3  3  3  1     1

In R, this can be done by creating a new column that pastes the existing columns together and then using table to find the counts, i.e.:

df$combine <- do.call(paste, c(df, sep = "-"))
tab <- as.data.frame(table(df$combine))

Since the data massaging and the table command are a bit slow in R, does anyone know a fast way to do the same thing in Rcpp?

Answer 1

Okay, here's one way I could think of.

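First off, we cannot really work row by row with the Rcpp::DataFrame object type within Rcpp, as it is really a loose list of column vectors. So I have lowered the threshold for this problem by creating an Rcpp::NumericMatrix that matches the sample data. From here, one can use a std::map to count unique rows. This is simplified by the fact that Rcpp::NumericMatrix has a .row() member function that enables subsetting by row. Each row is converted to a std::vector<double>, which is used as the key for the map: we look up each std::vector<double> in the std::map and increment its count. Lastly, we export the std::map to the desired matrix format.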
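The counting step can be tried on its own, outside R. The following is a minimal sketch in plain C++ (no Rcpp), assuming the rows have already been copied into std::vector<double> values; the function name count_identical_rows is hypothetical and not part of the answer. It relies on std::vector providing a lexicographic operator<, which is what lets a vector serve as a std::map key.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical helper for illustration: counts how often each row occurs.
// std::map orders its keys with std::vector's built-in lexicographic
// operator<, so no custom comparator is needed.
std::map<std::vector<double>, int>
count_identical_rows(const std::vector<std::vector<double> >& rows)
{
  std::map<std::vector<double>, int> counts;

  for (const std::vector<double>& row : rows) {
    // Assumption: every row has the same width, as in a matrix.
    assert(row.size() == rows.front().size());

    // operator[] value-initializes a missing count to 0 before the increment.
    ++counts[row];
  }

  return counts;
}
```

Because std::map keeps its keys sorted, iterating over the result visits the distinct rows in lexicographic order, which is why the matrix returned by the function below comes back sorted by row values.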

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix unique_rows( Rcpp::NumericMatrix & v)
{

  // Initialize a map
  std::map<std::vector<double>, int> count_rows;

  // Clear map
  count_rows.clear();

  // Count each element
  for (int i = 0; i != v.nrow(); ++i) {
    // Extract row i from the R matrix
    Rcpp::NumericVector a = v.row(i);
    // Convert R vector to STD vector
    std::vector<double> b = Rcpp::as< std::vector<double> >(a);

    // Add to map
    count_rows[ b ] += 1;
  }

  // Make output matrix
  Rcpp::NumericMatrix o(count_rows.size(), v.ncol()+1);

  // Hold count iteration
  unsigned int count = 0;

  // Start at the 1st element and move to the last element in the map.
  for( std::map<std::vector<double>,int>::iterator it = count_rows.begin();
       it != count_rows.end(); ++it )
  {

    // Grab the key of the map (the row values)
    std::vector<double> temp_o = it->first;

    // Tack the count onto the vector; this can probably be sped up.
    temp_o.push_back(it->second);

    // Convert from std::vector to Rcpp::NumericVector
    Rcpp::NumericVector mm = Rcpp::wrap(temp_o);

    // Store in a NumericMatrix
    o.row(count) = mm;

    count++;
  }

  return o;
}

We then run it on the sample data:

a <- matrix(c(1, 1, 1, 1, 2,
              1, 1, 1, 1, 2,
              2, 2, 2, 2, 3,
              2, 2, 2, 2, 3,
              3, 3, 3, 3, 1), ncol = 5, byrow = TRUE)


unique_rows(a)

This gives:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    1    2    2
[2,]    2    2    2    2    3    2
[3,]    3    3    3    3    1    1
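If sorted output is not needed, one possible variation is to swap std::map for std::unordered_map, which replaces the ordered-tree lookups with hashing. This is a different technique from the one in the answer, shown here as a plain C++ sketch under assumptions: the names RowHash and count_rows_hashed are hypothetical, the hash mixing follows the common hash_combine pattern, and no benchmark was run, so whether it is actually faster on a given data set would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hash functor for a row of doubles. It folds the hash of
// each element into a running seed (hash_combine-style mixing).
struct RowHash {
  std::size_t operator()(const std::vector<double>& row) const {
    std::size_t seed = row.size();
    for (double x : row) {
      seed ^= std::hash<double>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// Hypothetical helper: counts identical rows using a hash table.
// The distinct rows come back in unspecified order.
std::unordered_map<std::vector<double>, int, RowHash>
count_rows_hashed(const std::vector<std::vector<double> >& rows)
{
  std::unordered_map<std::vector<double>, int, RowHash> counts;
  counts.reserve(rows.size());  // upper bound on the number of distinct rows

  for (const std::vector<double>& row : rows) {
    // Assumption: every row has the same width, as in a matrix.
    assert(row.size() == rows.front().size());
    ++counts[row];
  }

  return counts;
}
```

The trade-off is that the result is no longer ordered, so the rows would have to be sorted afterwards if the output must match the order shown above.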
