Order of elements in CharacterVector and NumericVector in Rcpp

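I extracted a data.frame from a list (i.e., a list of data.frames), and I want to read one of its columns into a vector for further processing in Rcpp. Since all the elements are numbers, I first tried reading it as a NumericVector. However, the ordering changed. I then tried reading it as a CharacterVector, which preserved the original order.

The original data.frame looks like this: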

       0  1 18 19 31 Freq Prob
   1   1  3 10 10  1    6 0.12
   2   1  5  1  1  1    1 0.02
   3  10  3 10  8 10    2 0.04
   4  10  7 10  9 10    1 0.02
   5  10  9 10 10 10    2 0.04
   6   2  3  2  6  2    1 0.02
   7   3  3  2  2  3    1 0.02

It can be reproduced with:

structure(list(`0` = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 4L), .Label = c("1", "10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), `1` = structure(c(4L, 6L, 4L, 8L, 10L, 4L, 4L), .Label = c("1", "10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), `18` = structure(c(2L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("1", "10", "2", "4", "5", "6", "7", "8", "9"), class = "factor"), `19` = structure(c(2L, 1L, 9L, 10L, 2L, 7L, 3L), .Label = c("1", "10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), `31` = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 4L), .Label = c("1", "10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), Freq = c(6L, 1L, 2L, 1L, 2L, 1L, 1L), Prob = c(0.12, 0.02, 0.04, 0.02, 0.04, 0.02, 0.02)), .Names = c("0", "1", "18", "19", "31", "Freq", "Prob"), row.names = c(NA, 7L), class = "data.frame")

The mode and class of each column are as follows:

   > sapply(Model[[1]], mode)
            0         1        18        19        31      Freq      Prob 
    "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
   > sapply(Model[[1]], class)
           0         1        18        19        31      Freq      Prob 
    "factor"  "factor"  "factor"  "factor"  "factor" "integer" "numeric" 

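Note: in each output, the first line shows the data.frame's column names and the second line shows the result of the applied function.

The Rcpp code that reads a column into both a CharacterVector and a NumericVector is: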

  #include <Rcpp.h>
  using namespace Rcpp;

  // x is the data frame, idx is the 0-based index of the column to read
  // [[Rcpp::export]]
  int dataframe1(DataFrame& x, int idx) {
      Rcpp::CharacterVector columnChar = x[idx];
      Rcpp::NumericVector columnNum = x[idx];
      Rcpp::Rcout << columnChar << std::endl;
      Rcpp::Rcout << columnNum << std::endl;
      return 0;
  }

The output is as follows, for example for column 1 in R, which is index 0 in Rcpp:

 dataframe1(Model[[1]],0)
 "1" "1" "10" "10" "10" "2" "3" "3" "3" "4" "4" "5" "5" "5" "6" "6" "6" "6" "6" "7" "7" "7" "8" "8" "9"
 1 1 2 2 2 3 4 4 4 5 5 6 6 6 7 7 7 7 7 8 8 8 9 9 10

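As you can see, the two vectors are ordered differently, and the NumericVector one appears sorted. This only happens with factor columns; the integer and numeric columns are fine.

So the question is: how can I preserve the order when reading a factor into a NumericVector in Rcpp?

Thanks.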

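Rcpp has only limited support for R's factor type. A factor is stored as a vector of integer codes plus a "levels" attribute holding the labels, so what you are passing to the NumericVector is the integer code associated with each level.

That is the reason for the difference: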

Rcpp::Rcout << columnChar << std::endl; // reading from factor label
Rcpp::Rcout << columnNum << std::endl; // reading from id associated with factor label

Edit

To understand what is happening, consider:

set.seed(133)
x = sample(1:10, 10, replace = F)
x

This gives:

 [1]  6  8 10  3  2  4  7  9  5  1

These are plain numeric values.

Now consider a factor:

xf = factor(x, labels = 11:20)

xf

This gives:

[1] 16 18 20 13 12 14 17 19 15 11
Levels: 11 12 13 14 15 16 17 18 19 20

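Note that the values of x are no longer shown. They are masked by character labels running from 11 to 20. This is why you see repeated 1s and 2s in the numeric output but 1s and 10s in the character output: in your data the labels are sorted as strings ("1", "10", "2", ...), so the label "10" carries the integer code 2.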
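The label-to-code mapping described above can be sketched in plain C++, without Rcpp. This is only an illustration: FactorSketch and make_factor are hypothetical names, not Rcpp types, and the sketch assumes that levels are the unique values sorted in plain byte-wise string order, which matches what R does for labels made of digits.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an R factor (not an Rcpp type):
// 1-based integer codes plus the label that each code refers to.
struct FactorSketch {
    std::vector<int> codes;
    std::vector<std::string> levels;
};

// Assumption: levels are the sorted unique input strings,
// and each code is the 1-based position of a value among the levels.
FactorSketch make_factor(const std::vector<std::string>& values) {
    FactorSketch f;
    f.levels = values;
    std::sort(f.levels.begin(), f.levels.end());  // string order: "1" < "10" < "2"
    f.levels.erase(std::unique(f.levels.begin(), f.levels.end()),
                   f.levels.end());
    for (const std::string& v : values) {
        // Binary search is valid because levels are sorted and contain v.
        auto pos = std::lower_bound(f.levels.begin(), f.levels.end(), v);
        f.codes.push_back(static_cast<int>(pos - f.levels.begin()) + 1);
    }
    return f;
}
```

Called with the first column of the question's data.frame ("1", "1", "10", "10", "10", "2", "3"), this yields the levels "1", "10", "2", "3" and the codes 1 1 2 2 2 3 4, the same codes that appear in the structure() output in the question.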

Next, if we convert to numeric, we have:

as.numeric(xf)

This gives:

[1]  6  8 10  3  2  4  7  9  5  1

that is, the integer codes, which here are the original values from before "factorizing".

To get the actual level values:

as.numeric(as.character(xf))

Returns:

[1] 16 18 20 13 12 14 17 19 15 11

Edit 2:

With that in mind, let's modify the original function:

#include <Rcpp.h>

// [[Rcpp::export]]
void dataframe_factors(Rcpp::DataFrame& x) { 
  Rcpp::CharacterVector factor_name = x[0];
  Rcpp::NumericVector factor_id = x[0];
  Rcpp::NumericVector numeric_val = x[1];
  Rcpp::Rcout << "FN: " << factor_name << std::endl;
  Rcpp::Rcout << "FID: " << factor_id << std::endl;

  // Numeric
  Rcpp::Rcout << "ORG: " << numeric_val << std::endl;

}


/*** R
set.seed(133)
x = sample(1:10, 10, replace = F)

xf = factor(x, labels = 11:20)

d = data.frame(xf, x)

dataframe_factors(d)
*/

This gives:

FN: "16" "18" "20" "13" "12" "14" "17" "19" "15" "11"
FID: 6 8 10 3 2 4 7 9 5 1
ORG: 6 8 10 3 2 4 7 9 5 1
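To get the values behind the labels on the C++ side, one option is to decode each integer code through the level labels, which mirrors as.numeric(as.character(xf)). The following is a minimal sketch in plain C++ rather than Rcpp, so that it compiles without R; factor_values is a hypothetical helper name, and it assumes every label parses as a number.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: turn 1-based factor codes into the numeric values
// spelled out by the level labels, keeping the original element order.
std::vector<double> factor_values(const std::vector<int>& codes,
                                  const std::vector<std::string>& levels) {
    std::vector<double> out;
    out.reserve(codes.size());
    for (int code : codes) {
        // Codes outside 1..levels.size() are rejected (this includes
        // R's integer NA, which is a large negative number).
        if (code < 1 || static_cast<std::size_t>(code) > levels.size()) {
            throw std::out_of_range("factor code outside 1..nlevels");
        }
        // std::stod throws std::invalid_argument for a non-numeric label.
        out.push_back(std::stod(levels[code - 1]));
    }
    return out;
}
```

With the codes 6 8 10 3 2 4 7 9 5 1 and the levels "11" through "20" from the example, this returns 16 18 20 13 12 14 17 19 15 11. In an Rcpp function the codes would typically come from an Rcpp::IntegerVector built from the column, and the labels from its "levels" attribute (for example via attr("levels")); that wiring is left out here.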