Order of elements in CharacterVector and NumericVector in Rcpp
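I have a data.frame extracted from a list (i.e. a list of data.frames), and I want to read one of its columns into a vector in Rcpp for further manipulation. Since all the elements are numbers, I first tried to read it as a NumericVector. However, the values came out changed. I then tried to read it as a CharacterVector, which preserves the original order.
The original data.frame looks like this: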
0 1 18 19 31 Freq Prob
1 1 3 10 10 1 6 0.12
2 1 5 1 1 1 1 0.02
3 10 3 10 8 10 2 0.04
4 10 7 10 9 10 1 0.02
5 10 9 10 10 10 2 0.04
6 2 3 2 6 2 1 0.02
7 3 3 2 2 3 1 0.02
Which is given by:
structure(list(`0` = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 4L), .Label = c("1",
"10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"),
`1` = structure(c(4L, 6L, 4L, 8L, 10L, 4L, 4L), .Label = c("1",
"10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"),
`18` = structure(c(2L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("1",
"10", "2", "4", "5", "6", "7", "8", "9"), class = "factor"),
`19` = structure(c(2L, 1L, 9L, 10L, 2L, 7L, 3L), .Label = c("1",
"10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"),
`31` = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 4L), .Label = c("1",
"10", "2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"),
Freq = c(6L, 1L, 2L, 1L, 2L, 1L, 1L), Prob = c(0.12, 0.02,
0.04, 0.02, 0.04, 0.02, 0.02)), .Names = c("0", "1", "18",
"19", "31", "Freq", "Prob"), row.names = c(NA, 7L), class = "data.frame")
The mode and class of each column are as follows:
> sapply(Model[[1]], mode)
0 1 18 19 31 Freq Prob
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> sapply(Model[[1]], class)
0 1 18 19 31 Freq Prob
"factor" "factor" "factor" "factor" "factor" "integer" "numeric"
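Note: in each output, the first row lists the column names of the data.frame and the second row is the result of the applied function.
The Rcpp code that reads a column into a CharacterVector and a NumericVector is as follows: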
#include <Rcpp.h>
// x is the data frame; idx is the zero-based index of the column to read
// [[Rcpp::export]]
int dataframe1(Rcpp::DataFrame& x, int idx) {
  Rcpp::CharacterVector columnChar = x[idx];
  Rcpp::NumericVector columnNum = x[idx];
  Rcpp::Rcout << columnChar << std::endl;
  Rcpp::Rcout << columnNum << std::endl;
  return 0;
}
The output is as follows, for the column with index 1 in R, i.e. index 0 in Rcpp:
dataframe1(Model[[1]],0)
"1" "1" "10" "10" "10" "2" "3" "3" "3" "4" "4" "5" "5" "5" "6" "6" "6" "6" "6" "7" "7" "7" "8" "8" "9"
1 1 2 2 2 3 4 4 4 5 5 6 6 6 7 7 7 7 7 8 8 8 9 9 10
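As you can see, the order of the two vectors differs, and the NumericVector appears to be sorted. This only happens with the factor columns; the integer and numeric columns are fine.
So the question is: how can I preserve the order when reading a factor into a NumericVector in Rcpp?
Thanks.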
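Rcpp has limited support for the internal representation of a factor. A factor is stored as integer codes plus a levels attribute holding the labels, so you have to pass in the integer value associated with each factor level yourself, ahead of time.
This is the reason for the difference: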
Rcpp::Rcout << columnChar << std::endl; // reading from factor label
Rcpp::Rcout << columnNum << std::endl; // reading from id associated with factor label
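To make the label/id distinction concrete, here is a minimal sketch in plain standard C++ (no Rcpp) that models a factor as 1-based integer codes plus a table of level labels. The `Factor` struct and `factor_labels` function are hypothetical names invented for illustration; they are not part of Rcpp. The sketch assumes every code is valid and does not handle NA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an R factor (an assumption for illustration):
// 1-based integer codes plus the level labels they index into.
struct Factor {
    std::vector<int> codes;
    std::vector<std::string> levels;
};

// Look up the label behind each code, i.e. what the character view shows.
// Assumes all codes are in [1, levels.size()]; NA is not handled.
std::vector<std::string> factor_labels(const Factor& f) {
    std::vector<std::string> out;
    out.reserve(f.codes.size());
    for (int code : f.codes) {
        assert(code >= 1 && static_cast<std::size_t>(code) <= f.levels.size());
        out.push_back(f.levels[static_cast<std::size_t>(code) - 1]);
    }
    return out;
}
```

With the codes and levels of column `0` from the question's `structure(...)` dump (codes 1, 1, 2, 2, 2, 3, 4 and levels "1", "10", "2", "3", ...), the labels come out as "1", "1", "10", "10", "10", "2", "3", while the codes themselves are what a numeric read prints.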
Edit
To understand what is happening, consider:
set.seed(133)
x = sample(1:10, 10, replace = F)
x
Gives:
[1] 6 8 10 3 2 4 7 9 5 1
This is purely numeric.
Now, consider a factor:
xf = factor(x, labels = 11:20)
xf
Gives:
[1] 16 18 20 13 12 14 17 19 15 11
Levels: 11 12 13 14 15 16 17 18 19 20
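Note: the values of x are no longer shown. Instead, they are masked by character labels ranging from 11 to 20. This is why your numeric output shows repeated 1s and 2s, while your character output shows 1s and 10s.
Next, if we convert to numeric, we have: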
as.numeric(xf)
Gives:
[1] 6 8 10 3 2 4 7 9 5 1
That is, the original values before "factorizing".
To get the actual levels:
as.numeric(as.character(xf))
Returns:
[1] 16 18 20 13 12 14 17 19 15 11
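The same two-step conversion, label lookup followed by parsing, can be sketched in plain standard C++. This is an illustration under stated assumptions rather than Rcpp API: `factor_to_numeric` is a hypothetical helper, the codes are assumed to be valid 1-based indices (no NA), and every label is assumed to parse as a number. In an Rcpp function, the codes could presumably be obtained by reading the column as an `IntegerVector` and the labels from its `levels` attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mirroring as.numeric(as.character(xf)):
// map each 1-based code to its label, then parse the label as a double.
// Assumes valid codes (no NA) and labels that std::stod can parse.
std::vector<double> factor_to_numeric(const std::vector<int>& codes,
                                      const std::vector<std::string>& levels) {
    std::vector<double> out;
    out.reserve(codes.size());
    for (int code : codes) {
        assert(code >= 1 && static_cast<std::size_t>(code) <= levels.size());
        out.push_back(std::stod(levels[static_cast<std::size_t>(code) - 1]));
    }
    return out;
}
```

Called with the codes 6, 8, 10, 3, 2, 4, 7, 9, 5, 1 and the levels "11" through "20" from the example, it yields 16, 18, 20, 13, 12, 14, 17, 19, 15, 11, matching the R output.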
Edit 2:
With this in mind, let's modify the original function:
#include <Rcpp.h>
// [[Rcpp::export]]
void dataframe_factors(Rcpp::DataFrame& x) {
Rcpp::CharacterVector factor_name = x[0];
Rcpp::NumericVector factor_id = x[0];
Rcpp::NumericVector numeric_val = x[1];
Rcpp::Rcout << "FN: " << factor_name << std::endl;
Rcpp::Rcout << "FID: " << factor_id << std::endl;
// Numeric
Rcpp::Rcout << "ORG: " << numeric_val << std::endl;
}
/*** R
set.seed(133)
x = sample(1:10, 10, replace = F)
xf = factor(x, labels = 11:20)
d = data.frame(xf, x)
dataframe_factors(d)
*/
Gives:
FN: "16" "18" "20" "13" "12" "14" "17" "19" "15" "11"
FID: 6 8 10 3 2 4 7 9 5 1
ORG: 6 8 10 3 2 4 7 9 5 1