Subset NumericMatrix by row and column names in Rcpp

Question

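I am trying to create a function in Rcpp that takes as input a pairwise numeric matrix and a list of vectors, each element being a subset of the row/column names. I would like this function to identify the subset of the matrix that matches those names and return the mean of the values.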

Below I generate some dummy data that resembles the kind of data I have, followed by my attempt at an Rcpp function.

library(Rcpp)

dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

Here I have a list made up of character vectors, each a subset of the row/column names in pdist:

subsetList <- replicate(10, sample(names(dat), 4), simplify=FALSE)

For each set of names, I want to identify the corresponding subset of the pairwise matrix and take the mean of its values.

Here is what I have so far. It does not work, but I think it illustrates what I am trying to achieve.

cppFunction('
    List meanDistByCell(List input, NumericMatrix pairmat) {
    int n = input.size();
    List out(n);

    List dimnames = pairmat.attr( "dimnames" );
    CharacterVector colnames = dimnames[1];

    for (int i = 0; i < n; i++) {

        CharacterVector sp = as< CharacterVector >(input[i]);

        if (sp.size() > 0) {
            out[i] = double(mean(pairmat(sp, sp)));
        } else {
            out[i] = NA_REAL;
        }
    }
    return out;
}
')

Any help would be greatly appreciated! Thanks!

Answer 1

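Although (contiguous) range-based subsetting is available (e.g. x(Range(first_row, last_row), Range(first_col, last_col))), as coatless pointed out, subsetting by CharacterVector is not currently supported, so you will have to roll your own for the time being. A general-ish approach might look something like this: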

#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE> inline Matrix<RTYPE>
Subset2D(const Matrix<RTYPE>& x, CharacterVector crows, CharacterVector ccols) {
    R_xlen_t i = 0, j = 0, rr = crows.length(), rc = ccols.length(), pos;
    Matrix<RTYPE> res(rr, rc);

    CharacterVector xrows = rownames(x), xcols = colnames(x);
    IntegerVector rows = match(crows, xrows), cols = match(ccols, xcols);

    for (; j < rc; j++) {
        // NB: match returns 1-based indices
        pos = cols[j] - 1;
        for (i = 0; i < rr; i++) {
            res(i, j) = x(rows[i] - 1, pos);
        }
    }

    rownames(res) = crows;
    colnames(res) = ccols;

    return res;
}

// [[Rcpp::export]]
NumericMatrix subset2d(NumericMatrix x, CharacterVector rows, CharacterVector cols) {
    return Subset2D(x, rows, cols);
}

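This assumes that the input matrix has both row names and column names, and that the row and column lookup vectors are valid subsets of those dimnames; additional defensive code could be added to make it more robust. To demonstrate,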

subset2d(pdist, subsetList[[1]], subsetList[[1]])
#     spB spD spE spC
# spB  NA  NA  NA  NA
# spD   9  NA  NA   7
# spE   5   4  NA   3
# spC   2  NA  NA  NA

pdist[subsetList[[1]], subsetList[[1]]]
#     spB spD spE spC
# spB  NA  NA  NA  NA
# spD   9  NA  NA   7
# spE   5   4  NA   3
# spC   2  NA  NA  NA
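
The defensive checks mentioned above could take roughly the following shape. This is only a sketch in plain standard C++ (no Rcpp dependency), so the types are std::vector and std::string rather than CharacterVector. The helper name match_or_throw is hypothetical and not part of Rcpp. It returns zero-based positions and raises an error instead of silently producing an invalid index when a requested name is missing.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of Rcpp): map each requested name to its
// zero-based position in `names`, throwing if the lookup cannot succeed.
std::vector<std::size_t> match_or_throw(const std::vector<std::string>& keys,
                                        const std::vector<std::string>& names) {
    if (names.empty()) {
        throw std::invalid_argument("matrix has no dimnames");
    }

    // Build a name -> position table; emplace keeps the first occurrence
    // of a duplicated name, which mirrors how match() resolves duplicates.
    std::unordered_map<std::string, std::size_t> pos;
    for (std::size_t i = 0; i < names.size(); ++i) {
        pos.emplace(names[i], i);
    }

    std::vector<std::size_t> out;
    out.reserve(keys.size());
    for (const std::string& k : keys) {
        auto it = pos.find(k);
        if (it == pos.end()) {
            throw std::out_of_range("name not found in dimnames: " + k);
        }
        out.push_back(it->second);
    }
    return out;
}
```

In an Rcpp version, the same idea would presumably amount to checking the result of match for NA before indexing and calling stop() on failure; that adaptation is an assumption and is not shown in the answer.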

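Subset2D takes care of most of the boilerplate involved in implementing meanDistByCell; all that remains is to loop over the input list, apply it to each list element, and store the mean of the result in the output list: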

// [[Rcpp::export]]
List meanDistByCell(List keys, NumericMatrix x, bool na_rm = false) {
    R_xlen_t i = 0, sz = keys.size();
    List res(sz);

    if (!na_rm) {
        for (; i < sz; i++) {
            res[i] = NumericVector::create(
                mean(Subset2D(x, keys[i], keys[i]))
            );
        }
    } else {
        for (; i < sz; i++) {
            res[i] = NumericVector::create(
                mean(na_omit(Subset2D(x, keys[i], keys[i])))
            );
        }
    }

    return res;
}

all.equal(
    lapply(subsetList, function(x) mean(pdist[x, x], na.rm = TRUE)),
    meanDistByCell(subsetList, pdist, TRUE)
)
# [1] TRUE

Although using Subset2D allows for a much cleaner implementation of meanDistByCell, in this situation it is inefficient for at least a couple of reasons:

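It sets the dimnames of the return object (rownames(res) = crows;, colnames(res) = ccols;), which you have no need for here.
It calls match to obtain the indices for each of rownames and colnames, which is unnecessary since you know in advance that rownames(x) == colnames(x).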

For an input list of length k, you will incur the cost of both of these points k times.

A more efficient, but consequently less concise, approach is to implement only the parts of Subset2D that are actually needed, inline within meanDistByCell:

// [[Rcpp::export]]
List meanDistByCell2(List keys, NumericMatrix x, bool na_rm = false) {
    R_xlen_t k = 0, sz = keys.size(), i = 0, j = 0, nidx, pos;
    List res(sz);
    CharacterVector cx = colnames(x);

    if (!na_rm) {
        for (; k < sz; k++) {
            // NB: match returns 1-based indices
            IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
            nidx = idx.size();
            NumericVector tmp(nidx * nidx);

            for (j = 0; j < nidx; j++) {
                pos = idx[j];
                for (i = 0; i < nidx; i++) {
                    tmp[nidx * j + i] = x(idx[i], pos);
                }
            }

            res[k] = NumericVector::create(mean(tmp));
        }
    } else {
        for (; k < sz; k++) {
            IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
            nidx = idx.size();
            NumericVector tmp(nidx * nidx);

            for (j = 0; j < nidx; j++) {
                pos = idx[j];
                for (i = 0; i < nidx; i++) {
                    tmp[nidx * j + i] = x(idx[i], pos);
                }
            }

            res[k] = NumericVector::create(mean(na_omit(tmp)));
        }
    }

    return res;
}

all.equal(
    meanDistByCell(subsetList, pdist, TRUE),
    meanDistByCell2(subsetList, pdist, TRUE)
)
# [1] TRUE
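
The core of the inlined approach (look up the indices once, then average the selected cells) can also be sketched without the temporary vector. The following is plain standard C++ with no Rcpp dependency, so it is an illustration of the idea rather than a drop-in replacement for meanDistByCell2. The function name subset_mean is hypothetical, the matrix is assumed to be square and stored column-major as R does, and NaN stands in for NA.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Mean of the sub-matrix x[idx, idx], where x is an n-by-n matrix stored
// column-major and idx holds zero-based positions (the same positions are
// used for rows and columns, since row names equal column names).
// NaN stands in for NA; when na_rm is true, NaN cells are skipped.
double subset_mean(const std::vector<double>& x, std::size_t n,
                   const std::vector<std::size_t>& idx, bool na_rm) {
    double sum = 0.0;
    std::size_t count = 0;

    for (std::size_t j : idx) {        // selected columns
        for (std::size_t i : idx) {    // selected rows
            const double v = x[j * n + i];
            if (na_rm && std::isnan(v)) {
                continue;
            }
            sum += v;
            ++count;
        }
    }

    // No usable cells: return NaN rather than dividing by zero.
    if (count == 0) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    return sum / static_cast<double>(count);
}
```

With na_rm set to false, any NaN among the selected cells propagates into the sum, so the result is NaN, which is analogous to the NA that the unfiltered mean yields for pdist.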
