Cython

Question

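I'm trying to extend the Splitter class in sklearn, which works with sklearn's decision tree classes. More specifically, I want to add a feature_weights variable to the new class. It would affect how the best split point is chosen by scaling the purity calculation in proportion to the feature weights.

The new class is an almost exact copy of sklearn's BestSplitter class (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pyx), with only minor changes. This is what I have so far: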

cdef class WeightedBestSplitter(WeightedBaseDenseSplitter):

    cdef object feature_weights # new variable - 1D array of feature weights

    def __reduce__(self):
        # same as sklearn BestSplitter (basically)

    # NEW METHOD
    def set_weights(self, object feature_weights): 
        feature_weights = np.asfortranarray(feature_weights, dtype=DTYPE)
        self.feature_weights = feature_weights  

    cdef int node_split(self, double impurity, SplitRecord* split,
                        SIZE_t* n_constant_features) nogil except -1:

        # .... same as sklearn BestSplitter ....

        current_proxy_improvement = self.criterion.proxy_impurity_improvement()
        current_proxy_improvement *= self.feature_weights[<int>(current.feature)]  # new line

        # .... same as sklearn BestSplitter ....

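A couple of notes on the above: I'm using the object variable type and np.asfortranarray because that is how the variable X is defined and set elsewhere, and X is indexed the same way I'm trying to index feature_weights. Also, current.feature has a variable type of SIZE_t, per the _splitter.pxd file (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pxd).

The problem seems to be caused by the variable type of self.feature_weights. The code above throws multiple errors, but even referencing something like self.feature_weights[0] and assigning it to another variable throws this error: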

Indexing Python object not allowed without gil

I'd like to know what I need to do to be able to index self.feature_weights and use the scalar value as a multiplier.

Answer 1

You definitely can't index a generic Python object without the GIL (as you're trying to do). You can, however, index a typed memoryview without the GIL.

Define feature_weights as

cdef double[:] feature_weights
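To make the change concrete, here is a minimal sketch of how the pieces might fit together. It is a standalone toy class, not sklearn's real Splitter hierarchy: the class name WeightedExample and the methods weighted and apply are made up for illustration. One assumption to check against your sklearn version: as far as I know, DTYPE in sklearn's tree module is float32, so an array created with dtype=DTYPE would not match a double[:] memoryview. The sketch converts to float64 instead; alternatively you could declare the attribute with the matching C type.

```cython
# Illustrative sketch only: WeightedExample, weighted and apply are
# hypothetical names, not part of sklearn.
import numpy as np

cdef class WeightedExample:
    # Typed memoryview attribute: element access compiles to C-level
    # indexing, so it does not need the GIL.
    cdef double[:] feature_weights

    def set_weights(self, object feature_weights):
        # float64 matches the C `double` element type declared above.
        # Assigning a float32 array here would fail with a buffer dtype
        # mismatch error.
        self.feature_weights = np.ascontiguousarray(feature_weights,
                                                    dtype=np.float64)

    cdef double weighted(self, double improvement, Py_ssize_t feature) nogil:
        # Indexing a typed memoryview is allowed inside a nogil function.
        return improvement * self.feature_weights[feature]

    def apply(self, double improvement, Py_ssize_t feature):
        # Small Python-visible wrapper so the nogil path can be exercised.
        cdef double result
        with nogil:
            result = self.weighted(improvement, feature)
        return result
```

In your node_split, the new line would then stay as it is, since self.feature_weights[current.feature] is now a C-level lookup. Note that set_weights has to be called before the first split, because the memoryview starts out uninitialized.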

Cython - Indexing numpy array within nogil function

python

scikit-learn