Adding pandas columns to a sparse matrix

Question

I have additional derived values for my X variables that I want to use in my model.

XAll = pd_data[['title','wordcount','sumscores','length']]
y = pd_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(XAll, y, random_state=1)

Since the title column contains text data, I first convert it separately into a document-term matrix (DTM):

vect = CountVectorizer(max_df=0.5)
vect.fit(X_train['title'])
X_train_dtm = vect.transform(X_train['title'])
column_index = X_train_dtm.indices

print(type(X_train_dtm))    # This is <class 'scipy.sparse.csr.csr_matrix'>
print("X_train_dtm shape",X_train_dtm.get_shape())  # This is (856, 2016)
print("column index:",column_index)     # This is column index: [ 533  754  859 ...,  633  950 1339]

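Now that I have the text as a document-term matrix, I would like to add the other features, such as 'wordcount', 'sumscores', and 'length', which are numeric, to X_train_dtm. I will then build the model on the new DTM, which I expect to be more accurate because it includes the additional features.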

How can I add the extra numeric columns of a pandas DataFrame to a sparse CSR matrix?

Answer 1

Found a solution. We can use scipy.sparse.hstack:

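import numpy as np
from scipy.sparse import hstack
# [:, None] reshapes the 1-D column to (n_samples, 1) so it can be stacked column-wise
X_train_dtm = hstack((X_train_dtm, np.array(X_train['wordcount'])[:, None]))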
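
The same idea extends to appending all three numeric columns in one call. The following is a minimal sketch, not the asker's actual data: the small `dtm` matrix and the `extra` DataFrame are made-up stand-ins for X_train_dtm and X_train, and `numeric_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Toy stand-in for X_train_dtm: 3 documents, 4 vocabulary terms.
dtm = csr_matrix(np.array([[1, 0, 2, 0],
                           [0, 1, 0, 0],
                           [0, 0, 0, 3]]))

# Toy stand-in for the numeric columns of X_train (values are made up).
extra = pd.DataFrame({
    "wordcount": [5, 3, 7],
    "sumscores": [0.5, -1.0, 2.0],
    "length": [20, 12, 31],
})

numeric_cols = ["wordcount", "sumscores", "length"]

# Convert the numeric block to a sparse matrix and append it column-wise.
# format="csr" pins the result to CSR instead of relying on hstack's default.
numeric_block = csr_matrix(extra[numeric_cols].to_numpy(dtype=float))
combined = hstack([dtm, numeric_block], format="csr")

print(combined.shape)  # 3 rows, 4 term columns + 3 numeric columns
```

The test matrix would need the same treatment: transform X_test['title'] with the already fitted vectorizer, then stack the same numeric columns in the same order, so that the train and test matrices end up with identical column layouts.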

python

pandas

scikit-learn

sklearn-pandas