Random Forest: How to add more features to a sparse matrix, and identify the items in feature importance?

I need to use the features generated by bag-of-words (BOW) together with additional features (e.g. Grp and Rating) in a Random Forest model.

  1. Since BOW produces a sparse matrix, how can I add the extra features to create a new sparse matrix? Currently I convert the sparse matrix to a dense array and concatenate the extra features to build a DataFrame (df2 below). Is there a way to add the extra features directly to the BOW sparse matrix?

  2. If we use the sparse matrix as X_train, how do I identify which feature each importance value corresponds to? Currently I am using the columns of df2.

Thanks

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


bards_words =["The fool doth think he is wise,",
"man fool"]

vect = CountVectorizer()
bow = vect.fit_transform(bards_words)

# Invert the vocabulary (term -> index) so that column indices map back to terms
vocab = vect.vocabulary_
new_vocab = dict([(value, key) for key, value in vocab.items()])

# Dense DataFrame of the BOW counts, with columns renamed to the vocabulary terms
df0 = pd.DataFrame(bow.toarray())
df0.rename(columns=new_vocab, inplace=True)

df1 = pd.DataFrame({'Grp': ['3' , '10'],
                   'Rating': ['1', '2']
                   })



# Concatenate the dense BOW columns with the extra features
df2 = pd.concat([df0, df1], axis=1)

X_train = df2.values

# y_train (the training labels) is assumed to be defined elsewhere
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(X_train, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index=df2.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

Just use the sparse data structures. At the moment you go from a sparse matrix, to a dense array, to a DataFrame, to another concatenated DataFrame, and then back to a dense array. That is not efficient.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from scipy import sparse
import numpy as np
import pandas as pd

bards_words =["The fool doth think he is wise,",
"man fool"]

df1 = pd.DataFrame({'Grp': ['3' , '10'],
                   'Rating': ['1', '2']
                   })

vect = CountVectorizer()
bow = vect.fit_transform(bards_words)

# Stack the two df1 columns onto the left of the sparse matrix
bow = sparse.hstack((sparse.csr_matrix(df1.astype(int).values), bow))

# Keep track of feature names (on scikit-learn >= 1.0 use vect.get_feature_names_out())
features = np.concatenate((df1.columns.values, vect.get_feature_names()))

>>> features
array(['Grp', 'Rating', 'doth', 'fool', 'he', 'is', 'man', 'the', 'think',
       'wise'], dtype=object)

>>> bow.A
array([[ 3,  1,  1,  1,  1,  1,  0,  1,  1,  1],
       [10,  2,  0,  1,  0,  0,  1,  0,  0,  0]])

# Do your random forest
# Do your random forest (y_train is again assumed to be defined elsewhere;
# RandomForestClassifier accepts the sparse matrix directly)
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(bow, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index=features,
                                   columns=['importance']).sort_values('importance', ascending=False)
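
For prediction on unseen data, the same transformation has to be applied in the same column order. Below is a minimal sketch assuming the vectorizer and forest fitted above; new_texts and new_df are hypothetical examples, not part of the original question.

# Hypothetical unseen data (not from the original question)
new_texts = ["the wise man"]
new_df = pd.DataFrame({'Grp': ['7'], 'Rating': ['2']})

# Use transform (not fit_transform) so the columns line up with the training vocabulary,
# then stack the extra features on the left, exactly as during training
new_bow = vect.transform(new_texts)
X_new = sparse.hstack((sparse.csr_matrix(new_df.astype(int).values), new_bow))

predictions = forest.predict(X_new)

# feature_importances is indexed by feature name, so the most important terms read off directly
print(feature_importances.head())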