Scikit learn Naive Bayes ValueError: dimension mismatch
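I am working with a Naive Bayes classifier in scikit-learn.
In both the training and the prediction phase, I use the following code to build a csr_matrix from a list of tuples: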
import logging
from itertools import chain

from scipy.sparse import csr_matrix

logger = logging.getLogger(__name__)


def convert_to_csr_matrix(vectors):
    """
    convert list of tuples representation to scipy csr_matrix that is needed
    for scikit learner
    """
    logger.info("building the csr_sparse matrix representing tf-idf")
    # one row index per (column, value) pair, repeated for each vector
    row = [[i] * len(v) for i, v in enumerate(vectors)]
    row = list(chain(*row))
    column = [j for j, _ in chain(*vectors)]
    data = [d for _, d in chain(*vectors)]
    return csr_matrix((data, (row, column)))
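I based this implementation mainly on an external reference.
Unfortunately, I now get the following error in the prediction phase: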
  File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 93, in predict
    top_predictions = self.top.predict(item)
  File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 124, in predict
    category, res = model.predict(item)
  File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 176, in predict
    prediction = self.clf.predict(item)
  File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 64, in predict
    jll = self._joint_log_likelihood(X)
  File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 615, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T)
  File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 178, in safe_sparse_dot
    ret = a * b
  File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/scipy/sparse/base.py", line 354, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
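Does anyone know what is going wrong? My guess is that the sparse vector somehow has the wrong dimension, but I don't understand why.
While debugging, I printed feature_log_prob_ from the Naive Bayes model to the log, and it looks like this: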
[[-11.82052115 -12.51735721 -12.51735721 ..., -12.51735721 -11.60489688
-12.2132116 ]
[-12.21403023 -12.51130295 -12.51130295 ..., -11.84156341 -12.51130295
-12.51130295]]
and its shape is (2, 53961).
My prediction csr_matrix is:
(0, 7637) 0.770238101052
(0, 21849) 0.637756432886
Represented as a list of tuples, it looks like this: [(7637, 0.7702381010520318), (21849, 0.6377564328862234)]
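One way to test the guess about dimensions is to compare the shape of the prediction matrix with the number of features the model expects. The following is a minimal sketch, assuming only scipy is installed and reusing the two entries printed above; the variable names are illustrative and not part of the original code.

```python
from scipy.sparse import csr_matrix

# The single prediction vector from above, as (column, value) pairs.
vector = [(7637, 0.7702381010520318), (21849, 0.6377564328862234)]

rows = [0] * len(vector)
cols = [j for j, _ in vector]
vals = [d for _, d in vector]

# No shape argument: scipy infers it from the largest row/column index.
item = csr_matrix((vals, (rows, cols)))

n_features_trained = 53961  # second dimension of feature_log_prob_
print(item.shape)                           # (1, 21850)
print(item.shape[1] == n_features_trained)  # False
```

The column count follows the largest index that happens to occur in the data, so it will generally differ between the training matrix and a single prediction vector.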
After some investigation into the problem, I realized that a possible fix is the following. In this function:
def convert_to_csr_matrix(vectors):
    """
    convert list of tuples representation to scipy csr_matrix that is needed
    for scikit learner
    """
    logger.info("building the csr_sparse matrix representing tf-idf")
    row = [[i] * len(v) for i, v in enumerate(vectors)]
    row = list(chain(*row))
    column = [j for j, _ in chain(*vectors)]
    data = [d for _, d in chain(*vectors)]
    return csr_matrix((data, (row, column)))
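the line return csr_matrix((data, (row, column)))
should be replaced with return csr_matrix((data, (row, column)), shape=(len(vectors), dimension))
Here dimension must be the number of features the model was trained on (53961 above). Without an explicit shape, csr_matrix infers the number of columns from the largest column index present, so the prediction matrix ends up with shape (1, 21850) instead of (1, 53961), and the product with feature_log_prob_.T fails.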
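Putting the change together, a sketch of the corrected function might look like the following. The dimension parameter is an assumption of this example: it stands for the number of features the classifier was trained on and has to be supplied by the caller. The logger call is left out to keep the example self-contained, and the weights array is a made-up stand-in for feature_log_prob_.

```python
from itertools import chain

import numpy as np
from scipy.sparse import csr_matrix


def convert_to_csr_matrix(vectors, dimension):
    """
    Convert a list of [(column, value), ...] vectors to a csr_matrix
    with a fixed number of columns.
    """
    row = list(chain(*[[i] * len(v) for i, v in enumerate(vectors)]))
    column = [j for j, _ in chain(*vectors)]
    data = [d for _, d in chain(*vectors)]
    # Explicit shape: the column count no longer depends on the data.
    return csr_matrix((data, (row, column)), shape=(len(vectors), dimension))


n_features = 53961
item = convert_to_csr_matrix(
    [[(7637, 0.7702381010520318), (21849, 0.6377564328862234)]], n_features
)
print(item.shape)  # (1, 53961)

# Stand-in for feature_log_prob_ (2 classes); the values are made up.
weights = np.full((2, n_features), -12.0)
scores = item @ weights.T  # same kind of product that failed in the traceback
print(scores.shape)  # (1, 2)
```

Using the same dimension value for training and prediction keeps both matrices compatible with the fitted model, regardless of which feature indices appear in a given batch.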