Add column to Tfidf matrix
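I want to build a classification model on text using the words plus a few additional features (for example, whether the text contains a link).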
tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']
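I use sklearn to get a sparse matrix of the text data:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
# tokenizer=nltk.word_tokenize is an assumed reading of the original positional "ntlk.tokenize"
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
min_df=0.1, stop_words='english',
use_idf=True, tokenizer=nltk.word_tokenize, ngram_range=(1,2))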
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)
I want to add columns to it to support additional features of my text data. I tried:
import scipy as sc
all_data = sc.hstack((tfidf_matrix, [1,0,1]))
This gives me data that looks like this:
array([ <3x8 sparse matrix of type '<type 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>,
1, 1, 0], dtype=object)
When I feed this data to a model:
`from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(all_data, y)`
I get a traceback:
`Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:/Users/c/Desktop/features.py", line 157, in <module>
clf = MultinomialNB().fit(all_data, y)
File "C:\Anaconda\lib\site-packages\sklearn\naive_bayes.py", line 302, in fit
_, n_features = X.shape
ValueError: need more than 1 value to unpack`
Edit: shape of the data
`tfidf_matrix.shape
(100, 2)
all_data.shape
(100L,)`
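Can I append columns directly to a sparse matrix? If not, how should I convert the data to a format that supports this? I am worried that anything other than a sparse matrix will increase the memory footprint.
Convert the sparse matrix to a dense matrix: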
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)
dense = tfidf_matrix.todense()
print(dense.shape)
newCol = [[1],[0],[1]]
allData = np.append(dense, newCol, 1)
print(allData.shape)
(3, 10)
(3, 11)
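To get a rough feel for what the dense route costs on a bigger corpus, here is a minimal sketch that compares the storage of a CSR matrix with the size its dense equivalent would have. The matrix dimensions and density are made-up numbers chosen for illustration, not measurements from the question's data.

```python
import scipy.sparse as sp

# Hypothetical corpus size: 10,000 documents, 2,000 features, 0.1% non-zero.
X = sp.random(10000, 2000, density=0.001, format='csr')

# A CSR matrix stores three arrays: values, column indices, and row pointers.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# A dense array stores every cell, zero or not.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(sparse_bytes, dense_bytes)
```

With these assumed numbers the dense array needs 160 MB while the sparse one stays far smaller, which is why calling `todense` only makes sense for small inputs like the three tweets above.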
"Can I append columns directly to a sparse matrix?" - 是的。你可能应该这样做,因为解包(使用 todense
或 toarray
)很容易导致大型语料库中的内存爆炸。
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)
print(tfidf_matrix.shape)
(3, 10)
new_column = np.array([[1],[0],[1]])
print(new_column.shape)
(3, 1)
final = sp.hstack((tfidf_matrix, new_column))
print(final.shape)
(3, 11)
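Putting the pieces together, the sketch below derives the "has a link" column from the tweets themselves, stacks it onto the Tfidf matrix in CSR format, and fits the classifier from the question. The labels `y` and the loose `\w+://` pattern are assumptions made for illustration; the pattern is deliberately permissive so that the misspelled `htt://link` in the first tweet still counts as a link.

```python
import re

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ['this tweet has a link htt://link',
          'this one does not',
          'this one does http://link.net']
y = [1, 0, 1]  # hypothetical labels, not given in the question

tfidf_matrix = TfidfVectorizer().fit_transform(tweets)

# One row per tweet, one column: 1.0 if the text contains something like "scheme://".
has_link = np.array([[1.0 if re.search(r'\w+://', t) else 0.0] for t in tweets])

# Stack column-wise and ask for CSR explicitly so the result stays sparse.
features = sp.hstack([tfidf_matrix, sp.csr_matrix(has_link)], format='csr')

clf = MultinomialNB().fit(features, y)
print(features.shape)
```

Any extra column added this way must be non-negative for `MultinomialNB`, and the same column has to be computed and stacked in the same order when transforming new tweets for prediction.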
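This is the correct form (`hstack` and `csr_matrix` live in the `scipy.sparse` submodule):
import scipy.sparse as sps
all_data = sps.hstack([tfidf_matrix, sps.csr_matrix([1,0,1]).T], 'csr')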