Can you add to a CountVectorizer in scikit-learn?
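I'd like to create a CountVectorizer in scikit-learn based on a corpus of text, and then later add more text to the CountVectorizer (adding to the original dictionary).
If I use transform(), it keeps the original vocabulary but adds no new words. If I use fit_transform(), it just regenerates the vocabulary from scratch. See below: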
In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.transform(["This not is a test"])
Out[5]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}
In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]:
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}
I'd like the equivalent of an update() function. I'd like it to work something like this:
In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.update(["This not is a test"])
Out[5]:
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}
Is there a way to do this?
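The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.
There is a way to get what you want by thinking about it slightly differently, though. Instead of updating the vectorizer, refit it on the original corpus plus the new text. See the following code: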
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# Fit on the original corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)
# Refit on the original corpus plus the new text.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)
Output (the key order may differ):
{'this': 2, 'test': 1, 'is': 0}
{'this': 3, 'test': 2, 'is': 0, 'not': 1}
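If you want something that feels like update(), the refit approach can be wrapped in a small helper. This is only a minimal sketch: the class name `UpdatableCountVectorizer` and its `update()` method are hypothetical and not part of scikit-learn. It assumes you can afford to keep every document in memory, and it still rebuilds the vocabulary from scratch on each call.

```python
from sklearn.feature_extraction.text import CountVectorizer


class UpdatableCountVectorizer:
    """Hypothetical wrapper (not a scikit-learn class) that emulates update()."""

    def __init__(self, **kwargs):
        # Keyword arguments are passed straight through to CountVectorizer.
        self.vectorizer = CountVectorizer(**kwargs)
        self.corpus = []  # every document seen so far

    def update(self, documents):
        # Append the new documents, then refit on the whole accumulated corpus.
        self.corpus.extend(documents)
        return self.vectorizer.fit_transform(self.corpus)

    @property
    def vocabulary_(self):
        return self.vectorizer.vocabulary_


vec = UpdatableCountVectorizer()
vec.update(["This is a test"])
first_vocab = dict(vec.vocabulary_)

counts = vec.update(["This not is a test"])
second_vocab = dict(vec.vocabulary_)

print(first_vocab)
print(second_vocab)
print(counts.shape)  # one row per accumulated document
```

Note that the indices of existing words can shift after an update ('test' moves from 1 to 2 here), so any count matrix computed before the update has to be recomputed, which is why `update()` returns the matrix for the whole accumulated corpus.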