Returning term position in a document with scikit-learn
I know that scikit-learn follows the bag-of-words assumption/model, according to the documentation. However, is there a way to extract term positions while computing tf-idf?
For example, if I have these documents:
document1 = "foo bar baz"
document2 = "bar bar baz"
can I somehow get this (a tuple/list of term ids):
document1_terms = (1, 2, 3)
document2_terms = (2, 2, 3)
or this (a dict of terms, with tuples of positions as values):
document1_terms = {1: (1, ), 2: (2, ), 3: (3, )}
document2_terms = {2: (1, 2), 3: (3, )}
Do you mean something like this?
In [13]: from sklearn.feature_extraction.text import CountVectorizer
In [14]: vectorize = CountVectorizer(min_df=1)
In [15]: document1 = "foo bar baz"
...: document2 = "bar bar baz dee"
...:
In [16]: documents = [document1, document2]
In [17]: d = vectorize.fit_transform(documents)
In [18]: vectorize.vocabulary_
Out[18]: {u'bar': 0, u'baz': 1, u'dee': 2, u'foo': 3}
In [19]: d.todense()
Out[19]:
matrix([[1, 1, 0, 1],
        [2, 1, 1, 0]], dtype=int64)
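Note that d holds raw term counts, not positions or tf-idf weights. Since the question mentions tf-idf, the counts can be reweighted with TfidfTransformer; this is a minimal sketch continuing the session above (d is the count matrix from In [17], and the columns still map to the ids in vectorize.vocabulary_):

from sklearn.feature_extraction.text import TfidfTransformer

# Reweight the raw counts as tf-idf; column j still corresponds to
# the term with id j in vectorize.vocabulary_.
tfidf = TfidfTransformer().fit_transform(d)
print(tfidf.todense())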
After some trial and error, I found a way to solve this. First, fit a vectorizer on the collection (here, a pandas DataFrame with a 'document' column):
vectorizer = CountVectorizer()
term_doc_freq = vectorizer.fit_transform(collection['document'])
Then represent each document as a tuple of term ids:
from functools import partial

def document_get_position(row, vectorizer):
    # Use the fitted vectorizer's analyzer so tokens go through the same
    # preprocessing (lowercasing, token pattern) that built vocabulary_;
    # build_tokenizer() alone skips that step and would return None ids
    # for, e.g., capitalized tokens.
    result = tuple()
    for token in vectorizer.build_analyzer()(row['document']):
        result = result + (vectorizer.vocabulary_.get(token),)
    return result

positions = collection.apply(partial(document_get_position,
                                     vectorizer=vectorizer),
                             axis=1)
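To get the second form from the question, a dict mapping each term id to a tuple of its positions, the tuples in positions can be regrouped. A minimal sketch, where positions_by_term is a hypothetical helper name and positions is the Series produced above:

from collections import defaultdict

def positions_by_term(term_ids):
    # Regroup an ordered tuple of term ids into {term_id: (positions, ...)},
    # numbering positions from 1 as in the question's example output.
    index = defaultdict(tuple)
    for position, term_id in enumerate(term_ids, start=1):
        index[term_id] += (position,)
    return dict(index)

document_positions = positions.apply(positions_by_term)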