Token pattern in CountVectorizer, scikit-learn

Question

I have a list of keyword strings like the following:

[u"ALZHEIMER'S DISEASE, OLFACTORY, AGING", 
 u"EEG, COGNITIVE CONTROL, FATIGUE", 
 u"AGING, OBESITY, GENDER", 
 u"AGING, COGNITIVE CONTROL, BRAIN IMAGING"]

I want to tokenize them with CountVectorizer so that my model ends up with the following vocabulary:

[{'ALZHEIMER\'S DISEASE': 0, 'OLFACTORY': 1, 'AGING': 2, 'BRAIN IMAGING': 3, ...}]

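Basically, I want the comma to act as my token delimiter (the last term in each string has no trailing comma). That said, feel free to append a , to the end of each string. Here is the code snippet I have right now: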

from sklearn.feature_extraction.text import CountVectorizer
ls = [u"ALZHEIMER'S DISEASE, OLFACTORY, AGING", 
      u"EEG, COGNITIVE CONTROL, FATIGUE", 
      u"AGING, OBESITY, GENDER", 
      u"AGING, COGNITIVE CONTROL, BRAIN IMAGING"]
tfidf_model = CountVectorizer(min_df=1, max_df=1, token_pattern=r'(\w{1,}),')
tfidf_model.fit_transform(ls)
print tfidf_model.vocabulary_.keys()
>>> [u'obesity', u'eeg', u'olfactory', u'disease']

Feel free to comment if you need more information.

Answer 1

Here is the answer I came up with. I first convert each document into a list of terms, which gives a list of lists.

docs = list(map(lambda s: s.lower().split(', '), ls)) # list of list
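
As a quick illustration of what this split step produces, here is a small self-contained sketch that reuses the keyword list from the question. It assumes every term is separated by exactly a comma followed by one space.

```python
# The keyword strings from the question, one comma-separated string per document.
ls = ["ALZHEIMER'S DISEASE, OLFACTORY, AGING",
      "EEG, COGNITIVE CONTROL, FATIGUE",
      "AGING, OBESITY, GENDER",
      "AGING, COGNITIVE CONTROL, BRAIN IMAGING"]

# Lowercase each string and split it on ', ' so multi-word terms stay intact.
docs = [s.lower().split(', ') for s in ls]

for doc in docs:
    print(doc)
# ["alzheimer's disease", 'olfactory', 'aging']
# ['eeg', 'cognitive control', 'fatigue']
# ['aging', 'obesity', 'gender']
# ['aging', 'cognitive control', 'brain imaging']
```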

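I then wrote a function that builds a vocabulary dictionary from the terms in those lists and converts the term lists into a sparse matrix: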

import collections
from itertools import chain

import scipy.sparse as sp

def tag_to_sparse(docs):
    # Flatten the per-document term lists into one list of terms
    docs_list = list(chain.from_iterable(docs))
    # Index terms by descending frequency (the most frequent term gets column 0)
    counter = collections.Counter(docs_list)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    vocabulary = {term: i for i, (term, _) in enumerate(count_pairs)}

    # Record one (row, column) pair per term occurrence
    row_ind = []
    col_ind = []
    for i, doc in enumerate(docs):
        for w in doc:
            row_ind.append(i)
            col_ind.append(vocabulary[w])
    value = [1] * len(row_ind)
    # Build the document-term count matrix; repeated pairs are summed
    X = sp.csr_matrix((value, (row_ind, col_ind)))
    X.sum_duplicates()
    return X, vocabulary

I can then simply call X, vocabulary = tag_to_sparse(docs) to get the sparse matrix and the vocabulary dictionary.
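
To make the vocabulary step more concrete, here is a standard-library-only sketch of how the frequency-sorted indices come out for the question's keywords. It mirrors the Counter logic in tag_to_sparse but leaves out the sparse matrix; the hard-coded docs list is just the question's data after lowercasing and splitting.

```python
import collections

# The question's keywords, already lowercased and split on ', '.
docs = [["alzheimer's disease", "olfactory", "aging"],
        ["eeg", "cognitive control", "fatigue"],
        ["aging", "obesity", "gender"],
        ["aging", "cognitive control", "brain imaging"]]

# Count every term across all documents.
counter = collections.Counter(w for doc in docs for w in doc)

# Sort by descending count; sorted() is stable, so ties keep first-seen order.
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
vocabulary = {term: i for i, (term, _) in enumerate(count_pairs)}

print(vocabulary["aging"])              # 0 (appears in three documents)
print(vocabulary["cognitive control"])  # 1 (appears in two documents)
print(len(vocabulary))                  # 9 distinct terms
```

Because the indices depend on term frequency, the column order here differs from the alphabetical order that CountVectorizer assigns when it builds its own vocabulary.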

Update: I just found a simpler answer. You can use the tokenizer argument to trick scikit-learn into treating , as the delimiter:

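import numpy as np
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer

# Build the vocabulary from the unique lowercased terms
vocabulary = list(map(lambda x: x.lower().split(', '), ls))
vocabulary = list(np.unique(list(chain(*vocabulary))))

# Split on ', ' instead of using the default word-level token pattern
model = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda x: x.split(', '))
X = model.fit_transform(ls)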
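
For comparison, here is a sketch of why the pattern in the question loses terms, plus one regex that might work as a token_pattern instead of a custom tokenizer. The second pattern (comma_pattern) is my own assumption and is not part of the answer above. As far as I know, CountVectorizer lowercases the text by default and then applies the compiled pattern with findall, so plain re.findall on a lowercased string should behave the same way.

```python
import re

# One document from the question, lowercased as CountVectorizer does by default.
doc = "ALZHEIMER'S DISEASE, OLFACTORY, AGING".lower()

# The question's pattern: a run of word characters directly followed by a comma.
# It cannot span the apostrophe or the space, and the last term has no comma.
question_pattern = r'(\w{1,}),'
question_tokens = re.findall(question_pattern, doc)
print(question_tokens)  # ['disease', 'olfactory']

# Assumed alternative: start at a character that is neither a comma nor
# whitespace, then take everything up to the next comma.
comma_pattern = r'[^,\s][^,]*'
comma_tokens = re.findall(comma_pattern, doc)
print(comma_tokens)  # ["alzheimer's disease", 'olfactory', 'aging']
```

Note that the question's code also passes max_df=1, which, being an integer, drops any term found in more than one document. Also, comma_pattern keeps any whitespace that precedes a comma, so it assumes the terms are written without trailing spaces.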
