Regex pattern for only words or numbers when tokenizing with CountVectorizer

I am using Python's CountVectorizer to tokenize sentences while filtering out non-words such as "1s2".

Which re pattern should I use to select only English words and numbers? The following regex gets me very close:

from sklearn.feature_extraction.text import CountVectorizer

pattern = r'(?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)*'

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words=None,
                             token_pattern=pattern)
tokenize = vectorizer.build_tokenizer()

tokenize('this is a test test1 and 12.')

['this', '', 'is', '', 'a', '', 'test', '', '', '', '',
 '', '', '', '', 'and', '', '12', '', '']

But I don't understand why it gives me so many empty list items ('').

Also, how can I keep the punctuation? In the end I would like to get a result like this:

tokenize('this is a test test1 and 12.')

['this', 'is', 'a', 'test', 'and', '12', '.']

I don't know whether sklearn's CountVectorizer can do this in one step (I think token_pattern is overridden by tokenizer), but you can do the following:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=None,
                             tokenizer=TreebankWordTokenizer().tokenize)
tokenize = vectorizer.build_tokenizer()

tokenList = tokenize('this is a test test1 and 12.')
# ['this', 'is', 'a', 'test', 'test1', 'and', '12', '.']

# Keep only tokens that consist entirely of letters, entirely of digits,
# or a single punctuation character (this drops 'test1')
tokenList = [token for token in tokenList if re.match(r'^([a-zA-Z]+|\d+|\W)$', token)]
# ['this', 'is', 'a', 'test', 'and', '12', '.']
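
If you would rather have everything happen inside CountVectorizer in one step, a minimal sketch is to pass a small wrapper function as the tokenizer argument (the name treebank_and_filter is my own, not part of sklearn or nltk):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re

def treebank_and_filter(text):
    # Treebank-tokenize, then keep only all-letter, all-digit,
    # or single-punctuation tokens
    tokens = TreebankWordTokenizer().tokenize(text)
    return [t for t in tokens if re.match(r'^([a-zA-Z]+|\d+|\W)$', t)]

vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None,
                             tokenizer=treebank_and_filter)
tokenize = vectorizer.build_tokenizer()

tokenize('this is a test test1 and 12.')
# ['this', 'is', 'a', 'test', 'and', '12', '.']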

Edit: I forgot to explain why your pattern does not work.

  • "The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." (How sklearn's token_pattern works). So the punctuation is completely ignored.
  • Your pattern (?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)* is effectively saying: 'Interpret as unicode, word boundaries with letters in between (or not (the *)) and word boundaries with digits in between (or not (again a *))'. Because of all the 'or not's, the empty string '' (nothing) is also something you are searching for! That is where all the empty list items come from, as the snippet below shows.
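
A minimal sketch to see both points with plain re (build_tokenizer uses findall with token_pattern internally):

import re

# Your pattern: both groups are optional, so a zero-length match
# succeeds at every position where neither group matches.
pattern = r'(?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)*'
print(re.findall(pattern, 'test1'))
# ['', '', '', '', '', '']

# sklearn's default token_pattern, for comparison: it only keeps
# runs of 2+ word characters, so 'a' and '.' are dropped.
print(re.findall(r'(?u)\b\w\w+\b', 'this is a test test1 and 12.'))
# ['this', 'is', 'test', 'test1', 'and', '12']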