TfidfVectorizer: print results based on all words
Although there are six distinct words, only 5 appear in the output. How can I get a result based on all words (a 6-column vector)?
from sklearn.feature_extraction.text import TfidfVectorizer

sent = ["This is a sample", "This is another example"]
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=0)
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())
[[ 0.          0.          0.50154891  0.70490949  0.50154891]
 [ 0.57615236  0.57615236  0.40993715  0.          0.40993715]]
Also, how can I print the column details (the features, i.e. words) and the rows (the documents)?
You are using the default token_pattern, which only selects tokens of 2 or more characters.
token_pattern :
    Regular expression denoting what constitutes a "token", only used
    if analyzer == 'word'. The default regexp selects tokens of 2 or
    more alphanumeric characters (punctuation is completely ignored
    and always treated as a token separator)
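To see why 'a' disappears, you can run the default regexp (which in scikit-learn is r"(?u)\b\w\w+\b", i.e. two or more word characters) directly with Python's re module:

```python
import re

# The default token_pattern used by CountVectorizer/TfidfVectorizer:
# two or more word characters between word boundaries.
default_pattern = r"(?u)\b\w\w+\b"

tokens = re.findall(default_pattern, "This is a sample".lower())
print(tokens)  # 'a' is not matched: it is only one character long
```

The single-character token 'a' never reaches the vocabulary, which is why the matrix only has 5 columns.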
If you define a new token_pattern, you will also get the 'a' token, for example:
from sklearn.feature_extraction.text import TfidfVectorizer

sent = ["This is a sample", "This is another example"]
# Use a raw string: in u'(?u)\b\w+\b' the \b would be a backspace character.
tf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())
[[ 0.57615236  0.          0.          0.40993715  0.57615236  0.40993715]
 [ 0.          0.57615236  0.57615236  0.40993715  0.          0.40993715]]
tf.vocabulary_
{u'a': 0, u'sample': 4, u'another': 1, u'this': 5, u'is': 3, u'example': 2}