Get full text from TfidfVectorizer
I'm plotting a set of text documents in 2D, and I've noticed some outliers that I'd like to be able to identify. I'm starting from the raw text and using the TfidfVectorizer built into SKLearn.
vectorizer = TfidfVectorizer(max_df=0.5, max_features=None,
min_df=2, stop_words='english',
use_idf=True, lowercase=True)
corpus = make_corpus(root)
X = vectorizer.fit_transform(corpus)
To reduce it to 2D, I'm using TruncatedSVD.
reduced_data = TruncatedSVD(n_components=2).fit_transform(X)
If I want to find which text document has the highest second principal component (the y-axis), how would I do that?
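As a side note, one way to make such outliers identifiable in the plot is to label each point with its row index, since the rows of reduced_data are in the same order as corpus. A minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

# scatter the 2D projection and tag each point with its document index
plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
for i, (x, y) in enumerate(reduced_data):
    plt.annotate(str(i), (x, y))  # row i corresponds to corpus[i]
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.show()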
So, as I understand it, you want to know which document maximizes a particular principal component. Here's a toy example I came up with:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
corpus = [
'this is my first corpus',
'this is my second corpus which is longer than the first',
'here is yet another one, but it is brief',
'and watch out for number four chuggin along',
'blah blah blah my final sentence yada yada yada'
]
vectorizer = TfidfVectorizer(stop_words='english',
use_idf=True, lowercase=True)
# first get TFIDF matrix
X = vectorizer.fit_transform(corpus)
# second compress to two dimensions
svd = TruncatedSVD(n_components=2).fit(X)
reduced = svd.transform(X)
# now, find the doc with the highest 2nd prin comp
corpus[np.argmax(reduced[:, 1])]
This yields:
'and watch out for number four chuggin along'
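If the goal is to inspect several potential outliers rather than just the single maximum, the same column can be sorted instead of taking a single argmax. A short sketch building on the variables above (the choice of k = 3 is arbitrary):
# indices of the k documents with the largest second principal component,
# largest first
k = 3
top_idx = np.argsort(reduced[:, 1])[::-1][:k]
for i in top_idx:
    print(i, corpus[i])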