Plotting vectorized text documents in matplotlib?

Question

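I have converted a large batch of PDF documents to text and compiled them into a dictionary. I know there are 3 different document types, and I want to use clustering to group them automatically: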

dict_of_docs = {'document_1':'contents of document', 'document_2':'contents of document', 'document_3':'contents of document',...'document_100':'contents of document'}

I then vectorize the dictionary's values:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

My X output looks like this:

  (0, 768)  0.05895270500636258
  (0, 121)  0.11790541001272516
  (0, 1080) 0.05895270500636258
  (0, 87)   0.2114378682212116
  (0, 1458) 0.1195944498355368
  (0, 683)  0.0797296332236912
  (0, 1321) 0.12603709835806634
  (0, 630)  0.12603709835806634
  (0, 49)   0.12603709835806634
  (0, 750)  0.12603709835806634
  (0, 1749) 0.10626171032944469
  (0, 478)  0.12603709835806634
  (0, 1632) 0.14983692373373858
  (0, 177)  0.12603709835806634
  (0, 653)  0.0497440271723707
  (0, 1268) 0.13342186854440274
  (0, 1489) 0.07052056544031632
  (0, 72)   0.12603709835806634
  ...etc etc

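I then convert it to an array: X = X.toarray()

I am now at the stage of trying to plot the clusters from my real data as a matplotlib scatter plot. From there I want to use what I learn from the clustering to sort the documents. Every guide I have followed uses made-up data arrays, but none of them show how to get from real-world data to something that can be used the way they demonstrate.

How do I get my array of vectorized data into a scatter plot?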

Answer 1

How do I get my array of vectorised data into a scatter plot?

It takes a few steps: clustering, dimensionality reduction, plotting, and debugging.

Clustering:

We use K-Means to fit X (our TF-IDF vectorized dataset).

from sklearn.cluster import KMeans

NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS, 
    init='k-means++', 
    max_iter=500)
km.fit(X)
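
Since the goal is to sort the documents, it may help to map each cluster label back to its document name. The following is a minimal sketch on a small made-up corpus: toy_docs, X_toy, km_toy and grouped are hypothetical names, and n_init and random_state are set only to make the run repeatable. It relies on the dictionary keeping insertion order, so row i of the TF-IDF matrix corresponds to the i-th key.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real dict_of_docs from the question.
toy_docs = {
    "document_1": "invoice payment amount due invoice total",
    "document_2": "invoice total payment received amount",
    "document_3": "contract party agreement clause signature",
    "document_4": "agreement clause contract termination party",
    "document_5": "resume experience education skills employment",
    "document_6": "education skills resume employment history",
}

names = list(toy_docs.keys())
X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_docs.values())

# n_init and random_state are assumptions added for repeatability.
km_toy = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
km_toy.fit(X_toy)

# labels_[i] is the cluster assigned to row i of X_toy, which is the
# i-th document in insertion order.
grouped = defaultdict(list)
for name, label in zip(names, km_toy.labels_):
    grouped[int(label)].append(name)

for label in sorted(grouped):
    print(label, grouped[label])
```

With the real data, the same loop over dict_of_docs.keys() and km.labels_ should give the document names per cluster.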

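Dimensionality reduction:

The TF-IDF output is a matrix with one column per vocabulary term, which is far too many dimensions to plot directly. We need 2 or 3 dimensions.
We can apply PCA and then plot the two most important principal components (the first two).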

from sklearn.decomposition import PCA

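# First: for every document we get its corresponding cluster
clusters = km.predict(X)

# Fit PCA on a dense ndarray. If X is still the sparse matrix returned by
# TfidfVectorizer, convert it; if X.toarray() was already called, use X as-is.
# (The np.matrix returned by todense() is rejected by recent scikit-learn.)
X_dense = X.toarray() if hasattr(X, "toarray") else X
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X_dense)

scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component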
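
If the TF-IDF matrix is too large to convert to a dense array, one possible alternative (not the method used in this answer) is TruncatedSVD, which accepts the sparse matrix directly. Unlike PCA it does not center the data, so the resulting coordinates will differ. This is only a sketch: sample_texts, X_sparse and two_dim_svd are hypothetical names, and random_state is set for repeatability.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts standing in for the real documents.
sample_texts = [
    "invoice payment total amount",
    "contract agreement clause party",
    "resume education skills employment",
    "invoice amount payment received",
]

# Keep the TF-IDF output sparse; no toarray() / todense() call is needed.
X_sparse = TfidfVectorizer().fit_transform(sample_texts)

# Reduce to two components directly on the sparse matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
two_dim_svd = svd.fit_transform(X_sparse)

print(two_dim_svd.shape)  # (4, 2): one row per document, two coordinates
```

The two columns of two_dim_svd can then be used for the scatter plot in the same way as the PCA coordinates.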

Plotting:

We plot each cluster with a pre-assigned color.

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}

# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
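
The color dictionary above has exactly three entries, so it raises a KeyError if the number of clusters grows. A possible variation, sketched here under the assumption that a built-in matplotlib colormap is acceptable, picks colors from "tab10" instead. plot_clusters, demo_points and demo_labels are hypothetical names, and the demo data is random, not real document coordinates.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_clusters(points, labels):
    """Scatter 2-D points, one color and legend entry per cluster label."""
    fig, ax = plt.subplots(figsize=(10, 5))
    labels = np.asarray(labels)
    cmap = plt.get_cmap("tab10")  # 10 distinct colors, reused cyclically
    for group in np.unique(labels):
        mask = labels == group
        ax.scatter(points[mask, 0], points[mask, 1],
                   color=cmap(int(group) % cmap.N),
                   label=f"cluster {group}")
    ax.set_xlabel("PCA 0")
    ax.set_ylabel("PCA 1")
    ax.legend()
    return fig, ax


# Random demo data with five labels, only to exercise the function.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(30, 2))
demo_labels = rng.integers(0, 5, size=30)
fig, ax = plot_clusters(demo_points, demo_labels)
```

With the arrays from this answer, the call would be plot_clusters(two_dim, clusters) followed by plt.show().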

Debugging:

Print the top 10 terms in each cluster.

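# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(NUMBER_OF_CLUSTERS):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()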

# Cluster 0: com edu medical yeast know cancer does doctor subject lines
# Cluster 1: edu game games team baseball com year don pitcher writes
# Cluster 2: edu car com subject organization lines university writes article


Tags: python, cluster-analysis, k-means
