Clustering sentence vectors in a dictionary

Question

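I'm working with a unique situation. I have words in Language1 that are defined in English. I take each English word in a definition, look up its word vector in the pretrained GoogleNews w2v model, and average the vectors for each definition. The result looks like this, shown here with 3-dimensional vectors: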

L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}

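What I want to do is cluster the dictionary keys by their numpy array values (probably with K-means, but I'm open to other ideas). I've done this before with standard w2v models, but my problem here is that the data is a dictionary. Can I convert it to another kind of dataset? I'm inclined to write it to a CSV file or make it a pandas DataFrame and process it with Pandas or R, but I've been told that floats are a problem there and that something binary is needed (that is, they lose information in unpredictable ways). I tried saving my dictionary to HDF5, but dictionaries are not supported.

Thanks in advance!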

Answer 1

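If I understand your question correctly, you want to cluster words according to their W2V representations, but you have them stored in a dictionary. If so, I don't think this is a unique situation at all. All you have to do is convert the dictionary into a matrix and then run the clustering on that matrix. If each row of the matrix represents one word in the dictionary, you should be able to refer back to the words after clustering.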
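As a rough illustration of the dictionary-to-matrix step, here is a minimal sketch using only numpy. It reuses the two example vectors from the question; the names `words`, `X`, and `row_of` are just illustrative choices.

```python
import numpy as np

# Toy stand-in for the asker's dictionary (values copied from the question).
L1_words = {
    "word1": np.array([5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=np.float32),
    "word2": np.array([5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=np.float32),
}

# Fix the key order once, so that row i of the matrix always means words[i].
words = list(L1_words)
X = np.vstack([L1_words[w] for w in words])  # shape: (n_words, n_dims)

# Reverse lookup (word -> row index), handy after clustering.
row_of = {w: i for i, w in enumerate(words)}

print(X.shape, X.dtype)
```

Any clustering routine that accepts a 2-D array can then work on `X`, and `words[i]` recovers the key for row `i`.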

I couldn't test the code below, so it may not work exactly as written, but the idea is the following:

from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

# build the matrix: row i holds the vector for words[i]
words = list(L1_words.keys())  # a list, so it can be indexed below
X = [L1_words[w] for w in words]

# perform the clustering on the matrix
NUM_CLUSTERS = 3  # should be at most the number of words
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

# print the cluster each word belongs to
for i in range(len(X)):
    print(words[i], assigned_clusters[i])
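
If a per-cluster view is more useful than a per-word printout, the assignments can be regrouped with plain Python. This is only a sketch: the `words` and `assigned_clusters` values below are made-up placeholders for a clustering result, and `group_by_cluster` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical clustering result: one cluster id per word,
# in the same order as `words`.
words = ["word1", "word2", "word3", "word4"]
assigned_clusters = [0, 2, 0, 1]


def group_by_cluster(words, assigned_clusters):
    """Return a dict mapping each cluster id to the list of its words."""
    groups = defaultdict(list)
    for word, cluster_id in zip(words, assigned_clusters):
        groups[cluster_id].append(word)
    return dict(groups)


clusters = group_by_cluster(words, assigned_clusters)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```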

You can read more details at this link.
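
On the storage concern raised in the question (CSV and floats, HDF5 and dictionaries), one option that might fit is numpy's own `.npz` format, which stores the raw array bytes. The sketch below saves the keys and the matrix as two arrays and rebuilds the dictionary on load. It writes to an in-memory buffer to stay self-contained; the array names `words` and `vectors` are arbitrary choices.

```python
import io

import numpy as np

# Toy stand-in for the asker's dictionary (values copied from the question).
L1_words = {
    "word1": np.array([5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=np.float32),
    "word2": np.array([5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=np.float32),
}

words = list(L1_words)
X = np.vstack([L1_words[w] for w in words])

# An in-memory buffer stands in for a file here; a file path could be
# passed to np.savez / np.load instead to write to disk.
buffer = io.BytesIO()
np.savez(buffer, words=np.array(words), vectors=X)

# Load the two arrays back and rebuild the dictionary.
buffer.seek(0)
with np.load(buffer) as data:
    restored = {str(w): v for w, v in zip(data["words"], data["vectors"])}
```

Because the values are stored in binary rather than as decimal text, the restored float32 vectors should be bit-for-bit identical to the originals.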

python

dictionary

nlp

cluster-analysis