How to find a 'connection' between words for clustering sentences

I need to connect the words 4G, mobile phones and Internet in order to cluster sentences about technology together. I have the following sentences:

4G is the fourth generation of broadband network.
4G is slow. 
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone. 

I need the sentences above to end up in the same cluster. Currently they don't, probably because nothing links 4G to mobile. I first tried wordnet.synsets to find synonyms connecting 4G to Internet or phone, but unfortunately it found no connection.
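A minimal version of that WordNet check (assuming the WordNet corpus has already been downloaded via nltk.download) looks roughly like this:

from nltk.corpus import wordnet

# '4G' has no synsets at all in WordNet, so no synonym path to
# 'internet' or 'phone' can ever be found.
print(wordnet.synsets("4G"))        # []
print(wordnet.synsets("internet"))  # [Synset('internet.n.01')]
print(wordnet.synsets("phone"))     # several synsets, but none reachable from '4G'

The code I am using to cluster the sentences is the following: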

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy

texts = ["4G is the fourth generation of broadband network.",
    "4G is slow.",
    "4G is defined as the fourth generation of mobile technology",
    "I bought a new mobile phone."]

# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names_out()
print("words", words)


n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try).fit(X)

labels = model.labels_
# indices of the highest-weighted words in each cluster, in descending order
ordered_words = model.cluster_centers_.argsort()[:, ::-1]

print("centers:", model.cluster_centers_)
print("labels", labels)
print("intertia:", model.inertia_)

texts_per_cluster = numpy.zeros(n_clusters)
for i_cluster in range(n_clusters):
    for label in labels:
        if label==i_cluster:
            texts_per_cluster[i_cluster] +=1 

print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster])),
    for term in ordered_words[i_cluster, :10]:
        print("\t"+words[term])

print("\n")
print("Prediction")

text_to_predict = "Why 5G is dangerous?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1

print(text_to_predict)
print("Cluster:", predicted_cluster, "texts:", int(texts_per_cluster[predicted_cluster])),
for term in ordered_words[predicted_cluster, :10]:
print("\t"+words[term])

Any help would be greatly appreciated.

As @sergey-bushmanov's comment notes, dense word embeddings (from word2vec or similar algorithms) may help.

They turn words into dense high-dimensional vectors in which words with similar meanings/usages are close to each other. Better yet, certain directions in that space often correspond loosely to kinds of relationships between words.

So, word-vectors trained on sufficiently representative (large and varied) text will place the vectors for '4G' and 'mobile' near each other, and if your sentence representations are then bootstrapped from those word-vectors, that may help your clustering. A quick check of this closeness is sketched below.
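For instance, with pretrained vectors loaded through gensim's downloader API you can inspect these similarities directly (the model name here is just one convenient choice, and I am assuming the token '4g' is in its vocabulary):

import gensim.downloader as api

# Load pretrained GloVe vectors (a ~128 MB download on first use).
kv = api.load("glove-wiki-gigaword-100")

# Cosine similarity: related tech terms should score noticeably higher
# than unrelated pairs (assuming '4g' is in the model's vocabulary).
print(kv.similarity("4g", "mobile"))
print(kv.similarity("4g", "banana"))

# Nearest neighbours of 'mobile' in the vector space.
print(kv.most_similar("mobile", topn=5))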

One quick way to model a sentence from individual word-vectors is to use the average of all the sentence's word-vectors as the sentence vector. That is too simple to capture many shades of meaning (especially those arising from grammar and word order), but it often works as a good baseline, especially for broadly topical questions like yours. A sketch of this baseline follows.
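A minimal sketch of that averaging baseline, reusing the pretrained kv vectors loaded above (the crude lowercase/strip-punctuation tokenization is my simplification, and n_clusters=2 is an arbitrary choice for these four sentences):

import numpy as np
from sklearn.cluster import KMeans

texts = ["4G is the fourth generation of broadband network.",
         "4G is slow.",
         "4G is defined as the fourth generation of mobile technology",
         "I bought a new mobile phone."]

def sentence_vector(sentence, kv):
    # Average the vectors of all in-vocabulary words; skip unknown tokens.
    tokens = [w.strip(".,!?") for w in sentence.lower().split()]
    vectors = [kv[w] for w in tokens if w in kv]
    if not vectors:
        return np.zeros(kv.vector_size)
    return np.mean(vectors, axis=0)

X = np.array([sentence_vector(t, kv) for t in texts])
model = KMeans(n_clusters=2, n_init=10).fit(X)
print(model.labels_)  # the three 4G sentences should tend to share a label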

Another approach, "Word Mover's Distance", treats sentences as sets of word-vectors (without averaging them) and can perform sentence-to-sentence distance calculations that are more powerful than simple averaging, but it becomes very expensive to compute for longer sentences.
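With gensim's KeyedVectors this is one call per sentence pair; a rough sketch, reusing kv from above (wmdistance needs an optimal-transport backend installed, e.g. the POT package for gensim 4.x):

# Word Mover's Distance between tokenized sentences: lower = more similar.
doc_a = "4g is the fourth generation of mobile technology".split()
doc_b = "i bought a new mobile phone".split()
doc_c = "the cat sat on the mat".split()

print(kv.wmdistance(doc_a, doc_b))  # related sentences: smaller distance
print(kv.wmdistance(doc_a, doc_c))  # unrelated sentence: larger distance

To cluster with these distances, you could build a full pairwise distance matrix and feed it to an algorithm that accepts precomputed distances, such as scikit-learn's AgglomerativeClustering, though the cost grows quadratically with the number of sentences.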