How to find a 'connection' between words for clustering sentences

I need to connect the words 4G, mobile phones and Internet in order to cluster sentences about technology together. I have the following sentences:

4G is the fourth generation of broadband network.
4G is slow. 
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone. 

I need the sentences above to end up in the same cluster. Currently they don't, probably because nothing links 4G to mobile. I first tried wordnet.synsets to find synonyms connecting 4G to Internet or phone, but unfortunately it found no connection.
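A minimal version of that WordNet check (assuming the WordNet corpus has already been downloaded via nltk.download) looks roughly like this:

from nltk.corpus import wordnet

# '4G' has no synsets at all in WordNet, so no synonym path to
# 'internet' or 'phone' can ever be found.
print(wordnet.synsets("4G"))        # []
print(wordnet.synsets("internet"))  # [Synset('internet.n.01')]
print(wordnet.synsets("phone"))     # several synsets, but none reachable from '4G'

The code I am using to cluster the sentences is the following: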

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy

texts = ["4G is the fourth generation of broadband network.",
    "4G is slow.",
    "4G is defined as the fourth generation of mobile technology",
    "I bought a new mobile phone."]

# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names_out()
print("words", words)


n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try).fit(X)

labels = model.labels_
# indices of the highest-weighted words in each cluster, in descending order
ordered_words = model.cluster_centers_.argsort()[:, ::-1]

print("centers:", model.cluster_centers_)
print("labels", labels)
print("intertia:", model.inertia_)

texts_per_cluster = numpy.zeros(n_clusters)
for i_cluster in range(n_clusters):
    for label in labels:
        if label==i_cluster:
            texts_per_cluster[i_cluster] +=1 

print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster])),
    for term in ordered_words[i_cluster, :10]:
        print("\t"+words[term])

print("\n")
print("Prediction")

text_to_predict = "Why 5G is dangerous?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1

print(text_to_predict)
print("Cluster:", predicted_cluster, "texts:", int(texts_per_cluster[predicted_cluster])),
for term in ordered_words[predicted_cluster, :10]:
print("\t"+words[term])

Any help would be greatly appreciated.

As @sergey-bushmanov's comment notes, dense word embeddings (from word2vec or similar algorithms) may help.

They turn words into dense high-dimensional vectors in which words with similar meanings/usages are close to each other. Better yet, certain directions in that space often correspond loosely to kinds of relationships between words.

So, word-vectors trained on sufficiently representative (large and varied) text will place the vectors for '4G' and 'mobile' near each other, and if your sentence representations are then bootstrapped from those word-vectors, that may help your clustering. A quick check of this closeness is sketched below.
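For instance, with pretrained vectors loaded through gensim's downloader API you can inspect these similarities directly (the model name here is just one convenient choice, and I am assuming the token '4g' is in its vocabulary):

import gensim.downloader as api

# Load pretrained GloVe vectors (a ~128 MB download on first use).
kv = api.load("glove-wiki-gigaword-100")

# Cosine similarity: related tech terms should score noticeably higher
# than unrelated pairs (assuming '4g' is in the model's vocabulary).
print(kv.similarity("4g", "mobile"))
print(kv.similarity("4g", "banana"))

# Nearest neighbours of 'mobile' in the vector space.
print(kv.most_similar("mobile", topn=5))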

One quick way to model a sentence from individual word-vectors is to use the average of all the sentence's word-vectors as the sentence vector. That is too simple to capture many shades of meaning (especially those arising from grammar and word order), but it often works as a good baseline, especially for broadly topical questions like yours. A sketch of this baseline follows.
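A minimal sketch of that averaging baseline, reusing the pretrained kv vectors loaded above (the crude lowercase/strip-punctuation tokenization is my simplification, and n_clusters=2 is an arbitrary choice for these four sentences):

import numpy as np
from sklearn.cluster import KMeans

texts = ["4G is the fourth generation of broadband network.",
         "4G is slow.",
         "4G is defined as the fourth generation of mobile technology",
         "I bought a new mobile phone."]

def sentence_vector(sentence, kv):
    # Average the vectors of all in-vocabulary words; skip unknown tokens.
    tokens = [w.strip(".,!?") for w in sentence.lower().split()]
    vectors = [kv[w] for w in tokens if w in kv]
    if not vectors:
        return np.zeros(kv.vector_size)
    return np.mean(vectors, axis=0)

X = np.array([sentence_vector(t, kv) for t in texts])
model = KMeans(n_clusters=2, n_init=10).fit(X)
print(model.labels_)  # the three 4G sentences should tend to share a label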

Another approach, "Word Mover's Distance", treats sentences as sets of word-vectors (without averaging them) and can perform sentence-to-sentence distance calculations that are more powerful than simple averaging, but it becomes very expensive to compute for longer sentences.
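With gensim's KeyedVectors this is one call per sentence pair; a rough sketch, reusing kv from above (wmdistance needs an optimal-transport backend installed, e.g. the POT package for gensim 4.x):

# Word Mover's Distance between tokenized sentences: lower = more similar.
doc_a = "4g is the fourth generation of mobile technology".split()
doc_b = "i bought a new mobile phone".split()
doc_c = "the cat sat on the mat".split()

print(kv.wmdistance(doc_a, doc_b))  # related sentences: smaller distance
print(kv.wmdistance(doc_a, doc_c))  # unrelated sentence: larger distance

To cluster with these distances, you could build a full pairwise distance matrix and feed it to an algorithm that accepts precomputed distances, such as scikit-learn's AgglomerativeClustering, though the cost grows quadratically with the number of sentences.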