Graph to connect sentences
I have a list of sentences on a few (two) topics, as shown below:
Sentences
Trump says that it is useful to win the next presidential election.
The Prime Minister suggests the name of the winner of the next presidential election.
In yesterday's conference, the Prime Minister said that it is very important to win the next presidential election.
The Chinese Minister is in London to discuss about climate change.
The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement.
The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration.
The president Donald Trump states that he wants to win the presidential election. He has the support of his electors.
As you can see, there are similarities between the sentences.
I am trying to relate multiple sentences and visualize their characteristics by using a (directed) graph. The graph is built from a similarity matrix, applying the row ordering of the sentences shown above.
I created a new column, Time, to show the order of the sentences, so the first row (Trump says ...) is at time 1, the second row (The Prime Minister suggests ...) is at time 2, and so on.
Like this:
Time Sentences
1 Trump said that it is useful to win the next presidential election.
2 The Prime Minister suggests the name of the winner of the next presidential election.
3 In today's conference, the Prime Minister said that it is very important to win the next presidential election.
...
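For reference, a minimal sketch of how such a Time column could be added with pandas (the sentence list and variable names here are illustrative, not part of my actual data pipeline):

import pandas as pd

sentences = [
    "Trump says that it is useful to win the next presidential election.",
    "The Prime Minister suggests the name of the winner of the next presidential election.",
    # ... the remaining sentences
]
df = pd.DataFrame({"Sentences": sentences})
# Time simply encodes the row order, starting at 1
df.insert(0, "Time", range(1, len(df) + 1))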
Then I would like to find the relationships in order to have a clear overview of the topics.
Multiple paths to a sentence would indicate that there are multiple pieces of information related to it.
To determine the similarity between two sentences, I tried to extract nouns and verbs as follows:
from nltk import pos_tag, word_tokenize

nouns = []
verbs = []
for index, row in df.iterrows():
    # pos_tag expects a list of tokens, so tokenize the sentence first
    tagged = pos_tag(word_tokenize(row[0]))
    nouns.append([word for word, pos in tagged if pos == 'NN'])
    verbs.append([word for word, pos in tagged if pos == 'VB'])
because they are the keywords in any sentence.
So when a keyword (noun or verb) appears in sentence x but not in the other sentences, it represents a difference between those two sentences.
However, I think a better approach might be to use word2vec or gensim (Word Mover's Distance, WMD).
This similarity has to be computed for every sentence.
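For context, a minimal sketch of the gensim/WMD idea, assuming a pretrained word2vec model is available locally (the file name below is only a placeholder, and WMD additionally needs the POT/pyemd package installed):

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess

# assumption: any model in word2vec format works; this file name is a placeholder
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

s1 = "Trump says that it is useful to win the next presidential election."
s2 = "The Prime Minister suggests the name of the winner of the next presidential election."

# Word Mover's Distance between the tokenized sentences (lower = more similar)
print(kv.wmdistance(simple_preprocess(s1), simple_preprocess(s2)))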
I would like to build a graph that shows the content of the sentences in the example above.
Since there are two topics (Trump and the Chinese Minister), I need to look for sub-topics within each of them; Trump, for example, has the sub-topic presidential election. A node in my graph should represent a sentence. The words in each node represent the differences between the sentences, i.e. the new information a sentence brings. For example, the word states in the sentence at time 5 is also in the adjacent sentences at times 6 and 7.
I would just like to find a way to get a result similar to the graph shown below. I have tried mainly noun and verb extraction, but that is probably not the right approach.
What I tried to do is take the sentence at time 1, compare it with the other sentences assigning a similarity score (using noun and verb extraction as well as word2vec), and repeat this for all the other sentences.
But my problem now is how to extract the differences in order to create a meaningful graph.
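In terms of the nouns/verbs lists built above, one simple reading of "difference" is a plain set difference of keywords; a sketch of that idea (the helper name is illustrative):

def new_information(i, j):
    # keywords of sentence j that do not occur in sentence i,
    # i.e. the "new information" that j adds with respect to i
    keywords_i = set(nouns[i]) | set(verbs[i])
    keywords_j = set(nouns[j]) | set(verbs[j])
    return keywords_j - keywords_i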
For the graph part, I would consider using networkx (a directed graph):
import networkx as nx
from pyvis.network import Network
G = nx.DiGraph()
N = Network(directed=True)
to show the direction of the relationships.
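A minimal sketch of that intent, assuming a dict sentences keyed by time step and a placeholder similarity function sim() (both are illustrative, not defined in the code above):

import networkx as nx

G = nx.DiGraph()
for t, text in sentences.items():
    G.add_node(t, text=text)
for i in sentences:
    for j in sentences:
        # point the edge from the earlier sentence to the later one
        if i < j and sim(sentences[i], sentences[j]) > 0.5:
            G.add_edge(i, j)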
I am providing a different example to make it clearer (it will also work fine if you use the previous example; sorry for the inconvenience, but since my first question was not very clear, I had to also provide a better and possibly simpler example).
No NLP for verb/noun separation has been implemented; a list of good words was simply added.
They can be extracted and normalized relatively easily with spacy (see the sketch after the example below).
Please note that walk occurs in sentences 1, 2 and 5 and forms a triad.
import re
import networkx as nx
import matplotlib.pyplot as plt

plt.style.use("ggplot")

sentences = [
    "I went out for a walk or walking.",
    "When I was walking, I saw a cat. ",
    "The cat was injured. ",
    "My mum's name is Marylin.",
    "While I was walking, I met John. ",
    "Nothing has happened.",
]

G = nx.Graph()

# set of possible good words
good_words = {"went", "walk", "cat", "walking"}

# remove punctuation and keep only the good words inside each sentence
words = list(
    map(
        lambda x: set(re.sub(r"[^\w\s]", "", x).lower().split()).intersection(
            good_words
        ),
        sentences,
    )
)

# convert sentences to a dict for further labeling
sentences = {k: v for k, v in enumerate(sentences)}

# add nodes
for i, sentence in sentences.items():
    G.add_node(i)

# add an edge whenever two sentences share a good word
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        for edge_label in words[i].intersection(words[j]):
            G.add_edge(i, j, r=edge_label)

# compute layout coords
coord = nx.spring_layout(G)
plt.figure(figsize=(20, 14))

# place node labels a bit above the nodes
node_label_coords = {}
for node, coords in coord.items():
    node_label_coords[node] = (coords[0], coords[1] + 0.04)

# draw the network
nodes = nx.draw_networkx_nodes(G, pos=coord)
edges = nx.draw_networkx_edges(G, pos=coord)
edge_labels = nx.draw_networkx_edge_labels(G, pos=coord)
node_labels = nx.draw_networkx_labels(G, pos=node_label_coords, labels=sentences)
plt.title("Sentences network")
plt.axis("off")
plt.show()
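As mentioned above, the hand-picked good_words set could instead be derived with spacy; a minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")

def good_words_of(sentence):
    # keep the lemmas of nouns and verbs only, e.g. "walking" -> "walk"
    return {tok.lemma_.lower() for tok in nlp(sentence) if tok.pos_ in {"NOUN", "VERB"}}

# would replace the good_words / words construction in the example above
words = [good_words_of(s) for s in sentences.values()]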
Update
If you want to measure the similarity between different sentences, you may want to compute the difference between sentence embeddings.
This gives you the chance to find semantic similarity between sentences that use different words, e.g. "A soccer game with multiple males playing" and "Some men are playing a sport". A nearly SoTA approach using BERT can be found here; simpler approaches are available as well.
Since you then have a similarity measure, just replace the add_edge block so that a new edge is added only when the similarity measure is greater than some threshold. The resulting edge-adding code will look like this:
# add an edge only if the two sentences are similar enough
threshold = 0.90
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        # suppose you have some similarity function using BERT or PCA
        similarity = check_similarity(sentences[i], sentences[j])
        if similarity > threshold:
            G.add_edge(i, j, r=similarity)
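One possible check_similarity, sketched with the sentence-transformers package (the specific model name is only an assumption; any sentence-embedding model would do):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def check_similarity(sentence_a, sentence_b):
    # cosine similarity of the two sentence embeddings, roughly in [-1, 1]
    emb = model.encode([sentence_a, sentence_b])
    return float(util.cos_sim(emb[0], emb[1]))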
One way to handle this is to tokenize, remove stop words and create a vocabulary, and then draw the graph based on this vocabulary. Below I show an example based on unigram tokens, but a better approach would be to identify phrases (ngrams) and use them as the vocabulary instead of unigrams. Similarly, the sentences/topics will be depicted graphically by the nodes (and corresponding sentences) with higher in- and out-degrees (see the small degree check after the example).
Sample:
from sklearn.feature_extraction.text import CountVectorizer
import networkx as nx
import matplotlib.pyplot as plt

corpus = [
    "Trump says that it is useful to win the next presidential election",
    "The Prime Minister suggests the name of the winner of the next presidential election",
    "In yesterday conference, the Prime Minister said that it is very important to win the next presidential election",
    "The Chinese Minister is in London to discuss about climate change",
    "The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement",
    "The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration",
    "The president Donald Trump states that he wants to win the presidential election. He has the support of his electors",
]

# build the vocabulary (stop words removed) and use it as the node set
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words="english")
vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
vocabulary = set(feature_names)

G = nx.DiGraph()
G.add_nodes_from(feature_names)

# for each sentence, connect consecutive vocabulary words with a directed edge
all_edges = []
for s in corpus:
    edges = []
    previous = None
    for w in s.split():
        # strip punctuation so that e.g. "election." matches the vocabulary
        w = w.lower().strip('.,')
        if w in vocabulary:
            if previous:
                edges.append((previous, w))
            previous = w
    G.add_edges_from(edges)  # keep the edges in the graph as well (useful for degree analysis)
    all_edges.append(edges)

plt.figure(figsize=(20, 20))
pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=500)
nx.draw_networkx_labels(G, pos)
# one edge colour per sentence
colors = ['r', 'g', 'b', 'y', 'm', 'c', 'k']
for i, edges in enumerate(all_edges):
    nx.draw_networkx_edges(G, pos, edgelist=edges, edge_color=colors[i], arrows=True)
plt.show()
Output: (plot of the resulting word network)
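Not part of the original snippet, but since the edges are now also stored in G, the most connected words (a rough proxy for the main topics and sub-topics) can be read off the node degrees:

# nodes with the highest total (in + out) degree are the most connected words
central = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)[:10]
print(central)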