How to find the shortest dependency path between two words in Python?
I am trying to find the dependency path between two words given a dependency tree in Python. In other words, given the sentence:
Robots in popular culture are there to remind us of the awesomeness of
unbound human agency.
I used practnlptools (https://github.com/biplab-iitb/practNLPTools) and got the following dependency parse result:
nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)
This can also be visualized as follows (image taken from https://demos.explosion.ai/displacy/).

The path length between "robots" and "are" is 1, and the path length between "robots" and "awesomeness" is 4.

My question is: given the dependency parse result above, how can I get the dependency path or the dependency path length between two words?

From my searching so far, would nltk's ParentedTree help?

Thanks!
Your problem can easily be conceived of as a graph problem where we have to find the shortest path between two nodes.

To convert your dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to go from this:
'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
to looking like this:
[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
This way, you can feed the list of tuples to the graph constructor from the networkx module, which will parse the list and build the graph for you, and also gives you a neat method that computes the length of the shortest path between two given nodes.

Necessary imports
import re
import networkx as nx
from practnlptools.tools import Annotator
How to get the string into the desired list-of-tuples format
annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
dp_list = dep_parse.split('\n')
pattern = re.compile(r'.+?\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
    m = pattern.search(dep)
    edges.append((m.group(1), m.group(2)))
How to build the graph
graph = nx.Graph(edges) # Well that was easy
How to compute the shortest path length
print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
This script will reveal that the shortest path given this dependency parse is actually of length 2, since you can get from Robots-1 to awesomeness-12 via remind-8:

1. xsubj(remind-8, Robots-1)
2. prep_of(remind-8, awesomeness-12)
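If you also want the nodes on that path, not just its length, networkx provides shortest_path as well; a minimal sketch, reusing the graph built above:

# Recover the actual nodes on the shortest path (assumes `graph` from above).
print(nx.shortest_path(graph, source='Robots-1', target='awesomeness-12'))
# ['Robots-1', 'remind-8', 'awesomeness-12']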
If you don't like this result, you might want to consider filtering out some dependencies, in this case not allowing the xsubj dependency to be added to the graph.
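A minimal sketch of such filtering, assuming a hypothetical `excluded` set of relation names and a regex that also captures the relation name:

# A sketch: capture the relation name too, and skip blacklisted relations.
# `excluded` is a hypothetical set of relation names we do not want as edges.
excluded = {'xsubj'}
rel_pattern = re.compile(r'(.+?)\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
    m = rel_pattern.search(dep)
    if m.group(1) not in excluded:
        edges.append((m.group(2), m.group(3)))
graph = nx.Graph(edges)

With xsubj removed, the shortest path from Robots-1 to awesomeness-12 runs through are-5 and remind-8 and has length 3.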
HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).
Sentence:
Robots in popular culture are there to remind us of the awesomeness of
unbound human agency.
Here is the code to find the shortest dependency path between two words:
import networkx as nx
import spacy
nlp = spacy.load('en')
# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
print('document: {0}'.format(document))
# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
    # FYI https://spacy.io/docs/api/token
    for child in token.children:
        edges.append(('{0}-{1}'.format(token.lower_, token.i),
                      '{0}-{1}'.format(child.lower_, child.i)))
graph = nx.Graph(edges)
# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
Output:
4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
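If you also want the dependency label of each token on the path, one option is to look the tokens back up in the spacy document; a minimal sketch, assuming the node names keep the 'text-index' format built above:

# A sketch: annotate each node on the path with its dependency label.
# Assumes `document` and `graph` from above; node names are '<lower text>-<index>'.
path = nx.shortest_path(graph, source='robots-0', target='awesomeness-11')
for node in path:
    index = int(node.rsplit('-', 1)[1])  # recover the token index
    print('{0}\t{1}'.format(node, document[index].dep_))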
To install spacy and networkx:
sudo pip install networkx
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB
Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/
This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence. It borrows quite some code from HugoMailhot's answer when using networkx.

Before running the code, one needs to:

- sudo pip install pycorenlp (the Python interface to Stanford CoreNLP)
- download Stanford CoreNLP

Start a Stanford CoreNLP server as follows:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
Then one can run the following code to find the shortest dependency path between two words:
import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint

nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))

def get_stanford_annotations(text, port=9000,
                             annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
    output = nlp.annotate(text, properties={
        'timeout': '10000',
        'ssplit.newlineIsSentenceBreak': 'two',
        'annotators': annotators,
        'outputFormat': 'json'
    })
    return output

# The code expects the document to contain exactly one sentence.
document = 'Robots in popular culture are there to remind us of the awesomeness of ' \
           'unbound human agency.'
print('document: {0}'.format(document))

# Parse the text
annotations = get_stanford_annotations(document, port=9000,
                                       annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']

# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
    edges.append((edge['governor'], edge['dependent']))
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))

# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
    if token1 == token['originalText']:
        token1_index = token['index']
    if token2 == token['originalText']:
        token2_index = token['index']
path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))

for token_id in path:
    token = tokens[token_id - 1]
    token_text = token['originalText']
    print('Node {0}\ttoken_text: {1}'.format(token_id, token_text))
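The dependencies dict built above is never actually used; a minimal sketch of how one could use it to print the dependency relation on each edge of the path, assuming the variables above are in scope and the key names follow CoreNLP's JSON output:

# A sketch: print the dependency relation for each consecutive node pair on the path.
# Assumes `path` and `dependencies` from the code above; 'dep', 'governorGloss' and
# 'dependentGloss' are keys in CoreNLP's JSON edge records.
for source, target in zip(path, path[1:]):
    edge = dependencies[(min(source, target), max(source, target))]
    print('{0} -{1}-> {2}'.format(edge['governorGloss'], edge['dep'],
                                  edge['dependentGloss']))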
The output is:
document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1 token_text: Robots
Node 5 token_text: are
Node 8 token_text: remind
Node 12 token_text: awesomeness
Note that one can test Stanford CoreNLP online: http://nlp.stanford.edu:8080/parser/index.jsp

This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.