Stanford typed dependencies using CoreNLP in Python
In the Stanford Dependency Manual they mention "Stanford typed dependencies", in particular the type neg - negation modifier. It is also available when using the Stanford enhanced++ parser on the website. For example, for the sentence
"Barack Obama was not born in Hawaii"
the parser does find neg(born, not).
But when I use the stanfordnlp python library, the only dependency parser I can get parses the sentence as follows:
('Barack', '5', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '5', 'aux:pass')
('not', '5', 'advmod')
('born', '0', 'root')
('in', '7', 'case')
('Hawaii', '5', 'obl')
And the code that generates it:
import stanfordnlp
stanfordnlp.download('en')
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")
a = doc.sentences[0]
a.print_dependencies()
Is there a way to get results similar to the enhanced dependency parser, or to any other Stanford parser that produces typed dependencies and would give me the negation modifier?
I think there might be a difference between the model used to generate the documentation's dependencies and the model that is available online, hence the discrepancy. I would raise the question directly with the stanfordnlp library maintainers via their GitHub issues.
Note that the python library stanfordnlp is not just a python wrapper for StanfordCoreNLP.
1. The difference between StanfordNLP and CoreNLP
The Stanford NLP Group's official Python NLP library. It contains
packages for running our latest fully neural pipeline from the CoNLL
2018 Shared Task and for accessing the Java Stanford CoreNLP server.
stanfordnlp contains a new set of neural network models, trained on the CoNLL 2018 Shared Task. The online parser is based on the CoreNLP 3.9.2 Java library. Those are two different pipelines and sets of models, as explained here.
Your code only accesses their neural pipeline trained on CoNLL 2018 data, which explains the difference you saw compared to the online version: they are essentially two different models.
What adds to the confusion, I believe, is that both repositories belong to the user named stanfordnlp (which is the team name). Don't be fooled between the Java stanfordnlp/CoreNLP and the Python stanfordnlp/stanfordnlp.
Concerning your 'neg' issue, it seems that in the python library stanfordnlp they decided to annotate negation with 'advmod' altogether. At least that is what I ran into for a few example sentences.
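For what it's worth, since the neural pipeline marks negation as advmod, one workaround is to post-filter its output for adverbial modifiers that are negation words. Below is a minimal sketch under that assumption; the NEG_WORDS list is illustrative rather than exhaustive, and governor / dependency_relation are the attribute names used by stanfordnlp 0.x (the renamed stanza package calls them head and deprel).
import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")

NEG_WORDS = {"not", "n't", "never", "no"}  # illustrative list, not exhaustive
for sentence in doc.sentences:
    for word in sentence.words:
        # treat an 'advmod' edge to a negation word as a negation modifier
        if word.dependency_relation == 'advmod' and word.text.lower() in NEG_WORDS:
            head = sentence.words[int(word.governor) - 1]  # governor is 1-based
            print('neg({}, {})'.format(head.text, word.text))
# prints: neg(born, not)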
2. Using CoreNLP through the stanfordnlp package
However, you can still get access to CoreNLP through the stanfordnlp package. It requires a few more steps, though. Quoting the GitHub repo,
There are a few initial setup steps.
- Download Stanford CoreNLP and models for the language you wish to use. (you can download CoreNLP and the language models here)
- Put the model jars in the distribution folder
- Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
Once that is done, you can start the client with the code that can be found in the demo:
from stanfordnlp.server import CoreNLPClient

# the demo assumes some input text; using the question's sentence here
text = "Barack Obama was not born in Hawaii."

with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)

    # get the tokens of the first sentence
    # note that 1 token is 1 node in the parse tree, nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token:
        print(token)
So if you specify the 'depparse' annotator (along with its prerequisite annotators tokenize, ssplit and pos), your sentence gets parsed. Reading the demo, it seems we can only access basicDependencies; I have not managed to get Enhanced++ dependencies working through stanfordnlp.
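That said, the annotated Sentence protobuf does expose an enhancedPlusPlusDependencies field next to basicDependencies (the field name comes from the CoreNLP protobuf definition), so something like the following might be worth trying, reusing the sentence object from the demo above; I have not verified it:
# untested: read the enhanced++ graph from the same annotated sentence
enhanced_parse = sentence.enhancedPlusPlusDependencies
print(enhanced_parse)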
But the negation still shows up if you use basicDependencies!
Here is the output I obtained with stanfordnlp and your example sentence. It is a DependencyGraph object, which is not pretty, but that is unfortunately always the case when we reach for the deep CoreNLP tools. You can see that between nodes 4 and 5 ('not' and 'born') there is an edge 'neg'.
node {
  sentenceIndex: 0
  index: 1
}
node {
  sentenceIndex: 0
  index: 2
}
node {
  sentenceIndex: 0
  index: 3
}
node {
  sentenceIndex: 0
  index: 4
}
node {
  sentenceIndex: 0
  index: 5
}
node {
  sentenceIndex: 0
  index: 6
}
node {
  sentenceIndex: 0
  index: 7
}
node {
  sentenceIndex: 0
  index: 8
}
edge {
  source: 2
  target: 1
  dep: "compound"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 2
  dep: "nsubjpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 3
  dep: "auxpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 4
  dep: "neg"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 7
  dep: "nmod"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 8
  dep: "punct"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 6
  dep: "case"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
root: 5
---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false
word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false
word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false
word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false
word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false
word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false
word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false
3. Using CoreNLP through the NLTK package
I won't go into detail on this one, but there is also a solution to access the CoreNLP server through the NLTK library, if everything else fails. It does output the negations, but it requires a little more work to start the servers. Details on this page.
Edit
I figured I might as well share with you the code to turn the DependencyGraph into a nice list of 'dependency, argument1, argument2', in a shape similar to the stanfordnlp output.
from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies

    # print(dir(sentence.token[0]))  # to find all the attributes and methods of a Token object
    # print(dir(dependency_parse))   # to find all the attributes and methods of a DependencyGraph object
    # print(dir(dependency_parse.edge))

    # get a dictionary associating each token/node with its label
    token_dict = {}
    for i in range(0, len(sentence.token)):
        token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word

    # get a list of the dependencies with the words they connect
    list_dep = []
    for i in range(0, len(dependency_parse.edge)):
        source_node = dependency_parse.edge[i].source
        source_name = token_dict[source_node]
        target_node = dependency_parse.edge[i].target
        target_name = token_dict[target_node]
        dep = dependency_parse.edge[i].dep
        list_dep.append((dep,
                         str(source_node) + '-' + source_name,
                         str(target_node) + '-' + target_name))
    print(list_dep)
The output looks like this:
[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]
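If you only want the negation, filtering that list is then a one-liner (reusing the list_dep variable built above):
# keep only the negation edges from the list built above
negations = [d for d in list_dep if d[0] == 'neg']
print(negations)  # [('neg', '5-born', '4-not')]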
The same client can also loop over every sentence of the annotation; the following variant does that and prints POS and NER tags as well (it assumes text holds the input, as above):
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    offset = 0  # keeps track of token offset for each sentence
    for sentence in ann.sentence:
        print('___________________')
        print('dependency parse:')
        # extract dependency parse
        dp = sentence.basicDependencies
        # build a helper dict to associate token index and label
        token_dict = {sentence.token[i].tokenEndIndex - offset: sentence.token[i].word
                      for i in range(0, len(sentence.token))}
        offset += len(sentence.token)
        # build list of (source, target) pairs
        out_parse = [(dp.edge[i].source, dp.edge[i].target) for i in range(0, len(dp.edge))]
        for source, target in out_parse:
            print(source, token_dict[source], '->', target, token_dict[target])
        print('\nTokens \t POS \t NER')
        for token in sentence.token:
            print(token.word, '\t', token.pos, '\t', token.ner)
The output for the first sentence looks like this:
___________________
dependency parse:
2 Obama -> 1 Barack
4 born -> 2 Obama
4 born -> 3 was
4 born -> 6 Hawaii
4 born -> 7 .
6 Hawaii -> 5 in
Tokens POS NER
Barack NNP PERSON
Obama NNP PERSON
was VBD O
born VBN O
in IN O
Hawaii NNP STATE_OR_PROVINCE
. . O
2021 update: run this code from the terminal; it will not work in a notebook because of some stdin compatibility issue.
import os
os.environ["CORENLP_HOME"] = "./stanford-corenlp-4.2.0"
import pandas as pd
from stanza.server import CoreNLPClient
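The snippet above only points the client at a local CoreNLP install; a minimal sketch of actually querying it with the renamed stanza client might look like the following (stanza's CoreNLPClient mirrors the old stanfordnlp client API; note that CoreNLP 4.x moved to UD v2, where the dedicated 'neg' relation was folded into 'advmod', so the labels can differ from 3.9.2):
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    ann = client.annotate("Barack Obama was not born in Hawaii.")
    sentence = ann.sentence[0]
    # map token indices to words, as in the snippets above
    token_dict = {t.tokenEndIndex: t.word for t in sentence.token}
    for edge in sentence.basicDependencies.edge:
        print(edge.dep, token_dict[edge.source], '->', token_dict[edge.target])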
Another option is spaCy (https://spacy.io/api/dependencyparser):
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_lg
import spacy

nlp = spacy.load('en_core_web_lg')

def printInfo(doc):
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_,
              token.shape_, token.is_alpha,
              token.is_stop, token.ent_type_, token.dep_, token.head.text)

doc = nlp("Barack Obama was not born in Hawaii")
printInfo(doc)
The output is:
Barack Barack PROPN NNP Xxxxx True False PERSON compound Obama
Obama Obama PROPN NNP Xxxxx True False PERSON nsubjpass born
was be AUX VBD xxx True True auxpass born
not not PART RB xxx True True neg born
born bear VERB VBN xxxx True False ROOT born
in in ADP IN xx True True prep born
Hawaii Hawaii PROPN NNP Xxxxx True False GPE pobj in
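Since en_core_web_lg uses a ClearNLP-style label set, the negation comes back directly as the neg relation, so extracting it from the Doc is straightforward (reusing the doc object from above):
# print the negation edges found by spaCy, in neg(head, modifier) form
for token in doc:
    if token.dep_ == 'neg':
        print('neg({}, {})'.format(token.head.text, token.text))
# prints: neg(born, not)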