Getting character positions in outputs of stanfordNLP in coreference resolution
I'm trying to use stanfordNLP for coreference resolution, as explained here. I'm running the code provided there:
from stanfordnlp.server import CoreNLPClient

text = 'Barack was born in Hawaii. His wife Michelle was born in Milan. He says that she is very smart.'
print(f"Input text: {text}")

# set up the client
client = CoreNLPClient(properties={'annotators': 'coref', 'coref.algorithm': 'statistical'}, timeout=60000, memory='16G')

# submit the request to the server
ann = client.annotate(text)

mychains = list()
chains = ann.corefChain
for chain in chains:
    mychain = list()
    # Loop through every mention of this chain
    for mention in chain.mention:
        # Get the sentence in which this mention is located, and get the words which are part of this mention
        # (we can have more than one word, for example, a mention can be a pronoun like "he", but also a compound noun like "His wife Michelle")
        words_list = ann.sentence[mention.sentenceIndex].token[mention.beginIndex:mention.endIndex]
        # build a string out of the words of this mention
        ment_word = ' '.join([x.word for x in words_list])
        mychain.append(ment_word)
    mychains.append(mychain)

for chain in mychains:
    print(' <-> '.join(chain))
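After installing the library:
pip3 install stanfordcorenlp
downloading the model,
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
and setting the $CORENLP_HOME variable,
os.environ['CORENLP_HOME'] = "path/to/stanford-corenlp-full-2018-10-05"
this code works fine for me. However, the output only contains token-level information, not character positions. For example, for the code above, the output is: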
Barack <-> His <-> He
His wife Michelle <-> she
Printing the variable mention inside the loop gives:
mentionID: 0
mentionType: "PROPER"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 0
position: 1
mentionID: 4
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 1
position: 3
mentionID: 5
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 2
position: 1
mentionID: 3
mentionType: "PROPER"
number: "SINGULAR"
gender: "FEMALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 3
headIndex: 2
sentenceIndex: 1
position: 2
mentionID: 6
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "FEMALE"
animacy: "ANIMATE"
beginIndex: 3
endIndex: 4
headIndex: 3
sentenceIndex: 2
position: 2
I also looked at other attributes. For example, printing ann.mentionsForCoref gives:
mentionType: "PROPER"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
person: "UNKNOWN"
startIndex: 0
endIndex: 1
headIndex: 0
headString: "barack"
nerString: "PERSON"
originalRef: 4294967295
goldCorefClusterID: -1
corefClusterID: 5
mentionNum: 0
sentNum: 0
utter: 0
paragraph: 1
isSubject: false
isDirectObject: true
isIndirectObject: false
isPrepositionObject: false
hasTwin: false
generic: false
isSingleton: false
hasBasicDependency: true
hasEnhancedDepenedncy: true
hasContextParseTree: true
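Although this attribute provides a lot of information, it says nothing about the character positions of the words. I could split the sentence on whitespace, but that is not a general solution and I think it would fail in some cases. Can anyone help?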
Try adding output_format='json' when you build the client. The JSON data should include character offset information for each token.
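As a rough sketch of how those offsets could be mapped onto coreference mentions: the function below walks a parsed JSON response and looks up the first and last token of each mention. It assumes the usual CoreNLP JSON layout, i.e. a "corefs" dict whose mentions carry 1-based "sentNum" and "startIndex" and an exclusive "endIndex", and tokens with "characterOffsetBegin" and "characterOffsetEnd". The function name mention_char_spans and the tiny hand-written doc are illustrative only, not server output, so check the keys against what your server actually returns.

```python
def mention_char_spans(doc):
    """Map each coref chain id to a list of (text, begin_char, end_char)."""
    spans = {}
    for chain_id, mentions in doc["corefs"].items():
        chain = []
        for m in mentions:
            # Assumption: sentNum and startIndex are 1-based, endIndex is exclusive.
            tokens = doc["sentences"][m["sentNum"] - 1]["tokens"]
            begin = tokens[m["startIndex"] - 1]["characterOffsetBegin"]
            end = tokens[m["endIndex"] - 2]["characterOffsetEnd"]
            chain.append((m["text"], begin, end))
        spans[chain_id] = chain
    return spans


# Hand-written stand-in for a JSON response, so the sketch runs without a server.
text = "Ann left. She ran."
doc = {
    "sentences": [
        {"tokens": [
            {"word": "Ann", "characterOffsetBegin": 0, "characterOffsetEnd": 3},
            {"word": "left", "characterOffsetBegin": 4, "characterOffsetEnd": 8},
            {"word": ".", "characterOffsetBegin": 8, "characterOffsetEnd": 9},
        ]},
        {"tokens": [
            {"word": "She", "characterOffsetBegin": 10, "characterOffsetEnd": 13},
            {"word": "ran", "characterOffsetBegin": 14, "characterOffsetEnd": 17},
            {"word": ".", "characterOffsetBegin": 17, "characterOffsetEnd": 18},
        ]},
    ],
    "corefs": {
        "1": [
            {"text": "Ann", "sentNum": 1, "startIndex": 1, "endIndex": 2},
            {"text": "She", "sentNum": 2, "startIndex": 1, "endIndex": 2},
        ]
    },
}

for chain_id, chain in mention_char_spans(doc).items():
    print(chain_id, chain)
```

Slicing the original text with each (begin_char, end_char) pair should give back the mention string, which is a cheap sanity check on the index conventions.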
There is information on using the client here:
https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
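The default protobuf response used in the question may also carry what is needed. As far as I know, the Token message in CoreNLP's protobuf schema has beginChar and endChar fields, so a mention's span would run from the beginChar of its first token to the endChar of its last. The sketch below makes that assumption, and the helper name chain_char_spans is made up for illustration. It is exercised on stand-in objects built with SimpleNamespace rather than a real server response, so verify the field names against your installed version.

```python
from types import SimpleNamespace as NS


def chain_char_spans(ann, text):
    """Return, per coref chain, a list of (begin_char, end_char, surface) tuples."""
    result = []
    for chain in ann.corefChain:
        spans = []
        for mention in chain.mention:
            tokens = ann.sentence[mention.sentenceIndex].token
            # Assumption: tokens expose beginChar/endChar as document-level offsets.
            begin = tokens[mention.beginIndex].beginChar
            end = tokens[mention.endIndex - 1].endChar  # endIndex is exclusive
            spans.append((begin, end, text[begin:end]))
        result.append(spans)
    return result


# Stand-in objects mimicking the shape of the protobuf Document (not real output).
text = "His wife Michelle says she is smart."
words = [("His", 0, 3), ("wife", 4, 8), ("Michelle", 9, 17), ("says", 18, 22),
         ("she", 23, 26), ("is", 27, 29), ("smart", 30, 35), (".", 35, 36)]
tokens = [NS(word=w, beginChar=b, endChar=e) for w, b, e in words]
mentions = [NS(sentenceIndex=0, beginIndex=0, endIndex=3),
            NS(sentenceIndex=0, beginIndex=4, endIndex=5)]
ann = NS(sentence=[NS(token=tokens)], corefChain=[NS(mention=mentions)])

print(chain_char_spans(ann, text))
```

With a real response, ann would be the object returned by client.annotate(text) in the question's code, and text the same input string.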