Stanfordnlp python - sentence split and other simple functionality

Question

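I am trying to split a string into sentences with the Stanford NLP parser. I used the sample code provided by Stanford NLP, but it gives me words instead of sentences.

Here is the sample input: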

"this is sample input. I want to split this text into a list of sentences. Please help"

Here is the output I want:

["this is sample input.", "I want to split this text into a list of sentences.", "Please help"]

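What I have tried:

NLTK sent_tokenize(): it does not split on newlines, and it does not seem as good as stanfordnlp
stanfordnlp: the splitting works well, but the sample output is not a list of sentences

I have heard there is an NLTK parser that uses the stanfordnlp library, but I could not find any example guide for it.

At this point I am confused, because there is hardly any thorough Python guide for Stanford NLP. This task has to be done in Python, because the other components of my research use Python to process the data. Please help! Thanks.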

Sample code:

import stanfordnlp

# `a` is the input string to split; it is defined elsewhere in my script
nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(a)
for i, sentence in enumerate(doc.sentences):
    print(f"====== Sentence {i+1} tokens =======")
    print(*[f"index: {token.index.rjust(3)}\ttoken: {token.text}" for token in sentence.tokens], sep='\n')
# doc.sentences is a list, so pick a sentence and a token before reading .text
print(doc.sentences[2].tokens[0].text)

Output:

====== Sentence 84 tokens =======
index:   1  token: Retweet
index:   2  token: 10
index:   3  token: Like
index:   4  token: 83
index:   5  token: End
index:   6  token: of
index:   7  token: conversation
index:   8  token: ©
index:   9  token: 2019
index:  10  token: Twitter
index:  11  token: About
index:  12  token: Help
index:  13  token: Center
index:  14  token: Terms
index:  15  token: Privacy
index:  16  token: policy
====== Sentence 85 tokens =======
index:   1  token: Cookies
index:   2  token: Ads
index:   3  token: info

Source: https://stanfordnlp.github.io/stanfordnlp/pipeline.html

Answer 1

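I would use a plain split('.'), but it fails when a sentence ends with ? or !. That calls for a regex, and even a regex can still treat ... inside a sentence as the ends of three sentences.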
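To illustrate the regex idea, here is a minimal sketch that uses only the standard library. The name split_sentences and the rule "a boundary is ., ! or ? followed by whitespace and an uppercase letter, or any newline" are my own assumptions, not something stanfordnlp does. It is a heuristic: it leaves a mid-sentence ... alone when the next word is lowercase, but it will still split wrongly after abbreviations such as "Dr. Smith".

```python
import re

# Hypothetical boundary rule (an assumption for this sketch):
#   - ., ! or ? followed by whitespace and then an uppercase letter, or
#   - one or more newlines.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])|\n+')


def split_sentences(text):
    """Split text into a list of sentences using the heuristic above."""
    parts = _BOUNDARY.split(text)
    # Drop empty pieces and surrounding whitespace left over from the split.
    return [part.strip() for part in parts if part.strip()]


text = ("this is ... sample input. I want to split this text into "
        "a list of sentences. Can you? Please help")
for sentence in split_sentences(text):
    print(sentence)
```

Because the lookahead requires an uppercase letter, a sentence that starts with a lowercase word or a digit will not be split off; loosening that check brings back the problem with ... described above.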

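With stanfordnlp, all I could do was join the words of each sentence, which gives the sentence as a single string. This simple approach adds a space before , . ? ! and similar punctuation.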

import stanfordnlp

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(text)

for i, sentence in enumerate(doc.sentences):
    sent = ' '.join(word.text for word in sentence.words)
    print(sent)

Result:

this is ... sample input .
I want to split this text into a list of sentences .
Can you ?
Please help
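The extra space before the punctuation can be removed after joining. The helper below is a sketch under my own assumptions: join_tokens is a hypothetical name, it takes a plain list of token strings, and it only handles the closing punctuation marks listed in the pattern (not quotes, brackets, or contractions).

```python
import re

# A space followed by a single punctuation mark that is itself followed by
# whitespace or the end of the string. The lookahead keeps " ..." untouched.
_SPACE_BEFORE_PUNCT = re.compile(r'\s+([,.?!;:])(?=\s|$)')


def join_tokens(tokens):
    """Join token strings with spaces, then glue closing punctuation back on."""
    sentence = ' '.join(tokens)
    return _SPACE_BEFORE_PUNCT.sub(r'\1', sentence)


print(join_tokens(["this", "is", "...", "sample", "input", "."]))
print(join_tokens(["Can", "you", "?"]))
```

With the pipeline above, the list of sentences would then be built as [join_tokens([word.text for word in sentence.words]) for sentence in doc.sentences]. This does not restore the original spacing exactly; it only undoes the most visible effect of the join.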

Perhaps the library's source code shows how it splits text into sentences, and that logic could be used directly.
