How to use the nltk sentence tokenizer with bulleted or listed data?

Question

I am using the nltk sentence tokenizer to get the sentences of a file.
But it fails badly when the file contains bulleted or listed data.

The code I am using is:

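import re
import nltk

with open(inputFile, 'r') as dataFile:
    fileContent = dataFile.read()
# Collapsing line breaks into spaces merges adjacent bullets into one run of text
fileContent = re.sub(r"\n+", " ", fileContent)
sentences = nltk.sent_tokenize(fileContent)
print(sentences)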

I want the sentence tokenizer to return each bullet as a separate sentence.

Can anyone help me with this? Thanks!

Edit 1:
Original PPT sample: http://pastebin.com/dbwKCESg
Processed PPT data: http://pastebin.com/0N64krKC

I will only receive the processed data file, and I need to run sentence tokenization on it.

Answer 1

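Your question is a bit unclear, but I tried your code and it does seem to fail when it tries to parse the bullets. I added a step that strips non-printable characters, and a find/replace that turns line breaks into periods. In my Python version, string.printable is: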

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~ \t\n\r\x0b\x0c

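The code below turns each bullet into a sentence while still splitting the sentences inside blocks of text. It will fail if a sentence in the input text has a line break in the middle of it; your sample input does not.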

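import re
import string
import nltk

with open(inputFile, 'r') as dataFile:
    fileContent = dataFile.read()
# Turn each run of line breaks into a period so that every bullet ends a sentence
fileContent = re.sub(r"\n+", ".", fileContent)
# Drop characters outside string.printable, such as the bullet glyphs
fileContentAscii = ''.join(ch for ch in fileContent if ch in string.printable)
sentences = nltk.sent_tokenize(fileContentAscii)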
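
If inserting periods feels fragile (a line that already ends in a period ends up with two), a variation on the same idea is to treat every line as its own unit and run the tokenizer inside each line. This is only a rough sketch, not something tested against your pastebin data: the function name bullet_lines_to_sentences and its sent_tokenize parameter are made up for illustration, and the list-marker regex is an assumption about what the bullets look like. It has the same limitation as above: a sentence with a line break in the middle gets split.

```python
import re
import string


def bullet_lines_to_sentences(text, sent_tokenize=None):
    """Treat each non-empty line as its own unit, then split sentences inside it."""
    if sent_tokenize is None:
        # Imported lazily; nltk.sent_tokenize needs the Punkt data to be downloaded.
        from nltk import sent_tokenize

    sentences = []
    for line in text.splitlines():
        # Drop characters outside string.printable, e.g. bullet glyphs.
        cleaned = ''.join(ch for ch in line if ch in string.printable)
        # Assumed marker styles: "-", "*", "1." or "1)" at the start of a line.
        cleaned = re.sub(r'^\s*(?:[-*]|\d+[.)])\s+', '', cleaned).strip()
        if cleaned:
            sentences.extend(sent_tokenize(cleaned))
    return sentences
```

With nltk installed you would call bullet_lines_to_sentences(fileContent) on the raw file content, before any newline substitution. Passing your own splitter as sent_tokenize lets you try the line handling without nltk.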

python

nlp

nltk