Use of PunktSentenceTokenizer in NLTK

I am learning Natural Language Processing with NLTK. I came across some code that uses PunktSentenceTokenizer, and I cannot understand its actual purpose in the given code. The code is:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A

tokenized = custom_sent_tokenizer.tokenize(sample_text)   #B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

So, why do we use PunktSentenceTokenizer? What is happening in the lines marked A and B? I mean, there is a training text and a sample text, but why are two data sets needed to get the part-of-speech tags?

The lines marked A and B are the ones I cannot understand.

PS: I did try looking at the NLTK book, but I could not understand what the real use of PunktSentenceTokenizer is.

PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of PunktSentenceTokenizer.

So if you initialize the tokenizer without any arguments, it defaults to the pre-trained version:

In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']

You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, which means you just train it on regular text.

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

For most cases, using the pre-trained version is perfectly fine. So you can simply initialize the tokenizer without providing any arguments.
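As a rough sketch of what the question's lines A and B amount to (assuming the pre-trained English model is adequate for the State of the Union speeches), you could drop the training step entirely and use the default sent_tokenize() instead:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import sent_tokenize

sample_text = state_union.raw("2006-GWBush.txt")

# Line A (training) is skipped; line B becomes a call to the
# pre-trained English Punkt model that ships with NLTK.
tokenized = sent_tokenize(sample_text)
print(tokenized[:2])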

那么"what all this has to do with POS tagging"? NLTK 词性标注器适用于标记化的句子,因此您需要先将文本分解为句子和单词标记,然后才能进行词性标记。

See NLTK's documentation.

[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"

PunktSentenceTokenizer is the class behind the default sentence tokenizer provided in NLTK, i.e. sent_tokenize(). It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79

Given a paragraph with multiple sentences, e.g.:

>>> from nltk import sent_tokenize
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> sent_tokenize(train_text[11])
['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
... 
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------

sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of languages with pre-trained models available in NLTK is:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README

Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
... 
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------
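Under the hood, sent_tokenize(text, language='german') is roughly equivalent to loading the corresponding pickle yourself (a sketch, assuming the punkt models have been downloaded via nltk.download('punkt')):

>>> import nltk
>>> tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')  # a PunktSentenceTokenizer instance
>>> tokenizer.tokenize(german_text)[0]
'Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.'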

To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and training data format for nltk punkt.
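As a minimal sketch of that training step (the file name my_punkt.pickle is made up for illustration; PunktTrainer is the underlying trainer class in nltk.tokenize.punkt), you can train incrementally on plain text and then build a tokenizer from the learned parameters:

import pickle
from nltk.corpus import state_union
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Plain text to learn sentence-boundary statistics from (unsupervised).
train_text = state_union.raw("2005-GWBush.txt")

trainer = PunktTrainer()
trainer.train(train_text, finalize=False)   # can be called repeatedly with more text
trainer.finalize_training()

# Build a tokenizer from the learned parameters and save it for reuse.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
with open("my_punkt.pickle", "wb") as f:
    pickle.dump(tokenizer, f)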

You can refer to the link below for a deeper understanding of how PunktSentenceTokenizer is used. It explains clearly why PunktSentenceTokenizer is used instead of sent_tokenize() in your case.

http://nlpforhackers.io/splitting-text-into-sentences/

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")

def process_content(corpus):
    # Tokenizer initialized without any training text
    tokenized = PunktSentenceTokenizer().tokenize(corpus)

    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content(train_text)

Even without training it on other text data, it works the same way as the pre-trained one.