Sentence tokenizer retrieve spans
I want to retrieve the spans of the basic nltk sentence tokenizer (I know it is possible with the PunktSentenceTokenizer (pst), but the basic tokenizer does a better job). Can the span_tokenize method be run with sent_tokenize?
from nltk import sent_tokenize
sentences = sent_tokenize(text)
Assuming you need word spans.
from nltk.tokenize import WhitespaceTokenizer as wt
from nltk import sent_tokenize
sentences = sent_tokenize("This is a sentence. This is another sentence. The sky is blue.")
print(list(wt().span_tokenize_sents(sentences)))
Output:
[[(0, 4), (5, 7), (8, 9), (10, 19)], [(0, 4), (5, 7), (8, 15), (16, 25)], [(0, 3), (4, 7), (8, 10), (11, 16)]]
See https://www.nltk.org/api/nltk.tokenize.html and search for span_tokenize_sents.
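Note that these spans are offsets within each individual sentence, not within the original string. A minimal sketch to sanity-check them, reusing the sentences list from the snippet above:

# Slice each sentence with its own word spans to recover the tokens.
for sentence, spans in zip(sentences, wt().span_tokenize_sents(sentences)):
    print([sentence[start:end] for start, end in spans])
# First line printed: ['This', 'is', 'a', 'sentence.']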
For sentence spans, you can use span_tokenize() from nltk.tokenize.punkt.PunktSentenceTokenizer: https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer.
The following code
from nltk.tokenize.punkt import PunktSentenceTokenizer as pt
full_text = "This is your text. You will split it into sentences. And get their spans."
spans = list(pt().span_tokenize(full_text))
print(spans)
will give you the output:
[(0, 18), (19, 52), (53, 73)]
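As a side note, sent_tokenize is itself a thin wrapper around a pre-trained Punkt model, so loading that model directly should give spans matching the default tokenizer's sentence splits. A sketch, assuming the classic English pickle resource name (newer NLTK releases may package this resource differently, e.g. as punkt_tab):

import nltk

# sent_tokenize loads a pre-trained PunktSentenceTokenizer under the hood;
# loading it directly exposes span_tokenize (the resource path is an assumption).
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
full_text = "This is your text. You will split it into sentences. And get their spans."
spans = list(tokenizer.span_tokenize(full_text))

# Slicing the original text with the spans recovers the sentences.
print([full_text[start:end] for start, end in spans])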