Does nltk word_tokenize return tokens in order?

Question

If I run the following code:

from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

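I get this output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

In this case, the tokens in the list appear in the same order as they do in the input sentence.

But are they always in the same order as in the input sentence?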

Answer 1

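Yes, they are always in the same order as in the input sentence.

The word_tokenize method linked below (PunktLanguageVars.word_tokenize) calls re.findall. The top-level nltk.tokenize.word_tokenize takes a different route, through a Treebank-style tokenizer that rewrites the text with regex substitutions and then splits on whitespace, which also preserves order. The re documentation describes re.findall as follows: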

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
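
To make the left-to-right guarantee concrete, here is a minimal sketch that uses only the standard re module. The pattern TOKEN_RE and the helper tokens_with_offsets are hypothetical names with a deliberately simplified pattern; this is not the regular expression NLTK uses. It only illustrates that findall and finditer yield matches in increasing position.

```python
import re

# Hypothetical, simplified token pattern (NOT NLTK's actual regex):
# a run of word characters, or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokens_with_offsets(text):
    """Return (token, start_offset) pairs in the order the regex engine finds them."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]


text = "God is Great! I won a lottery."
pairs = tokens_with_offsets(text)
tokens = [tok for tok, _ in pairs]
offsets = [start for _, start in pairs]

print(tokens)
# The start offsets are already sorted, so the tokens are in input order.
print(offsets == sorted(offsets))
# findall returns the same matches in the same order as finditer.
print(TOKEN_RE.findall(text) == tokens)
```

Because the string is scanned once from left to right and matches never overlap, each token starts after the previous one ends, so the list order always follows the input order.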

References:
https://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize (search the page for word_tokenize)
https://docs.python.org/3/library/re.html (search the page for findall)
https://docs.python.org/2/library/re.html (search the page for findall)
