Why does nltk word counting differ from word counting using a regex?
Question

We have two 'versions' of the same text from a txt file (https://www.gutenberg.org/files/2701/old/moby10b.txt):
raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
What I am missing is why nltk_text.vocab()['some_word'] returns less than len(re.findall(r'\b(some_word)\b', raw_text)).
Full code example
import nltk
import re

with open('moby.txt', 'r') as f:
    raw_text = f.read()

nltk_text = nltk.Text(nltk.word_tokenize(raw_text))

print(nltk_text.vocab()['whale'])                 # prints 782
print(len(re.findall(r'\b(whale)\b', raw_text)))  # prints 906
If you run
for word in nltk_text.vocab():
    if 'whale' in word.lower():
        print(word)
then you will see a long list of words such as
whale-ship
whale-lance
whale-fishery
right-whale
sperm-whale
which are not counted towards whale.

If you check them with the regex, you will see that they are counted as whale:
print(len(re.findall(r'\b(whale)\b', 'whale-hunter whale-lance whale-fishery right-whale sperm-whale')))
# prints 5
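To see the tokenizer side of this, here is a minimal sketch (the sample string is made up for illustration, not taken from the book). word_tokenize typically keeps hyphenated compounds as single tokens, so the exact token whale is counted less often, while \b treats - as a word boundary and matches inside the compounds.

import nltk
import re

# made-up sample string, only for illustration
sample = 'whale-ship whale-lance right-whale sperm-whale whale'

tokens = nltk.word_tokenize(sample)
print(tokens)                                    # hyphenated compounds typically stay as single tokens
print(nltk.Text(tokens).vocab()['whale'])        # counts only the exact token 'whale'
print(len(re.findall(r'\b(whale)\b', sample)))   # '-' acts as a \b boundary, so 'whale' matches inside the compounds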
EDIT:

Using this code, I found the few situations in which nltk and regex give different results:
import nltk
import re

with open('Pulpit/moby10b.txt') as f:
    raw_text = f.read()

# --- get all `whale` with few chars around (-4, +10)

word_length = len('whale')
words = []

# search first word at position 0
position = raw_text.find('whale', 0)

while position != -1:
    # get word (with few chars around)
    start = position - 4
    end = position + word_length + 10
    word = raw_text[start:end]

    # add word to list
    words.append(word)

    # search next word at position `position+1`
    position = raw_text.find('whale', position+1)

# --- test words with nltk and regex

for word in words:
    nltk_text = nltk.Text(nltk.word_tokenize(word))

    number_1 = nltk_text.vocab()['whale']
    number_2 = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', word))

    if number_1 != number_2:
        print(number_1, number_2, word)
        print('-----')
Result:
1 0 ite whale--did ye m
-----
1 0 ite whale--shirr! s
-----
1 0 erm
whale--squid or
-----
0 1 erm whale's
head em
-----
0 1 the whale's
Decapit
-----
0 1 the whale's
headlon
-----
0 1 the whale's
eyes ha
-----
1 0 EAD whale--even as
-----
0 1 the whale's
flukes
-----
1 0 one whale--as a sol
-----
0 1 the whale's
vocabul
-----
1 0 rst
whale--a boy-ha
-----
1 0 the whale--modifyin
-----
It shows two situations:

whale-- with a double -, which nltk counts but regex does not.

whale's\nhead with a \n between whale's and the next word head, which nltk does not count (but it does count when there is a space instead of the \n, or when there is a space after/before the \n), while regex counts it in every situation.
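A quick way to poke at these two cases is to tokenize tiny fragments directly. The fragments below are made up to mimic the output above, and the exact token lists can vary between NLTK versions, so the sketch just prints what each side sees:

import nltk
import re

# made-up fragments modelled on the output above
fragments = ['ite whale--did ye m', "erm whale's\nhead em", "erm whale's head em"]

for fragment in fragments:
    tokens = nltk.word_tokenize(fragment)
    nltk_count = nltk.Text(tokens).vocab()['whale']
    regex_count = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', fragment))
    print(repr(fragment))
    print(tokens, nltk_count, regex_count)
    print('-----')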
The main reason this happens is tokenization. A token is not always a word; it is an NLP concept that I will not go deeper into here. If you want to match exactly a word rather than a token, use wordpunct_tokenize instead of word_tokenize. Example code below.
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
nltk_text2 = nltk.Text(nltk.wordpunct_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #782
print(nltk_text2.vocab()['whale']) #906
print(len(re.findall(r'whale', raw_text))) #906
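For comparison on a small made-up sentence (not from the book), wordpunct_tokenize splits on every run of punctuation, so the hyphenated and possessive forms break apart and whale shows up as a bare token, in line with the plain regex count:

import nltk
import re

# made-up sentence, only for illustration
sample = "The sperm-whale and the whale's flukes; one whale--maybe two."

print(nltk.word_tokenize(sample))       # Treebank-style tokens
print(nltk.wordpunct_tokenize(sample))  # splits on punctuation runs, so 'whale' appears as a separate token
print(len(re.findall(r'whale', sample)))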
Suggested further reading here