Why does nltk word counting differ from word counting using a regex?

Question

We have two 'versions' of the same text from a txt file (https://www.gutenberg.org/files/2701/old/moby10b.txt):

What I am missing is why nltk_text.vocab()['some_word'] returns a smaller count than len(re.findall(r'\b(some_word)\b', raw_text)).

Full code example

import nltk
import re

with open('moby.txt', 'r') as f:
    raw_text = f.read()

nltk_text = nltk.Text(nltk.word_tokenize(raw_text))

print(nltk_text.vocab()['whale'])                     # prints 782
print(len(re.findall(r'\b(whale)\b', raw_text)))      # prints 906

If you run

for word in nltk_text.vocab():
    if 'whale' in word.lower():
        print(word)

then you will see a long list of words such as

whale-ship
whale-lance
whale-fishery
right-whale
sperm-whale

that are not counted as whale.

If you check them with the regex, you will see that they are counted as whale:

print(len(re.findall(r'\b(whale)\b', 'whale-hunter whale-lance whale-fishery right-whale sperm-whale'))) 

# prints 5
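For comparison, here is a minimal sketch of the nltk side of the same snippet (the exact tokens can vary between NLTK versions, so the expected output is indicative): word_tokenize keeps the hyphenated forms as single tokens, so none of them adds to the vocab count for whale.

import nltk

sample = 'whale-hunter whale-lance whale-fishery right-whale sperm-whale'
tokens = nltk.word_tokenize(sample)
print(tokens)                                # the hyphenated forms stay whole
print(nltk.Text(tokens).vocab()['whale'])    # expected: 0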

EDIT:

Using this code, I found the few situations in which nltk and regex give different results:

import nltk
import re

with open('Pulpit/moby10b.txt') as f:
    raw_text = f.read()

# --- get all `whale` with few chars around (-4, +10)

word_length = len('whale')
words = []

# search first word at position 0
position = raw_text.find('whale', 0)

while position != -1:
    # get word (with a few chars of context); clamp start so it can't go negative
    start = max(0, position - 4)
    end   = position + word_length + 10
    word  = raw_text[start:end]
    # add word to list
    words.append(word)
    # search for the next occurrence, starting at `position+1`
    position = raw_text.find('whale', position+1)

# --- test words with nltk and regex

for word in words:
    nltk_text = nltk.Text(nltk.word_tokenize(word))
    number_1 = nltk_text.vocab()['whale']
    number_2 = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', word))
    if number_1 != number_2:
        print(number_1, number_2, word)
        print('-----')

Results:

1 0 ite whale--did ye m
-----
1 0 ite whale--shirr! s
-----
1 0 erm
whale--squid or
-----
0 1 erm whale's
head em
-----
0 1 the whale's
Decapit
-----
0 1 the whale's
headlon
-----
0 1 the whale's
eyes ha
-----
1 0 EAD whale--even as 
-----
0 1 the whale's
flukes 
-----
1 0 one whale--as a sol
-----
0 1 the whale's
vocabul
-----
1 0 rst
whale--a boy-ha
-----
1 0 the whale--modifyin
-----

This shows two situations:

  1. whale-- (a double hyphen directly after whale)

     nltk counts it, but regex does not.

  2. whale's\nhead (a \n between whale's and the next word head)

     nltk does not count it (it does count it when there is a space instead of the \n, or when there is a space before/after the \n), but regex counts it in every situation. A sketch reproducing both cases follows this list.
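A minimal sketch reproducing both cases (tokenization details can differ between NLTK versions, so the printed tokens are whatever your installed version produces; the pattern's lookarounds simply exclude matches that touch a hyphen):

import nltk
import re

# (?<!-) and (?!-) reject matches directly preceded/followed by a hyphen
pattern = r'\b(?<!-)(whale)(?!-)\b'

for sample in ["ite whale--did ye m", "erm whale's\nhead em"]:
    tokens = nltk.word_tokenize(sample)
    print(repr(sample))
    print('tokens:', tokens)
    print('nltk:', nltk.Text(tokens).vocab()['whale'],
          'regex:', len(re.findall(pattern, sample)))
    print('-----')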

The main reason this happens is tokenization. A token is not always a word; it is an NLP concept whose details I won't go into here. If you want to match exactly a word, and not necessarily a token, use wordpunct_tokenize instead of word_tokenize. Sample code below:

nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
nltk_text2 = nltk.Text(nltk.wordpunct_tokenize(raw_text))

print(nltk_text.vocab()['whale'])                  # 782
print(nltk_text2.vocab()['whale'])                 # 906
print(len(re.findall(r'\b(whale)\b', raw_text)))   # 906
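The difference between the two tokenizers is easy to see on a single phrase: wordpunct_tokenize splits on all punctuation, so whale becomes a token of its own inside hyphenated and possessive forms (a quick sketch; the commented output is what recent NLTK versions produce):

import nltk

print(nltk.word_tokenize("sperm-whale whale's"))
# e.g. ['sperm-whale', 'whale', "'s"] -- the hyphenated compound stays one token

print(nltk.wordpunct_tokenize("sperm-whale whale's"))
# e.g. ['sperm', '-', 'whale', 'whale', "'", 's'] -- 'whale' is split out both times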

Further reading is suggested here.