Counting distinct words in a speech using tagset in nltk

Question

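I'm currently stuck on this problem.

My task is to implement a function that returns the sorted list of distinct words with a given part of speech. I need to use NLTK's pos_tag_sents and NLTK's tokenizer to count the specific words.

I had a similar problem before and got it working with help from other Stack Overflow users, so I'm trying to solve this one the same way.

Here is my code so far: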

import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

def pos_counts(text, pos_list):
    """Return the sorted list of distinct words with a given part of speech
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> pos_counts(emma, ['DET', 'NOUN'])
    [14352, 32029] - expected result
    """

    text = nltk.word_tokenize(text)
    tempword = nltk.pos_tag_sents(text, tagset="universal")
    counts = nltk.FreqDist(tempword)

    return [counts[x] or 0 for x in pos_list]

There is a doctest that should give the following result: [14352, 32029]

When I run my code, I get this error message:

Error
**********************************************************************
File "C:/Users/PycharmProjects/a1/a1.py", line 29, in a1.pos_counts
Failed example:
    pos_counts(emma, ['DET', 'NOUN'])
Exception raised:
    Traceback (most recent call last):
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.4\helpers\pycharm\docrunner.py", line 140, in __run
        compileflags, 1), test.globs)
      File "<doctest a1.pos_counts[1]>", line 1, in <module>
        pos_counts(emma, ['DET', 'NOUN'])
      File "C:/Users/PycharmProjects/a1/a1.py", line 35, in pos_counts
        counts = nltk.FreqDist(tempword)
      File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 108, in __init__
        Counter.__init__(self, samples)
      File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 535, in __init__
        self.update(*args, **kwds)
      File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 146, in update
        super(FreqDist, self).update(*args, **kwargs)
      File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 622, in update
        _count_elements(self, iterable)
    TypeError: unhashable type: 'list'

I feel like I'm getting close, but I can't figure out what I'm doing wrong.

Any help would be greatly appreciated. Thank you.

Answer 1

One way to do it is this:

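import nltk

def pos_count(text, pos_list):
    # Split the raw text into sentences, then each sentence into word tokens.
    sents = nltk.tokenize.sent_tokenize(text)
    words = (nltk.word_tokenize(sent) for sent in sents)
    # pos_tag_sents takes tokenized sentences and returns (word, tag) pairs per sentence.
    tagged = nltk.pos_tag_sents(words, tagset='universal')
    # Keep only the tag from each (word, tag) pair.
    tags = [tag[1] for sent in tagged for tag in sent]
    # Count only the tags that were asked for.
    counts = nltk.FreqDist(tag for tag in tags if tag in pos_list)
    return counts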

This is explained well in the NLTK book. Test:

In [3]: emma = nltk.corpus.gutenberg.raw('austen-emma.txt')

In [4]: pos_count(emma, ['DET', 'NOUN'])
Out[4]: FreqDist({'DET': 14352, 'NOUN': 32029})
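
As for the TypeError in the question: pos_tag_sents returns one list per sentence, so tempword is a list of lists, and FreqDist (which, per the traceback, hands its samples to collections.Counter) can only count hashable items. The sketch below reproduces that with hand-written data instead of the real tagger; the sample sentences are made up, and Counter is used as a stand-in for FreqDist so the snippet runs without any NLTK downloads.

```python
from collections import Counter

# Hypothetical hand-tagged data in the shape pos_tag_sents returns:
# one list per sentence, each holding (word, tag) tuples.
tagged = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("A", "DET"), ("cat", "NOUN")],
]

# Counting the per-sentence lists directly fails, because lists are unhashable.
try:
    Counter(tagged)
    error = None
except TypeError as exc:
    error = str(exc)

# Flattening to the tag strings first gives Counter hashable items to count.
tag_counts = Counter(tag for sent in tagged for _, tag in sent)
```

This is why the function above flattens the tagged sentences into a plain list of tags before building the FreqDist.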

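Edit: Using FreqDist is a good idea whenever you need to count things like part-of-speech tags. I don't think it is very wise to have a function return a plain list of results: in principle, how would you know which number belongs to which tag?

A possible (not good) solution is to return the FreqDist values sorted by tag name, so that the results follow the alphabetical order of the tags. If you really want this, replace return counts with return [item[1] for item in sorted(counts.items())] in the function definition above.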
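
If the list has to match the doctest in the question, another option is to return the counts in the order of pos_list rather than alphabetically, so the caller always knows which number belongs to which tag. Here is a minimal sketch of that idea, assuming the tags have already been extracted into a flat list. The name counts_in_order is hypothetical, the sample tags are made up, and collections.Counter stands in for FreqDist (which subclasses it) so the example runs without NLTK data.

```python
from collections import Counter

def counts_in_order(tags, pos_list):
    """Return the count of each tag in pos_list, in pos_list order."""
    wanted = set(pos_list)
    counts = Counter(tag for tag in tags if tag in wanted)
    # Counter returns 0 for missing keys, so absent tags need no special case.
    return [counts[pos] for pos in pos_list]

# Made-up tag sequence standing in for the tagger output.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]

result = counts_in_order(tags, ["DET", "NOUN"])
reordered = counts_in_order(tags, ["NOUN", "ADJ", "DET"])

# For comparison: the alphabetical variant, applied to all tags.
alphabetical = [n for _, n in sorted(Counter(tags).items())]
```

With the alphabetical variant the caller has to know which tags occurred in order to interpret the list; with pos_list order, position i always corresponds to pos_list[i].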
