Count frequency of multi-word terms in large texts with Python

Question

I have a dictionary of nearly a million multi-word terms (terms that contain spaces). It looks like this:

[..., 
'multilayer ceramic', 
'multilayer ceramic capacitor', 
'multilayer optical disk', 
'multilayer perceptron', 
...]

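I want to count how often they occur in many gigabytes of text.

As a small example, consider counting these four multi-word expressions in a Wikipedia page: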

payload = {'action': 'query', 'titles': 'Ceramic_capacitor', 'explaintext':1, 'prop':'extracts', 'format': 'json'}
r = requests.get('https://en.wikipedia.org/w/api.php', params=payload)
sampletext = r.json()['query']['pages']['9221221']['extract'].lower()
sampledict = ['multilayer ceramic', 'multilayer ceramic capacitor', 'multilayer optical disk', 'multilayer perceptron']

termfreqdic = {}
for term in sampledict:
    termfreqdic[term] = sampletext.count(term)
print(termfreqdic)

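This gives something like {'multilayer ceramic': 7, 'multilayer ceramic capacitor': 2, 'multilayer optical disk': 0, 'multilayer perceptron': 0}, but it scans the whole text once per term, which seems far from optimal when the dictionary contains a million entries.

I have tried a very large regular expression: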

termlist = [re.escape(w) for w in open('termlistfile.txt').read().strip().split('\n')]
termregex = re.compile(r'\b(?:' + '|'.join(termlist) + r')\b', re.I)  # group the alternatives so the last term also gets a closing \b
termfreqdic = {}
for i,li in enumerate(open(f)):
    for m in termregex.finditer(li):
        termfreqdic[m.group(0)]=termfreqdic.get(m.group(0),0)+1
open('counted.tsv','w').write('\n'.join([a+'\t'+str(v) for a,v in termfreqdic.items()]))  # counts are ints, so convert before joining

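This is far too slow (6 minutes for 1000 lines of text on a recent i7). If I use regex instead of re by replacing the first two lines, it drops to about 12 seconds per 1000 lines of text, which is still too slow for my needs: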

termlist = open(termlistfile).read().strip().split('\n')
termregex = regex.compile(r"\L<options>", options=termlist)
...

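Note that this doesn't do exactly what I want, because one term may be a sub-term of another, as with 'multilayer ceramic' and 'multilayer ceramic capacitor' (which also rules out the approach of tokenizing first).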
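The sub-term problem can be reproduced on a tiny input. The following is a minimal sketch using the standard `re` module rather than `regex`; the sample sentence is made up for illustration. An alternation returns at most one match per starting position, so the longer term is never reported once the shorter one has matched.

```python
import re

# Made-up sample sentence (an assumption, not from the original data).
text = "a multilayer ceramic capacitor is a capacitor"
terms = ["multilayer ceramic", "multilayer ceramic capacitor"]

# One big alternation: the first alternative that matches at a position wins,
# and scanning resumes after that match.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
found = [m.group(0) for m in pattern.finditer(text)]
print(found)

# str.count runs once per term, so it sees both the term and its sub-term.
per_term = {t: text.count(t) for t in terms}
print(per_term)
```

Reordering the alternatives would report the longer term instead, but still only one of the two.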

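This looks like a common sequence-matching problem, in text corpora or in genetic strings, that must have well-known solutions. Maybe it can be solved with a trie of words (I don't mind if the initial compilation of the term list is slow)? Alas, I don't seem to be searching with the right terms. Maybe someone can point me in the right direction?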

Answer 1

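The NLTK approach given below works relatively well. I could not reproduce the same sampledict, so for this exercise it was created from sampletext. Note: the approach given by the asker took about 60 times as long.

Source data: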

#Invoke libraries
import nltk
import requests
import timeit
import pandas as pd

#Source sample data
payload = {'action': 'query', 'titles': 'Ceramic_capacitor', 'explaintext':1, 'prop':'extracts', 'format': 'json'}
r = requests.get('https://en.wikipedia.org/w/api.php', params=payload)
sampletext = r.json()['query']['pages']['9221221']['extract'].lower()
sampledict = sampletext.split(' ')

Timing the old method:

start = timeit.default_timer()
termfreqdic = {}
for term in sampledict:
    termfreqdic[term] = sampletext.count(term)
stop = timeit.default_timer()
timetaken = stop-start
stop - start 
#0.42748349941757624

Timing the NLTK method:

start = timeit.default_timer()
wordFreq = nltk.FreqDist(sampledict)
stop = timeit.default_timer()
timetaken = stop-start
stop - start 
#0.00713308053673245

Access the data by converting the frequency distribution to a DataFrame:

wordFreqDf = pd.DataFrame(list(wordFreq.items()), columns = ["Word","Frequency"])

#Inspect data
wordFreqDf.head(10)

#output
#                     Word  Frequency
#0              60384-8/21          1
#1                 limited          2
#2                               3618
#3           comparatively          1
#4              code/month          1
#5                    four          1
#6   (microfarads):\n\nµ47          1
#7                consists          1
#8  α\n\t\t\n\t\t\n\n\n===          1
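As the output shows, the keys counted here are single space-separated tokens. The following minimal sketch makes that explicit with `collections.Counter`, which `nltk.FreqDist` subclasses; the sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample sentence (an assumption, not the Wikipedia extract).
text = "a multilayer ceramic capacitor is a ceramic capacitor"
tokens = text.split(" ")

# Counter tallies each token once per occurrence, in a single pass.
freq = Counter(tokens)

single = freq["ceramic"]             # a single-word key is counted
multi = freq["multilayer ceramic"]   # a multi-word key never appears among the tokens
print(single, multi)
```

The single pass is what makes this fast, but a term containing a space can never be a key unless the tokens are merged first.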

Answer 2

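@SidharthMacherla put me on the right track (NLTK and tokenization), although his solution does not address the problem of multi-word expressions, which, moreover, can overlap.

In short, the best method I found was to subclass NLTK's MWETokenizer and add a function for counting multi-words, using util.Trie: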

import re, regex, timeit
from nltk.tokenize import MWETokenizer
from nltk.util import Trie
from nltk.probability import FreqDist  # used by nltkmethod in the test suite below

class FreqMWETokenizer(MWETokenizer):
    """A tokenizer that processes tokenized text and merges multi-word expressions
    into single tokens.
    """

    def __init__(self, mwes=None, separator="_"):
        super().__init__(mwes, separator)

    def freqs(self, text):
        """
        :param text: A list containing tokenized text
        :type text: list(str)
        :return: A frequency dictionary with multi-words merged together as keys
        :rtype: dict
        :Example:
        >>> tokenizer = FreqMWETokenizer([ mw.split() for mw in ['multilayer ceramic', 'multilayer ceramic capacitor', 'ceramic capacitor']], separator=' ')
        >>> tokenizer.freqs("Gimme that multilayer ceramic capacitor please!".split())
        {'multilayer ceramic': 1, 'multilayer ceramic capacitor': 1, 'ceramic capacitor': 1}
        """
        i = 0
        n = len(text)
        result = {}

        while i < n:
            if text[i] in self._mwes:
                # possible MWE match
                j = i
                trie = self._mwes
                while j < n and text[j] in trie:
                    if Trie.LEAF in trie:
                        # success!
                        mw = self._separator.join(text[i:j])
                        result[mw]=result.get(mw,0)+1
                    trie = trie[text[j]]
                    j = j + 1
                else:
                    if Trie.LEAF in trie:
                        # success!
                        mw = self._separator.join(text[i:j])
                        result[mw]=result.get(mw,0)+1
                    i += 1
            else:
                i += 1

        return result

>>> tokenizer = FreqMWETokenizer([ mw.split() for mw in ['multilayer ceramic', 'multilayer ceramic capacitor', 'ceramic capacitor']], separator=' ')
>>> tokenizer.freqs("Gimme that multilayer ceramic capacitor please!".split())
{'multilayer ceramic': 1, 'multilayer ceramic capacitor': 1, 'ceramic capacitor': 1}
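The same idea can be sketched without NLTK, which may help to see what the trie walk does. This is a simplified illustration under my own assumptions, not the class above: the names `END`, `build_trie` and `count_terms` are hypothetical, and the trie is a plain nested dict keyed by word.

```python
from collections import Counter

END = object()  # hypothetical marker key: a complete term ends at this node


def build_trie(terms):
    """Build a nested-dict trie with one level per word of each term."""
    root = {}
    for term in terms:
        node = root
        for word in term.split():
            node = node.setdefault(word, {})
        node[END] = term
    return root


def count_terms(trie, tokens):
    """Count every term starting at every token position, overlaps included."""
    counts = Counter()
    n = len(tokens)
    for start in range(n):
        node = trie
        j = start
        # Walk down the trie for as long as the next token continues some term.
        while j < n and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                counts[node[END]] += 1
    return counts


terms = ["multilayer ceramic", "multilayer ceramic capacitor", "ceramic capacitor"]
tokens = "gimme that multilayer ceramic capacitor please".split()
result = count_terms(build_trie(terms), tokens)
print(dict(result))
```

Each start position costs at most as many steps as the longest term has words, so the running time depends on the text length and the maximum term length rather than on the number of terms.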

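Here is the test suite with speed measurements:

Counting 10k multi-word terms in 10M characters takes about 2 seconds with FreqMWETokenizer, 4 seconds with MWETokenizer (which also provides a complete tokenization but does not count overlaps), 150 seconds with the simple count method, and 1000 seconds with one big regex. Trying 100k multi-word terms in 100M characters remains feasible with the tokenizers, but not with count or regex.

For testing, the two large sample files are available at https://mega.nz/file/PsVVWSzA#5-OHy-L7SO6fzsByiJzeBnAbtJKRVy95YFdjeF_7yxA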


def freqtokenizer(thissampledict, thissampletext):
    """
    This method uses the above FreqMWETokenizer's function freqs.
    It captures overlapping multi-words

    counting 1000 terms in 1000000 characters took 0.3222855870008061 seconds. found 0 terms from the list.
    counting 10000 terms in 10000000 characters took 2.5309120759993675 seconds. found 21 terms from the list.
    counting 100000 terms in 29467534 characters took 10.57763242800138 seconds. found 956 terms from the list.
    counting 743274 terms in 29467534 characters took 25.613067482998304 seconds. found 10411 terms from the list.
    """
    tokenizer = FreqMWETokenizer([mw.split() for mw in thissampledict], separator=' ')
    thissampletext = re.sub('  +', ' ', re.sub(r"[^\s\w/\-']+", ' ', thissampletext))  # replace punctuation except / - ' _ with spaces
    freqs = tokenizer.freqs(thissampletext.split())
    return freqs


def nltkmethod(thissampledict, thissampletext):
    """ This function first produces a tokenization by means of MWETokenizer.
    This takes the biggest matching multi-word, no overlaps.
    They could be computed separately on the dictionary.

    counting 1000 terms in 1000000 characters took 0.34804968100070255 seconds. found 0 terms from the list.
    counting 10000 terms in 10000000 characters took 3.9042628339993826 seconds. found 20 terms from the list.
    counting 100000 terms in 29467534 characters took 12.782784996001283 seconds. found 942 terms from the list.
    counting 743274 terms in 29467534 characters took 28.684293715999956 seconds. found 9964 terms from the list.

    """
    termfreqdic = {}
    tokenizer = MWETokenizer([mw.split() for mw in thissampledict], separator=' ')
    thissampletext = re.sub('  +', ' ', re.sub(r"[^\s\w/\-']+", ' ', thissampletext))  # replace punctuation except / - ' _ with spaces
    tokens = tokenizer.tokenize(thissampletext.split())
    freqdist = FreqDist(tokens)
    termsfound = set([t for t in freqdist.keys()]) & set(thissampledict)
    for t in termsfound:termfreqdic[t]=freqdist[t]  
    return termfreqdic

def countmethod(thissampledict, thissampletext):
    """
    counting 1000 in 1000000 took 0.9351876619912218 seconds.
    counting 10000 in 10000000 took 91.92642056700424 seconds.
    counting 100000 in 29467534 took 3185.7411157219904 seconds.
    """
    termfreqdic = {}
    for term in thissampledict:
        termfreqdic[term] = thissampletext.count(term)
    return termfreqdic

def regexmethod(thissampledict, thissampletext):
    """
    counting 1000 terms in 1000000 characters took 2.298602456023218 seconds.
    counting 10000 terms in 10000000 characters took 395.46084802100086 seconds.
    counting 100000: impossible
    """
    termfreqdic = {}
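    # escape each term and group the alternatives so every term gets both word boundaries
    termregex = re.compile(r'\b(?:' + '|'.join(re.escape(t) for t in thissampledict) + r')\b')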
    for m in termregex.finditer(thissampletext):
        termfreqdic[m.group(0)]=termfreqdic.get(m.group(0),0)+1
    return termfreqdic

def timing():
    """
    for testing, find the two large sample files at
    https://mega.nz/file/PsVVWSzA#5-OHy-L7SO6fzsByiJzeBnAbtJKRVy95YFdjeF_7yxA
    """
    sampletext=open("G06K0019000000.txt").read().lower()
    sampledict=open("manyterms.lower.txt").read().strip().split('\n')
    print(len(sampletext),'characters',len(sampledict),'terms')

    for i in range(4):
        for f in [freqtokenizer, nltkmethod, countmethod, regexmethod]:
            start = timeit.default_timer()
            thissampledict = sampledict[:1000*10**i] 
            thissampletext = sampletext[:1000000*10**i]

            termfreqdic = f(thissampledict, thissampletext)
            #termfreqdic = countmethod(thissampledict, thissampletext)
            #termfreqdic = regexmethod(thissampledict, thissampletext)
            #termfreqdic = nltkmethod(thissampledict, thissampletext)
            #termfreqdic = freqtokenizer(thissampledict, thissampletext)

            print('{f} counting {terms} terms in {characters} characters took {seconds} seconds. found {found} terms from the list.'.format(f=f.__name__, terms=len(thissampledict), characters=len(thissampletext), seconds=timeit.default_timer()-start, found=sum(1 for v in termfreqdic.values() if v)))

timing()
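The nltkmethod docstring above notes that the overlaps it misses "could be computed separately on the dictionary". One possible reading of that remark is sketched below; it is my own assumption, not code from the test suite, and the function name `add_subterm_counts` and the sample counts are hypothetical. Every dictionary term contained in a merged token is credited with that token's count.

```python
from collections import Counter


def add_subterm_counts(token_counts, terms):
    """Credit each dictionary term found inside a longest-match token."""
    term_set = set(terms)
    totals = Counter()
    for token, count in token_counts.items():
        words = token.split(" ")
        # Try every contiguous word span of the token, the whole token included.
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                candidate = " ".join(words[i:j])
                if candidate in term_set:
                    totals[candidate] += count
    return totals


terms = ["multilayer ceramic", "multilayer ceramic capacitor", "ceramic capacitor"]
# Hypothetical longest-match token counts, e.g. taken from a FreqDist of merged tokens.
longest = {"multilayer ceramic capacitor": 2, "multilayer ceramic": 5, "the": 40}
totals = add_subterm_counts(longest, terms)
print(dict(totals))
```

This only recovers sub-terms that lie inside a single merged token. A term that straddles two merged tokens is still missed, which the freqs method avoids by restarting the trie walk at every token.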
