Building a Word Counter for Analysis

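I am trying to build a Python program similar to wordcounter.net (https://wordcounter.net/). I have an Excel file in which one column contains the text to be analyzed. Using pandas and other functions, I created a word frequency counter.

But now I need to extend it to find patterns.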

For example, suppose a text contains "Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow".

Here it should be able to track patterns such as 2 word density:

.....

3 word density:

.....

I have also tried:

for match in re.finditer(pattern, line):

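But with this the pattern has to be supplied by hand, and I would like the program to find the patterns automatically.

Can anyone help with this?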

text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'

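# Count how often each word occurs.
d = {}
for s in text.split():
    d.setdefault(s, 0)
    d[s] += 1
# Group the words by their count.
out = {}
for k, v in d.items():
    out.setdefault(v, []).append(k)
# Print the groups, highest count first.
for i in sorted(out.keys(), reverse=True):
    print(f'{i} word density:')
    print(f'\t{out[i]}')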

Output

5 word density:
    ['face']
3 word density:
    ['mellow']
2 word density:
    ['Happy', 'sad']
1 word density:
    ['little', 'baby', 'sweet']
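
The same grouping can probably be written more compactly with collections.Counter from the standard library. This is only a sketch of an alternative, not a change to the code above; the names counts and by_count are made up for illustration, and the printed label is reworded to say "occurrences".

```python
from collections import Counter, defaultdict

text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'

# Count each word; Counter is a dict subclass made for tallying.
counts = Counter(text.split())

# Group the words by how many times they occur.
by_count = defaultdict(list)
for word, n in counts.items():
    by_count[n].append(word)

# Print the groups, highest count first.
for n in sorted(by_count, reverse=True):
    print(f'{n} occurrences: {by_count[n]}')
```

If only a ranking is needed, counts.most_common() returns the (word, count) pairs already sorted by count.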

Edit 2

from collections import Counter


def freq(lst, n):
    # Build every run of n consecutive words (the n-grams).
    lstn = []
    for i in range(len(lst) - (n - 1)):
        lstn.append(" ".join(lst[i:i + n]))
    # Count each n-gram; Counter keeps them in first-seen order.
    out = Counter(lstn)
    print(f'{n} word density:')
    for k, v in out.items():
        print(f'\t"{k}" {v}')


text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'
lst = text.split()

freq(lst, 2)
freq(lst, 3)

Output

2 word density:
    "Happy face" 2
    "face sad" 1
    "sad face" 2
    "face mellow" 3
    "mellow little" 1
    "little baby" 1
    "baby sweet" 1
    "sweet Happy" 1
    "face face" 1
    "mellow sad" 1
3 word density:
    "Happy face sad" 1
    "face sad face" 1
    "sad face mellow" 2
    "face mellow little" 1
    "mellow little baby" 1
    "little baby sweet" 1
    "baby sweet Happy" 1
    "sweet Happy face" 1
    "Happy face face" 1
    "face face mellow" 1
    "face mellow sad" 1
    "mellow sad face" 1
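
To get closer to what the question describes (text coming from one column of an Excel file, with the most frequent patterns listed first), the counting could be wrapped roughly as below. This is a sketch under assumptions: ngram_density and its parameters are hypothetical names, each row is counted separately so that no n-gram spans two cells, and the percentage is simply the count divided by the total number of n-grams, which may not match how wordcounter.net defines density.

```python
from collections import Counter


def ngram_density(texts, n, top=None):
    """Return (ngram, count, percent) tuples, most frequent first.

    texts is any iterable of strings, for example the values of one
    DataFrame column with missing cells dropped (an assumption; a
    plain list works the same way).
    """
    counts = Counter()
    for text in texts:
        words = text.split()
        # Zipping n shifted views yields each run of n consecutive words.
        shifted = (words[i:] for i in range(n))
        counts.update(" ".join(gram) for gram in zip(*shifted))
    total = sum(counts.values())
    return [(gram, c, round(100 * c / total, 1))
            for gram, c in counts.most_common(top)]


# Two rows standing in for two cells of the text column.
rows = ['Happy face sad face mellow little baby sweet',
        'Happy face face mellow sad face mellow']
for gram, c, pct in ngram_density(rows, 2, top=3):
    print(f'"{gram}" {c} ({pct}%)')
```

Because the rows are counted separately, "sweet Happy" does not appear here, unlike in the single-string output above.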