How to get the most common phrases or words in Python or R
Given some text, how can I get the most common n-grams for n = 1 through 6?
I have seen ways to get 3-grams or 2-grams, one n at a time, but is there any way to extract the longest meaningful phrases first, and then all the remaining phrases as well?
For example, in this text, used purely for demonstration:
fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.
The ideal result, n-grams with their counts, would be:
fri evening commute: 3,
off-peak: 2,
rest of the words: 1
Any suggestions are appreciated. Thanks.
Python
Consider the NLTK library, which provides an ngrams function that you can use to iterate over values of n.
A rough implementation would be along the following lines, where rough is the keyword here:
from nltk import ngrams
from collections import Counter

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'

# Periods are not considered, and hyphenated words are treated as phrases,
# so strip the periods and split hyphenated words into separate tokens.
sentence = sentence.replace('.', '').replace('-', ' ')

# Try the longest possible n-gram first, then shrink n. Whenever the most
# common n-gram of the current length repeats, record it and remove it from
# the sentence so shorter n-grams do not double-count its words.
for n in range(len(sentence.split()), 1, -1):
    phrases = []
    for token in ngrams(sentence.split(), n):
        phrases.append(' '.join(token))
    if not phrases:
        # the sentence has shrunk below n words after an earlier removal
        continue
    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        result.append((phrase, freq))  # store the count, not the n-gram length
        sentence = sentence.replace(phrase, '')

for phrase, freq in result:
    print('%s: %d' % (phrase, freq))
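With the count fix above, running this on the sample sentence prints the two repeated phrases. Note that 'fri evening commute' occurs twice in the sample text, and the hyphen in 'off-peak' was split:

fri evening commute: 2
off peak: 2

If you only want the raw counts of every n-gram for n = 1 through 6 in one pass, without the longest-phrase-first removal, NLTK's everygrams helper can be combined with Counter. A minimal sketch, assuming simple whitespace tokenization:

from nltk import everygrams
from collections import Counter

text = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
tokens = text.replace('.', '').replace('-', ' ').split()

# everygrams yields all n-grams with min_len <= n <= max_len in a single pass
counts = Counter(' '.join(g) for g in everygrams(tokens, min_len=1, max_len=6))

for phrase, freq in counts.most_common(10):
    print('%s: %d' % (phrase, freq))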
As for R
If you plan to use R, I would suggest this: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html