How do I count n-gram occurrences in many lists?
Does anyone know whether, given a vocabulary of n-grams, it is possible to count how many times each of them occurs in several different lists of tokens? The vocabulary consists of the n-grams from the lists, with each unique n-gram listed once. Say I have:

Lists
['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'] //1
['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'] //2
<type = list>
N-gram vocabulary
('hello','I')
('I', 'am')
('am', 'doing')
('doing', 'okay')
('okay','are')
('hello', 'how')
('how', 'are')
('are','you')
('you', 'doing')
('doing', 'today')
('today', 'are')
('you', 'okay')
<type = tuples>
Then I'd like the output to be something like:

List 1:
('hello', 'how')1
('how', 'are')1
('are','you')2
('you', 'doing')1
('doing', 'today')1
('today', 'are')1
('you', 'okay')1
List 2:
('hello','I')1
('I', 'am')1
('am', 'doing')1
('doing', 'okay')1
('okay','are')1
('are','you')1
('you', 'okay')1
I have the following code:
import nltk
from nltk import word_tokenize, bigrams
from nltk.corpus import stopwords

test_tokenized = [word_tokenize(i) for i in test_lower]
for test_toke in test_tokenized:
    filtered_words = [word for word in test_toke if word not in stopwords.words('english')]
    bigram = bigrams(filtered_words)
    fdist = nltk.FreqDist(bigram)  # nltk.FeatDict does not exist; FreqDist is the frequency counter
    for k, v in fdist.items():
        # print(k, v)
        occur = (k, v)
I would suggest a for loop over a range:
from collections import Counter
list1 = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
list2 = ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay']
def ngram(li):
    result = []
    for i in range(len(li)-1):
        result.append((li[i], li[i+1]))
    return Counter(result)

print(ngram(list1))
print(ngram(list2))
Generate the n-grams with a list comprehension and count the duplicates with collections.Counter:
from collections import Counter
l = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
ngrams = [(l[i],l[i+1]) for i in range(len(l)-1)]
print(Counter(ngrams))
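Since the question asks for counts restricted to a fixed n-gram vocabulary, reported separately per list, here is a minimal sketch combining Counter with that vocabulary. The token lists and the bigram vocabulary are copied from the question; the helper name bigram_counts is my own.

```python
from collections import Counter

lists = [
    ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'],
    ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'],
]

# The vocabulary of unique bigrams from the question
vocab = [
    ('hello', 'I'), ('I', 'am'), ('am', 'doing'), ('doing', 'okay'),
    ('okay', 'are'), ('hello', 'how'), ('how', 'are'), ('are', 'you'),
    ('you', 'doing'), ('doing', 'today'), ('today', 'are'), ('you', 'okay'),
]

def bigram_counts(tokens):
    """Count the adjacent-pair bigrams in one token list."""
    return Counter(zip(tokens, tokens[1:]))

for n, tokens in enumerate(lists, 1):
    counts = bigram_counts(tokens)
    print('List %d:' % n)
    for gram in vocab:
        if counts[gram]:  # report only vocabulary bigrams that occur in this list
            print(gram, counts[gram])
```

For the first list this prints each vocabulary bigram that occurs, with ('are', 'you') counted twice, matching the desired output above.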