Frequency of ngrams (strings) in tokenized text
I have a set of unique ngrams (a list called ngramlist) and ngram-tokenized text (a list called ngrams). I want to build a new vector, freqlist, where each element of freqlist is the fraction of ngrams that are equal to that element of ngramlist. I wrote the following code, which gives the correct output, but I'm wondering whether there is a way to optimize it:
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
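For illustration, here is a minimal sketch of the intended behavior on toy data (the two lists below are made up for this example):

ngrams = ['a b c', 'b c d', 'a b c', 'c d e']   # ngram-tokenized text
ngramlist = ['a b c', 'c d e', 'x y z']         # unique ngrams of interest

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
print(freqlist)  # [0.5, 0.25, 0.0]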
I imagine there is a function in nltk or elsewhere that does this faster, but I'm not sure which one.
Thanks!
Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made up of all the ngrams found.
Edit 2: Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):
from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City', 'Moscow', 'Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()
data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))
df = pd.DataFrame(data)
df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])
ngramlist = list(set(trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist))
df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])
You can optimize this a little by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements of ngramlist are contained in ngrams.
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every time you check an ngram. A single pass over ngrams makes this algorithm O(n) instead of the O(n²) you have now. Keep in mind that shorter code is not necessarily better or more efficient code:
from collections import Counter
...
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
To use this properly, you have to write a def function instead of a lambda:
def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
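As a quick sanity check, the Counter version gives the same result on the toy data from above (again, made-up lists):

ngramlist = ['a b c', 'c d e', 'x y z']
print(count_ngrams(['a b c', 'b c d', 'a b c', 'c d e']))
# [0.5, 0.25, 0.0]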
First of all, don't pollute imported functions by overwriting them and using them as variables. Keep the name ngrams for the function, and use something else for your variables.
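For instance, here is a minimal sketch of the kind of bug that shadowing invites (the token list is made up):

from nltk.util import ngrams

tokens = ['New', 'York', 'City', 'is', 'big']
ngrams = [' '.join(t) for t in ngrams(tokens, 3)]  # rebinds the name: ngrams is now a list

# Any later call fails, because the function has been shadowed:
# ngrams(tokens, 2)  # TypeError: 'list' object is not callable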
import time
from functools import partial
from itertools import chain
from collections import Counter
import wikipedia
import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams
Next, the steps before the line you asked about in the original question may be a bit inefficient. You can clean them up, make them easier to read, and measure them:
# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
Then:
# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:
# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')
And finally, on to the line you asked about:
# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see
all_trigrams = Counter(chain(*df['trigrams']))
# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)
# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count / total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
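If you still need a dense freqlist aligned to a fixed ngramlist, as in the original question, the sparse dicts convert back easily. A minimal sketch, assuming you build ngramlist from the keys of all_trigrams (the names here are illustrative):

# Hypothetical: fix an ordering over all observed trigrams.
ngramlist = list(all_trigrams)

# Expand each sparse dict into a dense vector of relative frequencies.
df['freqlist'] = df['trigrams_count_normalize'].apply(
    lambda sparse: [sparse.get(ng, 0.0) for ng in ngramlist])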