Q: How to plot n-gram positions in text
I'm trying to plot where and how often n-grams recur in a text. The idea is to identify the point in a text at which an author starts reusing terms; some genres should have shorter uniqueness spans than others.

Words 1...n go on the X axis. Whenever an n-gram's running frequency becomes > 1, a point appears on the plot, where X is its position, Y is the frequency, and the color identifies the unique n-gram. With the code below, the 2-gram "good sport" would be plotted as (7, 2, RED).

Q: How do I create np.arrays of 1. the unique n-grams, 2. their frequencies, and 3. their positions in the text?
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import re

words = "good day, good night, good sport, good sport charlie"
clean = re.sub(r"[^\w\d'\s]+", '', words)
vectorizer2 = CountVectorizer(ngram_range=(2, 2), tokenizer=word_tokenize, stop_words='english')
analyzer = vectorizer2.build_analyzer()
two_grams = analyzer(clean)
# Get the list of unique 2-grams.
uniques = []
for word in two_grams:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                  # initialize the count to zero
    for word in two_grams:     # iterate over the 2-grams
        if word == unique:     # does this 2-gram match the current unique?
            count += 1         # if so, increment the count
    counts.append((count, unique))

counts.sort(reverse=True)      # highest counts first

# Print the ten 2-grams with the highest counts.
for count, word in counts[:10]:
    print('%s %d' % (word, count))
# Scatterplot
# plt.scatter(count, count, s=area, c=colors, alpha=0.5)
# plt.show()
I think it's better to start with a simple, deliberate algorithm before reaching for scikit-learn. If this were my problem, I would walk through the characters of the string until the first character matches, then:
ngram = "good sport"
words = "good day, good night, good sport, good sport charlie"
loc = []
k = 0
while k < len(words):
    if words[k] != ngram[0]:
        k += 1
    elif words[k: k + len(ngram)] == ngram:
        loc += [k]
        k += 1
    else:
        k += 1
This returns loc = [22, 34]. Any list can be turned into an array, e.g. L = np.array(loc). Nobody claims this is efficient or fail-proof, but it is better to understand what you are coding before handing it off to scikit-learn.
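Putting the pieces together, here is a minimal sketch (my own illustration, not part of the answer above) that builds the three arrays the question asks for: the unique 2-grams, their running frequencies, and their word positions. It uses word positions rather than character offsets, tokenizes with a plain split, and all variable names are illustrative.

```python
import re
import numpy as np

words = "good day, good night, good sport, good sport charlie"
tokens = re.sub(r"[^\w\d'\s]+", "", words).split()

# Every 2-gram, indexed by the word position of its first word (1-based,
# matching the question's example: "good sport" repeats starting at word 7).
grams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

uniques = sorted(set(grams))
color_of = {g: i for i, g in enumerate(uniques)}  # one color index per unique 2-gram

# For each occurrence: x = word position, y = running frequency.
xs, ys, cs = [], [], []
seen = {}
for pos, g in enumerate(grams, start=1):
    seen[g] = seen.get(g, 0) + 1
    if seen[g] > 1:                # record a point once an n-gram starts repeating
        xs.append(pos)
        ys.append(seen[g])
        cs.append(color_of[g])

x, y, c = np.array(xs), np.array(ys), np.array(cs)
print(list(zip(xs, ys)))           # -> [(7, 2)]: "good sport" recurs at word 7
```

These arrays drop straight into the commented-out call from the question, e.g. `plt.scatter(x, y, c=c, alpha=0.5)`.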