Q: How to plot n-gram positions in text
I'm trying to plot where and how often n-grams recur in a text. The idea is to identify the point in a text at which an author starts reusing terms; some genres should have shorter uniqueness spans than others.

Words 1...n go on the X axis. Whenever an n-gram's running frequency becomes > 1, a point appears on the plot, where X is its position, Y is the frequency, and the color identifies the unique n-gram. With the code below, the 2-gram "good sport" would be plotted as (7, 2, RED).

Q: How do I create np.arrays of 1. the unique n-grams, 2. their frequencies, and 3. their positions in the text?
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import re

words = "good day, good night, good sport, good sport charlie"
clean = re.sub(r"[^\w\d'\s]+", '', words)
vectorizer2 = CountVectorizer(ngram_range=(2, 2), tokenizer=word_tokenize, stop_words='english')
analyzer = vectorizer2.build_analyzer()
two_grams = analyzer(clean)
# Get the list of unique 2-grams.
uniques = []
for word in two_grams:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                  # initialize the count to zero
    for word in two_grams:     # iterate over the 2-grams
        if word == unique:     # does this 2-gram match the current unique?
            count += 1         # if so, increment the count
    counts.append((count, unique))

counts.sort(reverse=True)      # highest counts first

# Print the ten 2-grams with the highest counts.
for count, word in counts[:10]:
    print('%s %d' % (word, count))
# Scatterplot
# plt.scatter(count, count, s=area, c=colors, alpha=0.5)
# plt.show()
I think it's better to start with a simple, deliberate algorithm before reaching for scikit-learn. If this were my problem, I would walk through the characters of the string until the first character matches, then:
ngram = "good sport"
words = "good day, good night, good sport, good sport charlie"
loc = []
k = 0
while k < len(words):
    if words[k] != ngram[0]:
        k += 1
    elif words[k: k + len(ngram)] == ngram:
        loc += [k]
        k += 1
    else:
        k += 1
This returns loc = [22, 34]. Any list can be turned into an array, e.g. L = np.array(loc). Nobody claims this is efficient or fail-proof, but it is better to understand what you are coding before handing it off to scikit-learn.
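Putting the pieces together, here is a minimal sketch (my own illustration, not part of the answer above) that builds the three arrays the question asks for: the unique 2-grams, their running frequencies, and their word positions. It uses word positions rather than character offsets, tokenizes with a plain split, and all variable names are illustrative.

```python
import re
import numpy as np

words = "good day, good night, good sport, good sport charlie"
tokens = re.sub(r"[^\w\d'\s]+", "", words).split()

# Every 2-gram, indexed by the word position of its first word (1-based,
# matching the question's example: "good sport" repeats starting at word 7).
grams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

uniques = sorted(set(grams))
color_of = {g: i for i, g in enumerate(uniques)}  # one color index per unique 2-gram

# For each occurrence: x = word position, y = running frequency.
xs, ys, cs = [], [], []
seen = {}
for pos, g in enumerate(grams, start=1):
    seen[g] = seen.get(g, 0) + 1
    if seen[g] > 1:                # record a point once an n-gram starts repeating
        xs.append(pos)
        ys.append(seen[g])
        cs.append(color_of[g])

x, y, c = np.array(xs), np.array(ys), np.array(cs)
print(list(zip(xs, ys)))           # -> [(7, 2)]: "good sport" recurs at word 7
```

These arrays drop straight into the commented-out call from the question, e.g. `plt.scatter(x, y, c=c, alpha=0.5)`.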