Extract main feature of paragraphs using word2vec

I've just picked up Google's word2vec model and am still new to the concept. I'm trying to extract the main feature of a paragraph using the following approach.

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('../../usr/myProject/word2vec/GoogleNews-vectors-negative300.bin', binary=True)

...

for para in paragraph_array:
    para_name = "para_"+ file_name + '{0}'
    sentence_array = d[para_name.format(number_of_paragraphs)] = []

    # Split Paragraph on basis of '.' or ? or !.
    for l in re.split(r"\.|\?|\!", para):
        # Split line into list using space.
        sentence_array.append(l)
        #sentence_array.append(l.split(" "))

     print (model.wv.most_similar(positive=para, topn = 1))

But I get the following error, which says the paragraph being checked is not a word in the vocabulary.

KeyError: 'word \'The Republic of Ghana is a country in West Africa. It borders Côte d\'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east, and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. and was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea).\' not in vocabulary'

Now I understand that the most_similar() function takes an array. But I'd like to know how to translate this so that the word2vec model extracts one main feature, or a word that represents the paragraph's main concept.

Edit

I modified the code above to pass word_array to the most_similar() method, but I get the following error.

Traceback (most recent call last):
  File "/home/manuelanayantarajeyaraj/PycharmProjects/ChatbotWord2Vec/new_approach.py", line 108, in <module>
    print(model.wv.most_similar(positive=word_array, topn=1))
  File "/home/manuelanayantarajeyaraj/usr/myProject/my_project/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 361, in most_similar
    for word, weight in positive + negative:
ValueError: too many values to unpack (expected 2)

Modified implementation

for sentence in sentence_array:
    if sentence:
        for w in re.split(r"\.|\?|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\-", sentence):
            split_word = w.split(" ")
            if split_word:
                word_array.append(split_word)
print(model.wv.most_similar(positive=word_array, topn=1))

Any suggestions in this regard are much appreciated.

Your error indicates that you are looking up the entire string ('The Republic of Ghana is a country in West Africa. It borders Côte d\'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east, and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. and was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea).') as if it were a single word, and that word isn't present.

The most_similar() method can take a list of positive-example words, but you would have to tokenize that string into words that are likely to be in the word-vector set. (That might involve breaking on whitespace and punctuation, to match whatever Google did to prepare that word-vector set.)

In that case, most_similar() will average the vectors of all the given words together, and return other words close to that average.
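(Note, too, that the word_array in the edit above is a list of lists, because whole split results were appended; most_similar() expects a flat list of word strings or (word, weight) pairs, which is why the unpacking fails with "too many values to unpack".)

As a minimal sketch of that tokenize-filter-average flow, assuming para is one paragraph string and model is the KeyedVectors object loaded earlier; gensim's simple_preprocess is just one tokenizer choice, not necessarily a match for Google's own preprocessing:

from gensim.utils import simple_preprocess

# Tokenize the paragraph (lowercases and strips punctuation).
tokens = simple_preprocess(para)

# Keep only tokens the model actually knows, so no KeyError is raised.
known = [t for t in tokens if t in model]

if known:
    # most_similar() averages the vectors of all positive words and
    # returns the word nearest to that average.
    print(model.most_similar(positive=known, topn=1))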

Whether that truly captures the 'main concept' of a text is unclear. While word vectors may help identify a text's concepts, that isn't their primary or only function, and it isn't automatic. You might want to filter the set of words down to those that are distinctive in some other way, for example words that are relatively rare overall, or that carry weight under some corpus-dependent measure (such as TF-IDF).
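For illustration only (not part of the original answer), such a TF-IDF filter could look like the following with scikit-learn, where paragraph_texts is assumed to be a list of paragraph strings; the top-weighted terms could then be fed to most_similar():

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF across the whole collection so document frequencies are meaningful.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(paragraph_texts)

# Keep the five highest-weighted terms of the first paragraph as its
# candidate "main concept" words.
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
weights = tfidf[0].toarray().ravel()
print([terms[i] for i in weights.argsort()[::-1][:5]])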

I rewrote the entire code, adding checks to avoid storing empty strings into the objects at every level, from paragraphs to sentences to words.

Working version

# files, directory_path and the counter c are defined earlier in the script.
import re

file_dictionary = {}
paragraph_dictionary = {}
sentence_dictionary = {}
word_dictionary = {}

for file_name in files:
    file_array = file_dictionary[file_name] = []
    file_path = directory_path + '/' + file_name

    with open(file_path) as f:
        # Level 2 intents: each file's main intent (one per file)
        first_line = f.readline()
        print()
        print("Level 2 Intent for ", c, " : ", first_line)

        # Level 3 intents: each paragraph's main intent (one per paragraph)
        paragraph_count = 0
        data = f.read()
        paragraphs = data.split("\n")

        for paragraph in paragraphs:
            paragraph_identifier = file_name + "_paragraph_" + str(paragraph_count)
            paragraph_array = paragraph_dictionary[paragraph_identifier] = []
            if paragraph:
                paragraph_array.append(paragraph)
            paragraph_count += 1
            if len(paragraph_array) > 0:
                file_array.append(paragraph_array)

            # Level 4 intents: each sentence's main intent (one per sentence)
            sentence_count = 0

            for sentence in paragraph_array:
                for line in re.split(r"\.|\?|\!", sentence):
                    sentence_identifier = paragraph_identifier + "_sentence_" + str(sentence_count)
                    sentence_array = sentence_dictionary[sentence_identifier] = []
                    if line:
                        sentence_array.append(line)
                        sentence_count += 1

                    # Level 5 intents: each word with a certain level of prominence (one per prominent word)
                    word_count = 0

                    for words in sentence_array:
                        for word in re.split(r" ", words):
                            word_identifier = sentence_identifier + "_word_" + str(word_count)
                            word_array = word_dictionary[word_identifier] = []
                            if word:
                                word_array.append(word)
                                word_count += 1

Code for accessing dictionary items

#Accessing any paragraph array can be done as follows
print (paragraph_dictionary['S08_set4_a5.txt.clean_paragraph_4'])

#Accessing any sentence corresponding to a paragraph
print (sentence_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1'])

#Accessing any word corresponding to a sentence
print (word_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1_word_3'])

Output

['Celsius was born in Uppsala in Sweden. He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France.']
[' He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France']
['of']