How to get doc2vec or sen2vec trained vectors in a readable (csv or txt) format, line by line?

I trained a fastText, Sen2vec, or word2vec model on my collection of news items, which is stored in a CSV file with one news item per line, like this:

0 Trump is a liar.....
1 Europa going for brexit.....
2 Russia is no more world power......

So I have the trained model, and I can now happily get the vector for any line of my CSV file, like this (fastText):

import csv
import re

train = open('tweets.train3', 'w')
test = open('tweets.valid3', 'w')
with open(r'C:\Users3\Desktop\data\osn-9.csv', mode='r', encoding='utf-8',
          errors='ignore') as csv_file:
    csv_reader = csv.DictReader(csv_file, fieldnames=['sen', 'text'])
    line = 0
    for row in csv_reader:
        # Clean the training data
        # First we lower case the text
        text = row["text"].lower()
        # Remove links
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        # Remove usernames
        text = re.sub(r'@[^\s]+', '', text)
        # Replace punctuation with spaces
        text = ' '.join(re.sub(r"[\.\,\!\?\:\*\(\)\;\-\=]", " ", text).split())
        # Replace hashtags by just the words
        text = re.sub(r'#([^\s]+)', r'\1', text)
        # Collapse multiple white spaces into a single white space
        text = re.sub(r'\s+', ' ', text)
        # Additional clean up: remove words of 3 characters or fewer, and
        # remove space at the beginning and the end
        text = re.sub(r'\W*\b\w{1,3}\b', '', text)
        text = text.strip()
        line = line + 1
        # Split data into train and validation
        if line > 8416:
            print(f'__label__{row["sen"]} {text}', file=test)
        else:
            print(f'__label__{row["sen"]} {text}', file=train)
# Close both files so everything is flushed to disk before training
train.close()
test.close()
import fasttext

hyper_params = {"lr": 0.1,
                "epoch": 500,
                "wordNgrams": 2,
                "dim": 100,
                "loss": "softmax"}

model = fasttext.train_supervised(input='tweets.train3', **hyper_params)
model.get_sentence_vector('Trump is a liar.....')
array([-0.20266785,  0.3407566 ,  ...,  0.03044436,  0.39055538], dtype=float32)

Or like this (gensim):

In [10]:
model.infer_vector(['Trump', 'is', 'a', 'liar'])
Out[10]:
array([ 0.24116205,  0.07339828, -0.27019867, -0.19452883,  0.126193  ,
 ........................,
    0.09754166,  0.12638392, -0.09281237, -0.04791372,  0.15747668],
  dtype=float32)

But how can I get the vectors written into my CSV file, one per row, instead of as arrays? Like this:

0  Trump is a liar..... -0.20266785,  0.3407566 ,  ...,  0.03044436,
1  Europa going for brexit..... 0.24116205,  0.07339828,.... -0.27019867
2  Russia is no more world power...... 0.12638392, -0.09281237 ...-0.04791372,

Or like this:

0   -0.20266785,  0.3407566 ,  ...,  0.03044436,  
1   0.24116205,  0.07339828,.... -0.27019867
2   0.12638392, -0.09281237...-0.0479137

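The Python csv library will get you started. The examples in its documentation are simple and clear; all you have to do is pass each row as a list and make sure the writer is set up correctly.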

A loose example:

import csv 

#This should be a list of all the lists that
#you would like to write into the csv
master_list = []

with open('mycsv.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for item in master_list:
        writer.writerow(item)

This should at least get you started. I did some light testing and it worked for me.
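To connect this to the vectors in the question, here is a minimal sketch of how the rows could be built and written in one pass. It assumes the input CSV has two columns (an id or label, then the text), as the DictReader call in the question suggests. The function name write_vectors, its vectorize and include_text parameters, and the output file name vectors.csv are all made up for illustration; vectorize is any callable that takes a string and returns a sequence of numbers.

```python
import csv


def write_vectors(in_path, out_path, vectorize, include_text=True):
    """Write one CSV row per input row: id, optionally the text, then the vector.

    vectorize is a hypothetical stand-in for whatever turns a string into
    a vector, for example a trained model's sentence-vector method.
    Returns the number of rows written.
    """
    count = 0
    with open(in_path, newline='', encoding='utf-8') as src, \
            open(out_path, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter=',')
        for row in reader:
            # Assumption: column 0 is the id/label, column 1 is the text
            row_id, text = row[0], row[1]
            # float() converts numpy float32 values to plain Python floats,
            # so each component lands in its own CSV cell
            vector = [float(x) for x in vectorize(text)]
            prefix = [row_id, text] if include_text else [row_id]
            writer.writerow(prefix + vector)
            count += 1
    return count


# Possible usage with the models from the question (not run here):
# fastText:
#   write_vectors('osn-9.csv', 'vectors.csv', model.get_sentence_vector)
# gensim Doc2Vec, whose infer_vector expects a list of tokens:
#   write_vectors('osn-9.csv', 'vectors.csv',
#                 lambda t: model.infer_vector(t.split()))
```

Passing include_text=False gives the second layout from the question (id followed only by the vector components). If you trained on cleaned text, you would probably want to apply the same cleaning inside vectorize before asking the model for a vector.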