How to get doc2vec or sen2vec trained vectors in readable (csv or txt) format linewise?
I trained a fasttext / sen2vec / word2vec model on my news set, which sits in a csv file with one news item per line, like this:
0 Trump is a liar.....
1 Europa going for brexit.....
2 Russia is no more world power......
So I have the trained model, and now I can happily get a vector for any line of my csv file like this (fasttext):
import csv
import re

train = open('tweets.train3', 'w')
test = open('tweets.valid3', 'w')
with open(r'C:\Users3\Desktop\data\osn-9.csv', mode='r',
          encoding='utf-8', errors='ignore') as csv_file:
    csv_reader = csv.DictReader(csv_file, fieldnames=['sen', 'text'])
    line = 0
    for row in csv_reader:
        # Clean the training data: first lower-case the text
        text = row["text"].lower()
        # Remove links
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        # Remove usernames
        text = re.sub(r'@[^\s]+', '', text)
        # Replace punctuation with spaces
        text = ' '.join(re.sub(r"[.,!?:*();\-=]", " ", text).split())
        # Remove hashtags (note: this drops the tag word as well)
        text = re.sub(r'#([^\s]+)', '', text)
        # Collapse all runs of whitespace into a single space
        text = re.sub(r'\s+', ' ', text)
        # Additional clean-up: remove words of 3 chars or fewer, and
        # strip whitespace at the beginning and the end
        text = re.sub(r'\W*\b\w{1,3}\b', '', text)
        text = text.strip()
        line = line + 1
        # Split data into train and validation sets
        if line > 8416:
            print(f'__label__{row["sen"]} {text}', file=test)
        else:
            print(f'__label__{row["sen"]} {text}', file=train)
train.close()
test.close()
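For reuse, the cleaning steps above can be collected into a single function (a sketch using the same regexes as the loop above; `clean_text` is a name introduced here, not from the original code):

```python
import re

def clean_text(text):
    """Apply the same cleaning steps as the training-data loop."""
    text = text.lower()
    # Remove links and usernames
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
    text = re.sub(r'@[^\s]+', '', text)
    # Replace punctuation with spaces, then drop hashtags entirely
    text = ' '.join(re.sub(r"[.,!?:*();\-=]", " ", text).split())
    text = re.sub(r'#([^\s]+)', '', text)
    # Collapse whitespace, remove words of 3 chars or fewer, trim
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\W*\b\w{1,3}\b', '', text)
    return text.strip()

print(clean_text('Trump is a liar..... #politics http://t.co/x'))
# → trump liar
```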
import fasttext

hyper_params = {"lr": 0.1,
                "epoch": 500,
                "wordNgrams": 2,
                "dim": 100,
                "loss": "softmax"}
model = fasttext.train_supervised(input='tweets.train3', **hyper_params)
model.get_sentence_vector('Trump is a liar.....')
array([-0.20266785, 0.3407566 , ..., 0.03044436, 0.39055538],
      dtype=float32)
Or like this (gensim):
In [10]:
model.infer_vector(['Trump', 'is', 'a ', 'liar'])
Out[10]:
array([ 0.24116205, 0.07339828, -0.27019867, -0.19452883, 0.126193 ,
........................,
0.09754166, 0.12638392, -0.09281237, -0.04791372, 0.15747668],
dtype=float32)
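Both `model.get_sentence_vector` and `model.infer_vector` return numpy `float32` arrays, so a csv row can be built by concatenating the id and text with `vector.tolist()` (a minimal illustration with a hand-made array standing in for real model output):

```python
import numpy as np

# Hand-made stand-in for a real inferred/sentence vector
vec = np.array([0.24116205, 0.07339828, -0.27019867], dtype=np.float32)

# id, original text, then one column per vector component
row = [0, 'Trump is a liar.....'] + vec.tolist()
print(row[:2])  # → [0, 'Trump is a liar.....']
```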
But how can I get the vectors into my csv file against each line, instead of just printing arrays? Like this:
0 Trump is a liar..... -0.20266785, 0.3407566 , ..., 0.03044436,
1 Europa going for brexit..... 0.24116205, 0.07339828,.... -0.27019867
2 Russia is no more world power...... 0.12638392, -0.09281237
...-0.04791372,
Or like this:
0 -0.20266785, 0.3407566 , ..., 0.03044436,
1 0.24116205, 0.07339828,.... -0.27019867
2 0.12638392, -0.09281237...-0.0479137
The Python csv library will get you started. The examples are quite straightforward: all you have to do is pass a list as the argument and make sure the writer has the right settings.
A loose example:
import csv

# This should be a list of all the lists (rows) that
# you would like to write into the csv
master_list = []
with open('mycsv.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for item in master_list:
        writer.writerow(item)
That should at least get you started. I did some light testing and it works for me, at least.
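Tying this back to the question: iterate over the news lines, get each line's vector, and write one csv row per line. A minimal sketch; `get_vector` below is a hypothetical stand-in for `model.get_sentence_vector(text)` (fasttext) or `model.infer_vector(text.split())` (gensim):

```python
import csv

def get_vector(text):
    # Hypothetical stand-in for a real model call; returns a
    # fixed-length list of floats per line of text
    return [round(0.01 * len(word), 2) for word in text.split()]

news = [(0, 'Trump is a liar.....'),
        (1, 'Europa going for brexit.....'),
        (2, 'Russia is no more world power......')]

with open('vectors.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for line_id, text in news:
        vec = get_vector(text)
        # One row per news line: id, text, then the vector components;
        # drop `text` from the list for the id-plus-vector-only format
        writer.writerow([line_id, text] + list(vec))
```

Reading `vectors.csv` back with `csv.reader` then gives each line's id, text, and vector components, matching the first format asked for above.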