Distance between two arrays or vectors of different lengths?
I have a program that predicts positive or negative reviews using the kNN algorithm. After building a bag of words from my training set of reviews, I want to find the distance between the vectors/arrays. But I can't use euclidean_distances() because the vectors all have different lengths. How do I find the distance between vectors of different lengths?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
import math
import re
import random

def divide_chunks(l, n):
    # Yield successive chunks of n reviews from the list l
    for i in range(0, len(l), n):
        yield l[i:i + n]

with open(r"D:\Desktop\1565964985_2925534_train_file.data", "r") as f:
    data_lines = f.readlines()

    sentiments = list()
    reviews = list()
    for i, line in enumerate(data_lines):
        # Split the leading sentiment label from the review text
        s = ''.join(re.findall("^[+1]*[-1]*[0]*", line))
        r = line.replace(s, '').strip()
        #print('line:{} \n\t sentiment: {} \n\t review: {}'.format(i, s, r))
        sentiments.append(s)
        reviews.append(r)

    n = 1
    x = list(divide_chunks(reviews, n))
    print(x[0])

    count = CountVectorizer()
    docs = np.array(x[0])
    bag = count.fit_transform(docs)
    print(bag.toarray())

    docs = np.array(x[1])
    bag1 = count.fit_transform(docs)  # refits the vectorizer on the second chunk
    print(bag1.toarray())

    euclidean_distances(bag, bag1)

    f.close()  # redundant: the with-block already closes the file
Error and traceback:
[[1 1 1 1 2 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 3 1
1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 6 1 1 2 1 1 1 1 1 1 5 1 1 1 1 2 2 1 1 1 1
2 1]]
[[1 1 3 3 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 6 1 1 1 2 1 6 1 1 1 4 3
1 1 1 1 1 3 4 1 1 1 1 1 1 1 1 4 1 1 3 1 1 1 1 3 1 2 1 1 1 1 1 2 1 1 1 2
1 2 1 1 1 5 1 1 2 1 1 1 8 1 4 1 1 1 1 3 2 1 3 1 1 2 1 1]]
Traceback (most recent call last):
File "D:/Users/Not_J/PycharmProjects/untitled/HW1_Durand.py", line 82, in <module>
euclidean_distances(bag, bag1)
File "D:\Users\Not_J\PycharmProjects\HW1_Durand\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 232, in euclidean_distances
X, Y = check_pairwise_arrays(X, Y)
File "D:\Users\Not_J\PycharmProjects\HW1_Durand\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 125, in check_pairwise_arrays
X.shape[1], Y.shape[1]))
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 74 while Y.shape[1] == 100
Try changing the second fit_transform(docs) to transform(docs).
See this question for more information.
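A minimal sketch of that fix, using two made-up example reviews in place of x[0] and x[1]: fitting the vectorizer once and then calling transform() on the second document keeps both vectors in the same vocabulary, so their lengths match and euclidean_distances() works.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

doc_a = ["this movie was great and fun"]   # hypothetical stand-in for x[0]
doc_b = ["this movie was terrible"]        # hypothetical stand-in for x[1]

count = CountVectorizer()
bag = count.fit_transform(doc_a)   # learn the vocabulary from the first document
bag1 = count.transform(doc_b)      # reuse that vocabulary; shapes now match
print(euclidean_distances(bag, bag1))  # [[1.73205081]]

Note that transform() simply ignores words that were not seen during fitting, so in practice it is usually better to fit the CountVectorizer on the full list of reviews first and then transform each chunk, so no vocabulary is lost.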