Turn a distance squareform into long format in Python
Code:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
ids = ['1', '2', '3']
points=[(0,0), (1,1), (3,3)]
distances = pdist(np.array(points), metric='euclidean')
print(distances)
distance_matrix = squareform(distances)
print(distance_matrix)
Prints:
[1.41421356 4.24264069 2.82842712]
[[0. 1.41421356 4.24264069]
[1.41421356 0. 2.82842712]
[4.24264069 2.82842712 0. ]]
as expected.
I want to turn it into long format so that it can be written to a CSV file, like
id1,id2,distance
1,1,0
1,2,1.41421356
1,3,4.24264069
2,1,1.41421356
2,2,0
2,3,2.82842712
etc. What is the most efficient way to do this? Using pandas is an option.
Use the DataFrame constructor with stack:
df = pd.DataFrame(distance_matrix, index=ids, columns=ids).stack().reset_index()
df.columns=['id1','id2','distance']
print (df)
id1 id2 distance
0 1 1 0.000000
1 1 2 1.414214
2 1 3 4.242641
3 2 1 1.414214
4 2 2 0.000000
5 2 3 2.828427
6 3 1 4.242641
7 3 2 2.828427
8 3 3 0.000000
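Since the goal in the question is a CSV file, here is a minimal sketch of that last step, assuming the stacked DataFrame from above. It writes to an in-memory buffer so the result can be inspected; the file name mentioned in the comment is only a placeholder, not something from the question.

```python
import io

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

ids = ['1', '2', '3']
points = [(0, 0), (1, 1), (3, 3)]
distance_matrix = squareform(pdist(np.array(points), metric='euclidean'))

# Same long-format construction as in the answer above.
df = pd.DataFrame(distance_matrix, index=ids, columns=ids).stack().reset_index()
df.columns = ['id1', 'id2', 'distance']

# index=False drops the 0..8 row labels so only id1,id2,distance are written.
# Pass a path (for example 'distances.csv', a placeholder name) instead of the
# buffer to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```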
Or use the DataFrame constructor with numpy.repeat, numpy.tile and ravel:
df = pd.DataFrame({'id1': np.repeat(ids, len(ids)),
                   'id2': np.tile(ids, len(ids)),
                   'dist': distance_matrix.ravel()})
print (df)
id1 id2 dist
0 1 1 0.000000
1 1 2 1.414214
2 1 3 4.242641
3 2 1 1.414214
4 2 2 0.000000
5 2 3 2.828427
6 3 1 4.242641
7 3 2 2.828427
8 3 3 0.000000
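To see why repeat and tile line up with ravel, here is a small sketch. It uses a stand-in matrix (`matrix`, my own name, not from the answer) and relies on ravel flattening row by row, so that cell (i, j) of an n x n array ends up at flat position i * n + j.

```python
import numpy as np

ids = ['1', '2', '3']
n = len(ids)

# Each id repeated n times in a row: the row label of every flattened cell.
row_ids = np.repeat(ids, n)
# The whole id list repeated n times: the column label of every flattened cell.
col_ids = np.tile(ids, n)

# Stand-in matrix for illustration only; ravel() flattens it row by row.
matrix = np.arange(n * n).reshape(n, n)
flat = matrix.ravel()

# Cell (i, j) lands at flat position i * n + j.
i, j = 1, 2
k = i * n + j
print(row_ids[k], col_ids[k], flat[k], matrix[i, j])
```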
I would suggest using a helper function -
import numpy as np
import functools

# by @unutbu
def indices_merged_arr_generic_using_cp(arr):
    """
    Based on cartesian_product (senderle)
    """
    shape = arr.shape
    arrays = [np.arange(s, dtype='int') for s in shape]
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    # rows: number of elements in arr; cols: one index column per axis plus one value column
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), len(broadcasted)+1
    out = np.empty(rows * cols, dtype=arr.dtype)
    start, end = 0, rows
    # fill the index columns, one contiguous block per axis
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    # the last block holds the array values themselves
    out[start:] = arr.flatten()
    return out.reshape(cols, rows).T
Usage -
In [169]: out = indices_merged_arr_generic_using_cp(distance_matrix)
In [170]: np.savetxt('out.txt', out, fmt="%i,%i,%f")
In [171]: !cat out.txt
0,0,0.000000
0,1,1.414214
0,2,4.242641
1,0,1.414214
1,1,0.000000
1,2,2.828427
2,0,4.242641
2,1,2.828427
2,2,0.000000
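Note that this output contains positional indices (0, 1, 2), not the ids from the question. If the original ids are needed, one possible way to map them back is sketched below. To keep it self-contained, the array `out` is rebuilt here with np.indices in the same row index, column index, value layout instead of calling the helper; that substitution is mine and not part of the answer.

```python
import numpy as np
import pandas as pd

ids = np.array(['1', '2', '3'])
distance_matrix = np.array([[0.0, 1.41421356, 4.24264069],
                            [1.41421356, 0.0, 2.82842712],
                            [4.24264069, 2.82842712, 0.0]])

# Stand-in for the helper's output: columns are row index, column index, value.
rows, cols = np.indices(distance_matrix.shape)
out = np.column_stack([rows.ravel(), cols.ravel(), distance_matrix.ravel()])

# The index columns are stored as floats, so cast them back to int
# before using them to look up the ids.
df = pd.DataFrame({
    'id1': ids[out[:, 0].astype(int)],
    'id2': ids[out[:, 1].astype(int)],
    'distance': out[:, 2],
})
print(df)
```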
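To get distance_matrix, we can also use SciPy's cdist: cdist(points, points). There is also the eucl_dist package (disclaimer: I am its author), which contains various methods for computing Euclidean distances that are more efficient than SciPy's cdist, especially for large arrays.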
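As a quick check of the cdist route, the sketch below compares it with the pdist plus squareform matrix from the question. cdist is imported from scipy.spatial.distance, which the question's imports do not include; the variable names are mine.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

points = np.array([(0, 0), (1, 1), (3, 3)])

# cdist computes all pairwise distances directly; its metric defaults to 'euclidean'.
via_cdist = cdist(points, points)

# The route used in the question: condensed distances expanded to a square matrix.
via_pdist = squareform(pdist(points, metric='euclidean'))

print(np.allclose(via_cdist, via_pdist))
```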