Kmeans Euclidean Distance to Each Centroid: Avoid Splitting Features From the Rest of the DF
I have a df:
id Type1 Type2 Type3
0 10000 0.0 0.00 0.00
1 10001 0.0 63.72 0.00
2 10002 473.6 174.00 31.60
3 10003 0.0 996.00 160.92
4 10004 0.0 524.91 0.00
I applied k-means to this df and added the resulting clusters to the df:
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(df.drop('id', axis=1))
df['cluster'] = kmeans.labels_
Now I am trying to add columns to the df for the Euclidean distance between each point (i.e. each row in the df) and each centroid:
def distance_to_centroid(row, centroid):
    row = row[['Type1',
               'Type2',
               'Type3']]
    return euclidean(row, centroid)
df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
This results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-56fa3ae3df54> in <module>()
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
~\_installed\anaconda\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
<ipython-input-34-56fa3ae3df54> in <lambda>(r)
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
<ipython-input-33-7b988ca2ad8c> in distance_to_centroid(row, centroid)
7 'atype',
8 'anothertype']]
----> 9 return euclidean(row, centroid)
~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in euclidean(u, v, w)
596
597 """
--> 598 return minkowski(u, v, p=2, w=w)
599
600
~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in minkowski(u, v, p, w)
488 if p < 1:
489 raise ValueError("p must be at least 1")
--> 490 u_v = u - v
491 if w is not None:
492 w = _validate_weights(w)
ValueError: ('operands could not be broadcast together with shapes (7,) (8,) ', 'occurred at index 0')
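This error occurs because id is not included in the row variable in the function distance_to_centroid. To fix this, I could split the df into two parts (id in df1 and the rest of the columns in df2). However, this is very manual and does not make it easy to change the columns. Is there a way to get the distance to each centroid into the original df without splitting it? Similarly, is there a better way to find the Euclidean distance that does not involve manually entering the columns into the row variable, or manually creating as many columns as there are clusters?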
Expected result:
id Type1 Type2 Type3 cluster distance_to_cluster_0
0 10000 0.0 0.00 0.00 1 2.3
1 10001 0.0 63.72 0.00 2 3.6
2 10002 473.6 174.00 31.60 0 0.5
3 10003 0.0 996.00 160.92 3 3.7
4 10004 0.0 524.91 0.00 4 1.8
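We need to pass the coordinate part of df to KMeans, and we want to compute the distances to the centroids using only the coordinate part of df. So we might as well define a variable for this quantity: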
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
Then we can compute the distance from the coordinate part of each row to its corresponding centroid with:
import numpy as np
centroids = kmeans.cluster_centers_
# axis=1 gives one norm per row instead of a single norm for the whole array
dist = np.linalg.norm(points - centroids[df['cluster']], axis=1)
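Note that centroids[df['cluster']] returns a NumPy array with the same shape as points. Indexing by df['cluster'] "expands" the centroids array so that each row holds the centroid of the cluster that row was assigned to.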
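As a small illustration of this indexing step, here is a sketch with made-up numbers. The `centroids`, `labels` and `points` arrays are hypothetical stand-ins for `kmeans.cluster_centers_`, `kmeans.labels_` and the coordinate columns; they are not taken from the question's data.

```python
import numpy as np

# Hypothetical stand-ins for kmeans.cluster_centers_ and kmeans.labels_
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0],
                      [20.0, 0.0]])
labels = np.array([2, 0, 0, 1])

# Integer-array indexing picks one centroid row per label,
# so the result has one row per point
expanded = centroids[labels]
print(expanded.shape)  # (4, 2)

points = np.array([[19.0, 0.0],
                   [0.0, 3.0],
                   [4.0, 3.0],
                   [10.0, 10.0]])

# Row-wise Euclidean distance from each point to its own centroid
dist = np.linalg.norm(points - expanded, axis=1)
print(dist)  # [1. 3. 5. 0.]
```

Because `expanded` lines up row by row with `points`, the subtraction needs no loop and no per-cluster column.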
Then we can assign these dist values to a DataFrame column with:
df['dist'] = dist
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist
df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
# axis=1 gives one norm per row instead of a single norm for the whole array
dist = np.linalg.norm(points - centroids[df['cluster']], axis=1)
df['dist'] = dist
print(df)
yields
Type1 Type2 Type3 id cluster dist
0 0.0 0.00 0.00 1000 4 0.000000e+00
1 0.0 63.72 0.00 10001 2 2.842171e-14
2 473.6 174.00 31.60 10002 1 0.000000e+00
3 0.0 996.00 160.92 10003 3 0.000000e+00
4 0.0 524.91 0.00 10004 0 0.000000e+00
If you want the distance from each point to every cluster centroid, you can use sdist.cdist:
import scipy.spatial.distance as sdist
sdist.cdist(points, centroids)
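To make the shape of the cdist result concrete, here is a minimal sketch on toy data. The arrays and the `labels` assignment are hypothetical, chosen so the distances are easy to check by hand.

```python
import numpy as np
import scipy.spatial.distance as sdist

# Toy data, not the question's df
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centroids = np.array([[0.0, 0.0], [6.0, 8.0]])
labels = np.array([0, 0, 1])  # hypothetical cluster assignments

# One row per point, one column per centroid
all_dists = sdist.cdist(points, centroids)
print(all_dists.shape)  # (3, 2)

# Picking the column that matches each row's label recovers
# the distance from each point to its own centroid
own = all_dists[np.arange(len(points)), labels]
print(own)  # [0. 5. 0.]
```

The full matrix therefore contains the per-row distance as a special case, so computing it once with cdist covers both needs.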
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist
df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
dists = pd.DataFrame(
    sdist.cdist(points, centroids),
    columns=['dist_{}'.format(i) for i in range(len(centroids))],
    index=df.index)
df = pd.concat([df, dists], axis=1)
print(df)
yields
Type1 Type2 Type3 id cluster dist_0 dist_1 dist_2 dist_3 dist_4
0 0.0 0.00 0.00 1000 4 524.910000 505.540819 6.372000e+01 1008.915877 0.000000
1 0.0 63.72 0.00 10001 2 461.190000 487.295802 2.842171e-14 946.066195 63.720000
2 473.6 174.00 31.60 10002 1 590.282431 0.000000 4.872958e+02 957.446929 505.540819
3 0.0 996.00 160.92 10003 3 497.816266 957.446929 9.460662e+02 0.000000 1008.915877
4 0.0 524.91 0.00 10004 0 0.000000 590.282431 4.611900e+02 497.816266 524.910000
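As a possible alternative, scikit-learn's KMeans also has a transform method that returns the distance from each sample to every cluster center, which should give the same matrix as cdist up to floating-point error. The sketch below is one way to use it under a few assumptions that are not part of the answer above: two clusters instead of five, an explicit n_init=10, and columns.difference to pick every column except id.

```python
import pandas as pd
import sklearn.cluster as cluster

df = pd.DataFrame({'id': [10000, 10001, 10002, 10003, 10004],
                   'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0]})

# Select every column except the identifier, so no feature name is typed by hand
feature_cols = df.columns.difference(['id'])
points = df[feature_cols]

# n_clusters=2 and n_init=10 are illustrative choices, not the question's settings
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
df['cluster'] = kmeans.labels_

# transform() maps each row to its distances from all cluster centers
dists = pd.DataFrame(
    kmeans.transform(points),
    columns=['dist_{}'.format(i) for i in range(kmeans.n_clusters)],
    index=df.index)
result = pd.concat([df, dists], axis=1)
print(result)
```

The original df is never split permanently here; points is only a view of the feature columns, and the distance columns are joined back on the shared index.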