Heavily weighted distance returns the same results as regular distance in kNN with the iris dataset
I am experimenting with the way distance weights affect the performance of the kNN algorithm, and for a reproducible example I am working with the iris dataset.
To my surprise, weighting 2 predictors 100 times more than the remaining 2 produces identical predictions to the unweighted model. What explains this rather counterintuitive finding?
My code is the following:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_original = iris['data']
Y = iris['target']
sc = StandardScaler()  # Defines the parameters of the scaler
X = sc.fit_transform(X_original)  # Standardizes the original data and returns it
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)
split = sss.split(X, Y)
s = list(split)
train_index = s[0][0]
test_index = s[0][1]
X_train = X[train_index]
X_test = X[test_index]
Y_train = Y[train_index]
Y_test = Y[test_index]
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
iris_fit = knn.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w1 = knn.predict(X_test)
weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)
# Note: the 'wminkowski' metric only works on older scikit-learn versions;
# newer releases drop it in favor of metric='minkowski' with metric_params={'w': weights}.
knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
                             metric_params={'w': weights})
iris_fit_w = knn_w.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w100 = knn_w.predict(X_test)
(predictions_w1 != predictions_w100).sum()
0
They are not always identical. Add a random state to your train test split and you will see how the result changes for different values:
StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
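As a quick check, here is a minimal sketch (reusing the X, Y, knn and knn_w objects defined in the question) that repeats the split for a few seed values and counts how often the two models disagree:

# Count disagreements between the unweighted and weighted models across seeds
for seed in range(5):
    sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2,
                                 random_state=seed)
    train_idx, test_idx = next(sss.split(X, Y))
    knn.fit(X[train_idx], Y[train_idx])
    knn_w.fit(X[train_idx], Y[train_idx])
    n_diff = (knn.predict(X[test_idx]) != knn_w.predict(X[test_idx])).sum()
    print(f"random_state={seed}: {n_diff} differing predictions")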
Additionally, a weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features gives you essentially the same result as running kNN on those 2 features alone, without any weighted Minkowski. Since those two features appear to be very informative, it is no surprise that you get very similar results to the case where all 4 features are considered; the iris scatter plots on Wikipedia illustrate this.
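To see this concretely, here is a minimal sketch (again reusing the variables from the question, and assuming the usual iris column order of sepal length, sepal width, petal length, petal width) that compares the heavily weighted model against plain kNN restricted to the two petal features:

# Plain kNN on petal length and petal width only (columns 2 and 3)
knn_petal = KNeighborsClassifier(n_neighbors=6)
knn_petal.fit(X_train[:, 2:4], Y_train)
predictions_petal = knn_petal.predict(X_test[:, 2:4])
# Compare with the predictions of the 100x-weighted model
print((predictions_petal != predictions_w100).sum())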