Heavily weighted distance returns the same results as regular distance in kNN with the iris dataset
I am experimenting with the way distance weights affect the performance of the kNN algorithm, and for a reproducible example I am working with the iris dataset.
To my surprise, weighting 2 predictors 100 times more than the remaining 2 produces identical predictions to the unweighted model. What explains this rather counterintuitive finding?
My code is the following:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_original = iris['data']
Y = iris['target']
sc = StandardScaler()  # Defines the parameters of the scaler
X = sc.fit_transform(X_original)  # Standardizes the original data and returns it
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)
split = sss.split(X, Y)
s = list(split)
train_index = s[0][0]
test_index = s[0][1]
X_train = X[train_index]
X_test = X[test_index]
Y_train = Y[train_index]
Y_test = Y[test_index]
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
iris_fit = knn.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w1 = knn.predict(X_test)
weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)
# Note: the 'wminkowski' metric only works on older scikit-learn versions;
# newer releases drop it in favor of metric='minkowski' with metric_params={'w': weights}.
knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
                             metric_params={'w': weights})
iris_fit_w = knn_w.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w100 = knn_w.predict(X_test)
(predictions_w1 != predictions_w100).sum()
0
They are not always identical. Add a random state to your train test split and you will see how the result changes for different values:
StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
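As a quick check, here is a minimal sketch (reusing the X, Y, knn and knn_w objects defined in the question) that repeats the split for a few seed values and counts how often the two models disagree:

# Count disagreements between the unweighted and weighted models across seeds
for seed in range(5):
    sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2,
                                 random_state=seed)
    train_idx, test_idx = next(sss.split(X, Y))
    knn.fit(X[train_idx], Y[train_idx])
    knn_w.fit(X[train_idx], Y[train_idx])
    n_diff = (knn.predict(X[test_idx]) != knn_w.predict(X[test_idx])).sum()
    print(f"random_state={seed}: {n_diff} differing predictions")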
Additionally, a weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features gives you essentially the same result as running kNN on those 2 features alone, without any weighted Minkowski. Since those two features appear to be very informative, it is no surprise that you get very similar results to the case where all 4 features are considered; the iris scatter plots on Wikipedia illustrate this.
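To see this concretely, here is a minimal sketch (again reusing the variables from the question, and assuming the usual iris column order of sepal length, sepal width, petal length, petal width) that compares the heavily weighted model against plain kNN restricted to the two petal features:

# Plain kNN on petal length and petal width only (columns 2 and 3)
knn_petal = KNeighborsClassifier(n_neighbors=6)
knn_petal.fit(X_train[:, 2:4], Y_train)
predictions_petal = knn_petal.predict(X_test[:, 2:4])
# Compare with the predictions of the 100x-weighted model
print((predictions_petal != predictions_w100).sum())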