How to scale datasets correctly

Which of these is more correct, or is there another way to scale data? (I'm using StandardScaler as an example.) I've tried each approach and computed the accuracy of each model, and there was no meaningful difference, but I'd like to know which approach is more correct.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Option 1: fit the scaler separately on the train and test sets
dataset = pd.read_csv("wine.csv")

x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)

# Option 2: fit the scaler on the full dataset, then split
dataset = pd.read_csv("wine.csv")

x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]

sc = StandardScaler()

x = sc.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

# Option 3: fit the scaler on the train set only, then transform the test set
dataset = pd.read_csv("wine.csv")

x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)


The test data must not be seen or used while training the model, because it is reserved for assessing the model's performance. Fitting the scaler on the test set (or on the full dataset before splitting) leaks information about the test data into the preprocessing step, which can make your evaluation optimistic.

So the last option is correct. The scaling parameters (for StandardScaler, the per-feature mean and standard deviation) should be computed on the training set only, and then applied to the test set:

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
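One way to make this pattern automatic is scikit-learn's Pipeline: the scaler is fitted on the training data inside `fit` and only applied (not refitted) inside `score` or `predict`, so the leakage in options 1 and 2 cannot happen by accident. A minimal sketch, using synthetic features in place of the wine CSV and a LogisticRegression classifier as a stand-in for whatever model you are evaluating:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 wine features; the real code would
# load them from wine.csv as in the question.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(200, 13))
y = (x[:, 0] > 5.0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.8, random_state=0
)

# fit() runs fit_transform on the scaler with the training data only;
# score() runs transform on the test data with the training statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print(accuracy)
```

The same pipeline object can also be passed to `cross_val_score`, which refits the scaler on each training fold, so the train-only fitting rule is respected during cross-validation as well.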