Passing pandas NumPy arrays as feature vectors in scikit-learn?
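I have a vector of 5 different values that I use as a sample, and the label is a single integer: 0, 1, or 3. The machine learning algorithm works when I pass the arrays as samples, but I get the warning below. How do I pass feature vectors without getting this warning?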
import numpy as np
from numpy import random
from sklearn import neighbors
from sklearn.model_selection import train_test_split
import pandas as pd
filepath = 'test.csv'
# example label values
index = [0,1,3,1,1,1,0,0]
# example sample arrays
data = []
for i in range(len(index)):
    d = []
    for i in range(6):
        d.append(random.randint(50,200))
    data.append(d)
feat1 = 'brightness'
feat2, feat3, feat4 = ['h', 's', 'v']
feat5 = 'median hue'
feat6 = 'median value'
features = [feat1, feat2, feat3, feat4, feat5, feat6]
df = pd.DataFrame(data, columns=features, index=index)
df.index.name = 'state'
with open(filepath, 'a') as f:
    df.to_csv(f, header=f.tell() == 0)
states = pd.read_csv(filepath, usecols=['state'])
df_partial = pd.read_csv(filepath, usecols=features)
states = states.astype(np.float32)
states = states.values
labels = states
samples = np.array([])
for i, row in df_partial.iterrows():
    r = row.values
    samples = np.vstack((samples, r)) if samples.size else r
n_neighbors = 5
test_size = .3
labels, test_labels, samples, test_samples = train_test_split(labels, samples, test_size=test_size)
clf1 = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf1 = clf1.fit(samples, labels)
score1 = clf1.score(test_samples, test_labels)
print("Here's how the models performed \nknn: %d %%" %(score1 * 100))
Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). clf1 = clf1.fit(samples, labels)"
sklearn documentation for fit(self, X, Y)
Try replacing
states = states.values
with
states = states.values.flatten()
or replacing
clf1 = clf1.fit(samples, labels)
with
clf1 = clf1.fit(samples, labels.flatten())
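states = states.values
holds the correct labels from your pandas DataFrame, but as a 2-D column vector of shape (n_samples, 1), with one label per row. Calling .flatten()
returns a 1-D array of shape (n_samples,) containing the same labels. (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ndarray.flatten.html)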
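To make the shape difference concrete, here is a minimal sketch. The small DataFrame is a hypothetical stand-in for the 'state' column read from test.csv in the question, not the asker's actual file.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(filepath, usecols=['state']).
states = pd.DataFrame({'state': [0, 1, 3, 1, 1, 1, 0, 0]})

column_vector = states.values          # 2-D array: one label per row
flat_labels = states.values.flatten()  # 1-D copy holding the same labels

print(column_vector.shape)  # (8, 1)
print(flat_labels.shape)    # (8,)
```

The (8, 1) shape is what the warning calls a column-vector y; the (8,) shape is the (n_samples,) form it asks for.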
In scikit-learn's KNeighborsClassifier documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), the example passes the labels as a flat 1-D sequence: y = [0, 0, 1, 1].
When you take the values from the states DataFrame, you get a column vector (one label per row), whereas fit expects a 1-D array.
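You can also use ravel(),
which returns a contiguous flattened array.
numpy.ravel(array, order='C'):
returns a contiguous flattened array (a 1-D array containing all elements of the input array, with the same type).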
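Both ravel() and flatten() produce the 1-D shape that fit wants; the practical difference is that flatten() always copies, while ravel() returns a view when it can. A small sketch, using a hypothetical label array shaped like states.values in the question:

```python
import numpy as np

# Hypothetical column vector of labels, shaped like states.values.
column = np.array([[0.0], [1.0], [3.0], [1.0]], dtype=np.float32)

raveled = column.ravel()      # 1-D; a view here, since the input is contiguous
flattened = column.flatten()  # 1-D; always a fresh copy

print(raveled.shape, flattened.shape)
print(np.shares_memory(column, raveled))    # the view shares the original buffer
print(np.shares_memory(column, flattened))  # the copy does not
```

For a small label array either call is fine; the copy only matters if the array is large or you later modify it in place.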
Try:
states = states.values.ravel()
instead of states = states.values
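One way to check that the fix removes the warning is to record warnings around fit. This is only a sketch: the random samples, the n_neighbors=3 setting, and the helper name fit_and_collect are assumptions made for illustration, not part of the original code.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 8 samples with 6 features, labels as a column vector.
rng = np.random.default_rng(0)
samples = rng.integers(50, 200, size=(8, 6)).astype(np.float64)
labels_2d = np.array([[0], [1], [3], [1], [1], [1], [0], [0]], dtype=np.float32)


def fit_and_collect(y):
    """Fit a KNN classifier on y and return any DataConversionWarning raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        KNeighborsClassifier(n_neighbors=3, weights="distance").fit(samples, y)
    return [w for w in caught if issubclass(w.category, DataConversionWarning)]


column_warnings = fit_and_collect(labels_2d)        # shape (8, 1)
flat_warnings = fit_and_collect(labels_2d.ravel())  # shape (8,)

print(len(column_warnings), len(flat_warnings))
```

With the raveled labels the list of DataConversionWarning entries should be empty; the column-vector call is included only for comparison.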