python - multi-class logistic regression to predict season
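I would like to complete my logistic regression algorithm, which predicts the season of the year from the shop name and the purchase category (see the sample data below, and note the label encoding). Shop names are any typical string, whereas categories, e.g. tops, are one of a fixed set of uniform string inputs. The same goes for the four seasons.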
store_df.head()
   shop  category  season
0   594         4       2
1   644         4       2
2   636         4       2
3   675         5       2
4   644         4       0
My full code is below. I'm not sure why it won't accept the shape of my input values. My goal is to use shop and category to predict season.
predict_df = store_df[['shop', 'category', 'season']]
predict_df.reset_index(drop = True, inplace = True)
le = LabelEncoder()
predict_df['shop'] = le.fit_transform(predict_df['shop'].astype('category'))
predict_df['top'] = le.fit_transform(predict_df['top'].astype('category'))
predict_df['season'] = le.fit_transform(predict_df['season'].astype('category'))
X, y = predict_df[['shop', 'top']], predict_df['season']
xtrain, ytrain, xtest, ytest = train_test_split(X, y, test_size=0.2)
lr = LogisticRegression(class_weight='balanced', fit_intercept=False, multi_class='multinomial', random_state=10)
lr.fit(xtrain, ytrain)
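When I run the above, I get the error ValueError: bad input shape (19405, 2)
My interpretation is that it has to do with the two feature inputs, but what do I need to change to be able to use both features?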
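Here is a working example that you can compare your code against to track down any errors. I added a few rows to the dataframe; the details and results follow the code. As you can see, the model correctly predicted three of the four test labels.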
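import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Sample data: the question's five rows plus five added rows (printed in full below)
df = pd.DataFrame({
    'shop': [594, 644, 636, 675, 644, 642, 638, 466, 455, 643],
    'category': [4, 4, 4, 5, 4, 2, 1, 3, 4, 2],
    'season': [2, 2, 2, 2, 0, 1, 1, 0, 0, 1],
})
le = LabelEncoder()
sc = StandardScaler()
# Features are the first two columns (shop, category); the target is the last column (season)
X = pd.get_dummies(df.iloc[:, :2], drop_first=True).values.astype('float')
y = le.fit_transform(df.iloc[:, -1].values).astype('float')
# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
# Fit the scaler on the training split only, then apply it to the test split
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)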
df
Out[32]:
   shop  category  season
0   594         4       2
1   644         4       2
2   636         4       2
3   675         5       2
4   644         4       0
5   642         2       1
6   638         1       1
7   466         3       0
8   455         4       0
9   643         2       1
y_test
Out[33]: array([2., 0., 0., 1.])
y_pred
Out[34]: array([2., 0., 2., 1.])
conf_mat
Out[35]:
array([[1, 0, 1],
       [0, 1, 0],
       [0, 0, 1]], dtype=int64)
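As for the original error, one likely cause is the unpacking order. train_test_split returns X_train, X_test, y_train, y_test, whereas the question unpacks the result as xtrain, ytrain, xtest, ytest. That would make ytrain the test features, a 2-D array, which is consistent with the reported shape (19405, 2): 20% of the rows, two columns. A minimal sketch on made-up data (the arrays below are hypothetical stand-ins, not the asker's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features, labels in {0, 1, 2}
X = np.arange(20).reshape(10, 2)
y = np.array([2, 2, 2, 2, 0, 1, 1, 0, 0, 1])

# Correct order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Order used in the question: the name "ytrain" receives the test features
xtrain, ytrain, xtest, ytest = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(ytrain.shape)                  # (2, 2): 2-D, so it is not a valid label array
```

With the names in the right order, both features can be used as they are; nothing about having two columns in X is a problem. Separately, the question's code indexes a 'top' column while the sample data shows 'category'; if that is not just a renaming made for the post, the column names would need to match as well.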