Logistic Regression prediction faults

I have been working on the Titanic survival problem. I used the passengers as x and the survivors as y. The problem is that I cannot get a meaningful y_pred (i.e. the predicted results), because every predicted value is 0. If anyone can figure out what is wrong it would really help me, since this is my first classifier problem as a beginner.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


df = pd.read_csv('C:/Users/Umer/train.csv')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']


from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25, 
random_state = 0)


from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train,y_train)

#predicting the test set results


y_pred = classifier.predict(x_test)

I could not reproduce the same result. In fact, I copy-pasted your code, and the predictions did not all come out as zero as you described; instead I got:

[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]

That said, I noticed a few things in your approach that you may want to know about:

  1. The default separator in Pandas read_csv is the comma (,), so if your dataset's columns are separated by tabs (as is the case with mine), then you should specify the separator like this:

    df = pd.read_csv('titanic.csv', sep='\t')
    
  2. PassengerId carries no useful information that your model could learn from in order to predict Survived; it is just a consecutive number that increases with each new passenger. In general, for classification you want to make use of all the features your model can learn from (unless, of course, some features are redundant and add no information), especially since your dataset is a multivariate one. The update below demonstrates this.

  3. Scaling PassengerId makes no sense, because feature scaling is usually applied when features vary greatly in magnitude, unit and range (e.g. 5 kg versus 5000 g); in your case, as mentioned, it is just an incrementing integer that carries no real information for the model (the short sketch further below shows a case where scaling does make sense).

  4. One last thing: you should pass data of type float to StandardScaler, to avoid warnings like this one:

    DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
    

    So convert it like this right from the start:

    x = df['PassengerId'].values.astype(float).reshape(-1,1)
    

Finally, if you still get the same result, please add a link to your dataset.
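
To make point 3 above more concrete, here is a minimal sketch of a case where StandardScaler is actually useful. It assumes the Fare and Age columns of the standard Kaggle Titanic train.csv, which vary widely in magnitude and range, unlike the incremental PassengerId:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('train.csv')
# Fare spans roughly 0-512 and Age roughly 0-80, so the two columns live
# on very different scales, unlike the purely incremental PassengerId
features = df[['Fare', 'Age']].fillna(df[['Fare', 'Age']].median())

sc = StandardScaler()
scaled = sc.fit_transform(features)

print(features.values[:3])  # original magnitudes
print(scaled[:3])           # each column now has zero mean and unit variance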


Update

After you provided the dataset, it turned out that the result you are getting is indeed correct, and that is again because of point 2 above (i.e. PassengerId provides no useful information to the model, so it cannot predict properly!).

You can test this yourself by comparing the log loss before and after adding more features from the dataset:

from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.read_csv('train.csv', sep=',')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
random_state = 0)
classifier = LogisticRegression()
classifier.fit(x_train,y_train)
y_pred_train = classifier.predict(x_train)
# calculate and print the loss function using only the PassengerId
print(log_loss(y_train, y_pred_train))
#predicting the test set results
y_pred = classifier.predict(x_test)
print(y_pred)

Output

13.33982681120802
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0]

Now, using several features that should be informative:

from sklearn.metrics import log_loss
df = pd.read_csv('train.csv', sep=',')
# denote the words female and male as 0 and 1
df['Sex'].replace(['female','male'], [0,1], inplace=True)
# try three features that you think they are informative to the model
# so it can learn from them
x = df[['Fare', 'Pclass', 'Sex']].values.reshape(-1,3)
y = df['Survived']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
random_state = 0)
classifier = LogisticRegression()
classifier.fit(x_train,y_train)
y_pred_train = classifier.predict(x_train)
# calculate and print the loss function with the above 3 features
print(log_loss(y_train, y_pred_train))
#predicting the test set results
y_pred = classifier.predict(x_test)
print(y_pred)

Output

7.238735137632405
[0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0
 1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1
 1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1
 1]

Conclusion:

As you can see, the loss now has a better (lower) value than before, and the predictions are much more reasonable!
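
As a side note: log_loss is usually computed on predicted probabilities rather than on hard 0/1 labels; with hard labels every misclassified sample is charged a large constant penalty, which is why the numbers above are so big. A minimal sketch, assuming the classifier and the train/test split from the last snippet are still in scope:

from sklearn.metrics import log_loss

# predict_proba returns class probabilities instead of hard 0/1 labels
proba_train = classifier.predict_proba(x_train)
print(log_loss(y_train, proba_train))

proba_test = classifier.predict_proba(x_test)
print(log_loss(y_test, proba_test))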