How to apply OneHotEncoder to the Titanic dataset
I am trying to work with the Titanic dataset. The data has categorical values, so I used LabelEncoder to convert the text into numbers. Before:
   PassengerId  Survived  Pclass     Sex    Age  SibSp  Parch     Fare  Embarked
0            1         0       3    male  22.00      1      0   7.2500         S
1            2         1       1  female  38.00      1      0  71.2833         C
2            3         1       3  female  26.00      0      0   7.9250         S
After:
   PassengerId  Survived  Pclass  Sex    Age  SibSp  Parch     Fare  Embarked
0            1         0       3    1  22.00      1      0   7.2500         2
1            2         1       1    0  38.00      1      0  71.2833         0
2            3         1       3    0  26.00      0      0   7.9250         2
Here is the code:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
Now, because each passenger's sex is equally important, I want to use OneHotEncoder. As I understand it, the data should look like this:
   PassengerId  Survived  Pclass  Male  Female    Age  SibSp  Parch     Fare  Embarked
0            1         0       3     1       0  22.00      1      0   7.2500         2
1            2         1       1     0       1  38.00      1      0  71.2833         0
2            3         1       3     0       1  26.00      0      0   7.9250         2
How do I write the code to do this? I tried a similar approach with OneHotEncoder:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
onehotencoder = OneHotEncoder()
data['Embarked'] = onehotencoder.fit_transform(data['Embarked'].values.reshape(-1,1))
But it just returns the same result. How can I fix this? I am new to scikit-learn and ML, and I hope I am doing this correctly.
This is how you can do it.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
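# Sample data: Sex already label-encoded (0 = female, 1 = male)
df = pd.DataFrame({'Sex': [1, 0, 0, 1]})
df
   Sex
0    1
1    0
2    0
3    1
# OneHotEncoder expects 2D input, so reshape the column's NumPy values
result = OneHotEncoder().fit_transform(df['Sex'].values.reshape(-1, 1)).toarray()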
# Appending columns
df[['Female', 'Male']] = pd.DataFrame(result, index = df.index)
# Resulting dataframe
df
   Sex  Female  Male
0    1     0.0   1.0
1    0     1.0   0.0
2    0     1.0   0.0
3    1     0.0   1.0
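As a possible variation on the approach above, the sketch below skips the LabelEncoder step and feeds the original string column straight to OneHotEncoder, then swaps the new columns in for `Sex`. It assumes a scikit-learn version whose OneHotEncoder accepts string input, and the three-row `data` frame is a hypothetical sample built from the rows shown in the question, not the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample mirroring the first three rows from the question
data = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame, which is already 2D
encoded = encoder.fit_transform(data[['Sex']]).toarray()

# categories_[0] holds the sorted labels seen in the single input column
columns = [str(label) for label in encoder.categories_[0]]
onehot = pd.DataFrame(encoded, columns=columns, index=data.index)

# Replace the original Sex column with the one-hot columns
data = pd.concat([data.drop(columns='Sex'), onehot], axis=1)
print(data)
```

Taking the column names from `encoder.categories_` avoids hard-coding the `Female`/`Male` order. If a fitted encoder is not needed later (for example, to transform a test set), `pd.get_dummies(data, columns=['Sex'])` is a shorter route to a similar result.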