Encoding a feature in a DataFrame (both numerical and non-numerical)

Question

I want to encode a feature that contains both non-numerical and numerical information. Here is my code:

import pandas as pd
from sklearn import preprocessing
dictionary = {'Values':['Y','N','Y','N',99,'N'],'AGE':[23,24,12,-99,778,13]}
df = pd.DataFrame(dictionary, index = range(1,len(dictionary['Values'])+1))

encoder = preprocessing.LabelEncoder()
columns = df.columns.tolist()
for i in columns:
    if(df[i].dtype == 'object'):
        df[i] = encoder.fit_transform(df[i])
df

My expected output is:

    Values  AGE
1   1       23
2   0       24
3   1       12
4   0       -99
5   99      778
6   0       13

But I get the error "TypeError: '<' not supported between instances of 'int' and 'str'".

Any suggestions on how to fix this?

Answer 1

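LabelEncoder() sorts the labels internally, and that does not work on a mixed-type column: it cannot compare strings with integers (how would you know whether 'a' is greater or less than 1?). You can leave the integers in the dataset alone and apply fit_transform only to the rows where Values is a string: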

import pandas as pd
from sklearn import preprocessing

dictionary = {'Values':['Y','N','Y','N',99,'N'],'AGE':[23,24,12,-99,778,13]}
df = pd.DataFrame(dictionary, index = range(1,len(dictionary['Values'])+1))

encoder = preprocessing.LabelEncoder()

# True for the rows whose value is a string
mask = df['Values'].apply(lambda x: isinstance(x, str))
# Assign through .loc rather than chained indexing so the write reaches df itself
df.loc[mask, 'Values'] = encoder.fit_transform(df.loc[mask, 'Values'])

df
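
The sorting problem described above can be reproduced without scikit-learn. This is only a small sketch in plain Python, not what LabelEncoder does line by line; the names values, sort_failed and string_classes are made up for the illustration.

```python
# Mixed str/int values cannot be ordered in Python 3, which is what breaks
# anything that tries to sort the unique values of such a column.
values = ['Y', 'N', 'Y', 'N', 99, 'N']

try:
    sorted(set(values))
    sort_failed = False
    message = ''
except TypeError as exc:
    sort_failed = True
    message = str(exc)

# Sorting only the string entries works, which is why masking the
# strings first avoids the error.
string_classes = sorted({v for v in values if isinstance(v, str)})
```

Here string_classes ends up as ['N', 'Y'], so 'N' would get label 0 and 'Y' label 1, which matches the expected output in the question.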

(Just one question: why do you need if(df[i].dtype == 'object') and the whole for loop? If you only want to encode the first column, I don't see why you would loop over the columns.)
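
If the loop was meant to handle several mixed columns, the same masking idea can be wrapped in a function. This is a rough sketch under that assumption, not something from the question: encode_string_entries is a hypothetical helper name, and it uses a fresh LabelEncoder per column.

```python
import pandas as pd
from sklearn import preprocessing


def encode_string_entries(frame):
    """Hypothetical helper: label-encode only the str entries of object columns."""
    result = frame.copy()
    for column in result.columns:
        # Numeric columns such as AGE are left untouched.
        if result[column].dtype != 'object':
            continue
        # True for the rows whose value is a string.
        mask = result[column].apply(lambda x: isinstance(x, str))
        if not mask.any():
            continue
        # A separate encoder per column keeps the label sets independent.
        encoder = preprocessing.LabelEncoder()
        result.loc[mask, column] = encoder.fit_transform(result.loc[mask, column])
    return result


df = pd.DataFrame(
    {'Values': ['Y', 'N', 'Y', 'N', 99, 'N'], 'AGE': [23, 24, 12, -99, 778, 13]},
    index=range(1, 7),
)
encoded = encode_string_entries(df)
```

One thing to watch: the encoded labels start at 0, so they can collide with numbers that were already in the column (a real 0 or 1 in Values would be indistinguishable from an encoded 'N' or 'Y'). The column may also still have object dtype afterwards; pd.to_numeric is one option if a numeric dtype is needed.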
