One-hot-encoding multiple columns in sklearn and naming columns
I have the following code to one-hot-encode 2 columns I have.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)
# encode phone labels the same way
phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)
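What I am wondering is how to do this in 4 lines while getting properly named columns in the output. That is, I can create a correctly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting DataFrame's columns, it tells me the shape of the indices does not match: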
ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)
For background, both phone and city have 3 values.
    city    phone
0  CityA   iPhone
1  CityB  Android
2  CityB   iPhone
3  CityA   iPhone
4  CityC  Android
Why don't you take a look at pd.get_dummies?
Here is how you can encode:
df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
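For reference, here is a minimal sketch of what this gives on the five sample rows from the question. The DataFrame is rebuilt by hand here, which is an assumption for illustration; the real frame has 50000 rows. pd.get_dummies prefixes every new column with the name of its source column, so the naming comes for free:

```python
import pandas as pd

# Sample rows copied from the question (illustrative subset only).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

# Each categorical column is expanded into one indicator column per value,
# named "<column>_<value>".
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

On this sample the resulting columns are city_CityA, city_CityB, city_CityC, phone_Android and phone_iPhone.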
You are almost there... Like you said, you can add all the columns you want to encode directly in fit_transform.
ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_
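Then you just need to do the following:
# categories_ is a list with one array per encoded column; flatten it into a
# single array of labels (np.concatenate also works when the columns have
# different numbers of categories, where np.array(...).ravel() would not)
feature_labels = np.concatenate(feature_labels)
This lets you name the columns the way you want: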
features = pd.DataFrame(feature_arr, columns=feature_labels)
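One caveat with using the bare category values as labels: if two columns share a value, the resulting column names collide. A hedged sketch that prefixes each label with its source column, built only from categories_, might look like this. The names cols and labels and the hand-built sample frame are assumptions for illustration, not part of the original code:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows copied from the question (illustrative subset only).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

cols = ["phone", "city"]  # hypothetical helper list of columns to encode
ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[cols]).toarray()

# categories_ holds one array of values per input column, in the order of cols,
# so pairing it with cols gives "<column>_<value>" labels.
labels = [f"{col}_{cat}" for col, cats in zip(cols, ohe.categories_) for cat in cats]

features = pd.DataFrame(feature_arr, columns=labels, index=df.index)
print(features.columns.tolist())
```

Recent scikit-learn releases also expose get_feature_names_out on the fitted encoder, which produces similar prefixed names without the manual loop.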
cat_features = ["gender", "cholesterol", "gluc", "smoke", "alco"]
data = pd.get_dummies(data, columns=cat_features)
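The columns argument limits the encoding to the listed columns and leaves everything else untouched. A small sketch on the question's sample data, with a made-up numeric age column added purely for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
    "age": [31, 25, 47, 38, 29],  # hypothetical column, not in the question
})

cat_features = ["city", "phone"]

# Only the listed columns are expanded; "age" passes through unchanged.
data = pd.get_dummies(data, columns=cat_features)
print(data.columns.tolist())
```

The untouched columns come first in the result, followed by the indicator columns for each encoded column in turn.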