How to make a binary decomposition of a pandas.Series column

Question

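I want to decompose a pandas.Series into several other columns (number of columns = number of distinct values), save that decomposition, and use it with other DataFrames or Series. Something like pandas.get_dummies, but one that remembers the mapping and can handle NaN.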

Example.
Given the following DataFrame:

   A  B
0  a  0
1  b  1
2  a  2
3  c  3

I want to decompose the Series A into:

   A_a  A_b  A_c  B
0    1    0    0  0
1    0    1    0  1
2    1    0    0  2
3    0    0    1  3

Then I want to save that factorization and apply it to other DataFrames (note that this input has no c value in column A):

   A  B           A_a  A_b  A_c  B
0  a  0        0    1    0    0  0
1  a  1   ->   1    1    0    0  1
2  b  2        2    0    1    0  2

Is there an automatic way to do this? I can do it manually. I tried scikit-learn's LabelEncoder, but it cannot handle NaN. I want to use this for a classification model.

Answer 1

I don't think there is a way to do this fully automatically, but str.get_dummies gets you most of the way:

In [11]: res = df.pop("A").str.get_dummies()  # Note: pop removes column A from df

In [12]: res.columns = res.columns.map(lambda x: "A_" + x)

In [13]: res
Out[13]:
   A_a  A_b  A_c
0    1    0    0
1    0    1    0
2    1    0    0
3    0    0    1

In [14]: res.join(df)
Out[14]:
   A_a  A_b  A_c  B
0    1    0    0  0
1    0    1    0  1
2    1    0    0  2
3    0    0    1  3

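To standardize, I would reindex on the columns you want, i.e. force df2 to have the columns of df1 (here df1 and df2 are the already-encoded frames). Older pandas spelled this reindex_axis; that method was deprecated and then removed in pandas 1.0, so use reindex:

df2.reindex(columns=df1.columns, fill_value=0)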
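
The whole round trip (encode one frame, save its columns, apply them to another frame) might look roughly like the sketch below. It assumes a recent pandas, uses `pd.get_dummies` with `prefix` instead of `str.get_dummies` plus a manual rename, and the names `encode` and `saved_columns` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Frames from the question: df2 has no "c" value in column A.
df1 = pd.DataFrame({"A": ["a", "b", "a", "c"], "B": [0, 1, 2, 3]})
df2 = pd.DataFrame({"A": ["a", "a", "b"], "B": [0, 1, 2]})


def encode(df, column, columns=None):
    """Hypothetical helper: one-hot encode `column`, optionally
    aligning the result to a previously saved column layout."""
    # NaN rows simply get all zeros (dummy_na defaults to False).
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    out = dummies.join(df.drop(columns=[column]))
    if columns is not None:
        # Force the saved layout: missing dummy columns are filled
        # with 0, and columns not in the saved layout are dropped.
        out = out.reindex(columns=columns, fill_value=0)
    return out


enc1 = encode(df1, "A")
saved_columns = enc1.columns          # the "saved" decomposition
enc2 = encode(df2, "A", columns=saved_columns)
print(enc2)
```

Note that with this approach a category that appears only in the second frame is silently dropped by `reindex`, so it may be worth checking for unseen values before encoding if that matters for the model.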
