How to convert (Not-One) Hot Encodings to a Column with Multiple Values on the Same Row

Question

I basically want to reverse the process posed in a related question.

>>> import pandas as pd
>>> example_input = pd.DataFrame({"one"   : [0,1,0,1,0], 
                                  "two"   : [0,0,0,0,0],
                                  "three" : [1,1,1,1,0],
                                  "four"  : [1,1,0,0,0]
                                  })
>>> print(example_input)
   one  two  three  four
0    0    0      1     1
1    1    0      1     1
2    0    0      1     0
3    1    0      1     0
4    0    0      0     0
>>> desired_output = pd.DataFrame(["three, four", "one, three, four",
                                   "three", "one, three", ""])
>>> print(desired_output)
                  0
0       three, four
1  one, three, four
2             three
3        one, three
4

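There are plenty of questions about reversing one-hot encoding (examples 1 & 2), but the answers rely on only one binary class being active per row, whereas my data can have multiple classes active in the same row.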

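One existing question comes close to what I need, but it puts the multiple classes on separate rows. I need each result to be a single string joined by a separator (for instance ", "), so that the output has the same number of rows as the input.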

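Using the ideas found in these two questions (1 & 2), I was able to come up with a solution, but it requires a plain Python for loop to iterate over the rows, which I suspect will be slow compared to a solution that uses pandas throughout.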

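The input dataframe can use actual booleans instead of the integer encoding if that makes things easier. The output can be a DataFrame or a Series; I will eventually add the resulting column to a larger dataframe. I am also open to using numpy if it allows a better solution, but otherwise I would prefer to stick with pandas.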

Answer 1

Here is a solution that uses a Python list comprehension to iterate over each row:

import pandas as pd

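def reverse_hot_encoding(df, sep=', '):
    # Treat every cell as a boolean flag so each row can act as a mask.
    mask = df.astype(bool)
    # For each row, pick the column names whose flag is True and join them.
    labels = [sep.join(df.columns[row]) for _, row in mask.iterrows()]
    return pd.Series(labels)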

if __name__ == '__main__':
    example_input = pd.DataFrame({"one"   : [0,1,0,1,0], 
                                  "two"   : [0,0,0,0,0],
                                  "three" : [1,1,1,1,0],
                                  "four"  : [1,1,0,0,0]
                                  })
    print(reverse_hot_encoding(example_input))

Here is the output:

0         three, four
1    one, three, four
2               three
3          one, three
4                    
dtype: object
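
Since the question mentions adding the result to a larger dataframe, note that the function above returns a Series with a fresh 0..n-1 index. If the input has a different index, assignment would align on labels and could produce missing values. A minimal sketch of a variant that reuses the input index follows; the name reverse_hot_encoding_keep_index and the small flags/larger frames are hypothetical and only for illustration.

```python
import pandas as pd

def reverse_hot_encoding_keep_index(df, sep=", "):
    # Hypothetical variant: same row-wise logic, but the result keeps df.index.
    mask = df.astype(bool).to_numpy()
    # Each row of mask is a boolean array selecting the active column names.
    labels = [sep.join(df.columns[row]) for row in mask]
    return pd.Series(labels, index=df.index)

# Illustrative data with a non-default index (an assumption, not from the question).
flags = pd.DataFrame({"one": [0, 1], "three": [1, 1]}, index=["a", "b"])
larger = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

# Assignment aligns on the shared index labels "a" and "b".
larger["labels"] = reverse_hot_encoding_keep_index(flags)
print(larger)
```

This still loops over rows in Python, so it addresses index alignment only, not speed.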

Answer 2

You can use DataFrame.dot, which is much faster than iterating over all the rows in the dataframe:

df.dot(df.columns + ', ').str.rstrip(', ')

0         three, four
1    one, three, four
2               three
3          one, three
4                    
dtype: object
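
To see why the dot product works here, it may help to spell the computation out by hand. In Python, multiplying a string by 1 returns the string and multiplying by 0 returns the empty string, so summing those products across a row concatenates the names of the active columns. The sketch below assumes the example_input frame from the question; the names sep, manual and via_dot are illustrative.

```python
import pandas as pd

example_input = pd.DataFrame({"one":   [0, 1, 0, 1, 0],
                              "two":   [0, 0, 0, 0, 0],
                              "three": [1, 1, 1, 1, 0],
                              "four":  [1, 1, 0, 0, 0]})
sep = ", "

# Manual version: repeat "name, " once if the flag is 1, zero times if it is 0,
# then concatenate the pieces for each row.
manual = [
    "".join((name + sep) * int(flag)
            for name, flag in zip(example_input.columns, row))
    for row in example_input.to_numpy()
]
# Drop the trailing separator left after the last active column.
manual = [s.rstrip(sep) for s in manual]

# The same computation written as a matrix-vector product.
via_dot = example_input.dot(example_input.columns + sep).str.rstrip(sep)

print(manual == via_dot.tolist())
```

One caveat: str.rstrip(', ') strips any trailing commas and spaces rather than the exact suffix, so a column name that itself ends in a comma or space would be trimmed as well.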


Tags: python, numpy, dataframe, pandas, one-hot-encoding