Pandas, reverse one hot encoding

Question

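I one-hot encoded some variables, and after some computations I would like to retrieve the original variable.

What I am doing is:

I filter the one-hot encoded column names (they all start with the name of the original variable, say 'mycol'):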

filter_col = [col for col in df if col.startswith('mycol')]

Then I can simply multiply the filtered columns by their column names.

X_test[filter_col]*filter_col

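However, this results in a sparse matrix. How do I create a single variable from it? Summing doesn't work, because the blanks are treated as numbers; with sum(X_test[filter_col]*filter_col) I get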

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Any suggestions on how to proceed? Is this the best way, or is there some function that does what I need?

As requested, here is an example, taken from here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mycol': np.random.choice(['panda', 'python', 'shark'], 10),
})

df = pd.get_dummies(df)

Answer 1

If you need the summed value per row:

(X_test[filter_col]*filter_col).sum(axis=1)
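
For context on the TypeError in the question, here is a minimal sketch with made-up data (the column names `mycol_panda` and `mycol_shark` are assumptions for illustration). Iterating over a DataFrame yields its column labels, so the built-in `sum` ends up adding `0` to a string, while the DataFrame's own `sum` method reduces along the axis you ask for:

```python
import pandas as pd

# Hypothetical 0/1 dummy columns, not the asker's real data.
X_test = pd.DataFrame({
    'mycol_panda': [1, 0, 0],
    'mycol_shark': [0, 1, 0],
})
filter_col = [col for col in X_test if col.startswith('mycol')]

# Iterating a DataFrame yields its column labels, not its rows or values.
labels_seen = list(X_test[filter_col])

# So the built-in sum computes 0 + 'mycol_panda' and raises TypeError.
try:
    sum(X_test[filter_col])
    error = None
except TypeError as exc:
    error = str(exc)

# The DataFrame method reduces along an axis instead; axis=1 means per row.
row_totals = X_test[filter_col].sum(axis=1)

print(labels_seen)
print(error)
print(row_totals.tolist())
```

The same iteration behaviour applies to the multiplied frame in the question, since it keeps the same column labels.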

A solution that also handles rows with only 0s or with multiple 1s:

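X_test = pd.DataFrame({
    'mycolB': [0, 1, 1, 0],
    'mycolC': [0, 0, 1, 0],
    'mycolD': [1, 0, 0, 0],
})

filter_col = [col for col in X_test if col.startswith('mycol')]
# Matrix product of the 0/1 flags with the column names: 1 * 'name, ' keeps
# the name, 0 * 'name, ' gives '', and the row sum concatenates the pieces.
df = X_test[filter_col].dot(pd.Index(filter_col) + ', ').str.strip(', ')
print(df)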
0            mycolD
1            mycolB
2    mycolB, mycolC
3                  
dtype: object
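
If the dot product with strings feels too implicit, a slower but more explicit row-wise version might look like the sketch below. The helper name `active_labels` is hypothetical; it is only meant to illustrate what the matrix product computes, not to replace it:

```python
import pandas as pd

X_test = pd.DataFrame({
    'mycolB': [0, 1, 1, 0],
    'mycolC': [0, 0, 1, 0],
    'mycolD': [1, 0, 0, 0],
})
filter_col = [col for col in X_test if col.startswith('mycol')]


def active_labels(row):
    # Hypothetical helper: join the names of the columns whose flag is 1.
    # A row with no 1s produces an empty string.
    return ', '.join(col for col, flag in row.items() if flag == 1)


# apply with axis=1 calls the helper once per row.
labels = X_test[filter_col].apply(active_labels, axis=1)
print(labels.tolist())
```

This should give the same labels as the example output above, including the empty string for the all-zero row. On large frames the vectorised dot version is likely to be faster, since apply runs Python code for every row.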

Answer 2

IIUC, you can use DataFrame.idxmax along axis=1. If necessary, you can remove the dummy prefix with str.replace:

X_test[filter_col].idxmax(axis=1).str.replace('mycol_', '')
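
A small round-trip sketch of the idxmax approach, using made-up data and assuming the default `mycol_` prefix that `pd.get_dummies` produces. It also shows one caveat: for a row of all zeros, idxmax still returns the first column label, so such rows need to be masked if they can occur (the `has_label` mask is an illustrative choice, not part of the original answer):

```python
import pandas as pd

# Hypothetical sample data; dtype=int keeps the dummies as 0/1 integers.
df = pd.DataFrame({'mycol': ['panda', 'python', 'shark', 'panda']})
X_test = pd.get_dummies(df, dtype=int)
filter_col = [col for col in X_test if col.startswith('mycol')]

# idxmax returns, per row, the label of the first column holding the maximum.
recovered = (
    X_test[filter_col]
    .idxmax(axis=1)
    .str.replace('mycol_', '', regex=False)
)
print(recovered.tolist())

# Caveat: zero out one row to simulate a sample with no category set.
X_test.loc[3, filter_col] = 0
has_label = X_test[filter_col].sum(axis=1) > 0

# where() keeps values only where the mask is True and puts NaN elsewhere.
masked = X_test[filter_col].idxmax(axis=1).where(has_label)
print(masked.tolist())
```

Passing `regex=False` treats the prefix as a literal string, which avoids surprises if the prefix ever contains regex metacharacters.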

python

pandas

one-hot-encoding