Splitting Column Lists in Pandas DataFrame

Question

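I'm looking for a good approach to the following problem. My current fix isn't particularly clean, and I'm hoping to learn from your insight.

Suppose I have a Pandas DataFrame whose entries look like this: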

>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])

>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]

>>> df
                            Color                   Texture   IsGlass
    1                         NaN                  ['Rough']        1
    2              ['Red', 'Blue']                       NaN        0 
    3  ['Blue', 'Green', 'Purple']  ['Silky','Shiny','Fuzzy']       1

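So each observation in the index corresponds to something I measured about its color, its texture, and whether or not it is glass. What I would like to do is turn this into a new "indicator" DataFrame by creating a column for each observed value, setting the corresponding entry to 1 if I observed it and to NaN if I have no information.

>>> df
       Red  Blue  Green  Purple  Rough  Silky  Shiny  Fuzzy  IsGlass
    1  NaN   NaN    NaN     NaN      1    NaN    NaN    NaN        1
    2    1     1    NaN     NaN    NaN    NaN    NaN    NaN        0
    3  NaN     1      1       1    NaN      1      1      1        1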

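I have a solution that loops over each column, looks at its values, and, through a series of try/excepts, splits the lists for the non-NaN values, creates a new column, and so on, then concatenates.

This is my first post to Stack Overflow; I hope it conforms to the posting guidelines. Thanks.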

Answer 1

For each texture/color in each row, I check whether the value is null. If it is not, I add that value as a column set to 1 for that row.

import numpy as np
import pandas as pd

df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])

df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass']=[1,0,1]

for row in df.itertuples():

    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index,val] = 1     

    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index,val] = 1
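
The loop above can also be wrapped in a small helper so it works for any set of list-valued columns and leaves the input untouched. This is only a sketch: `add_indicator_columns` and `list_columns` are hypothetical names introduced here, and it assumes missing cells are scalar NaN while observed cells are Python lists, as in the question.

```python
import numpy as np
import pandas as pd


def add_indicator_columns(df, list_columns):
    """Return a new frame with one indicator column per observed list value.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Start from the non-list columns so the originals are not modified.
    out = df.drop(columns=list_columns)
    for col in list_columns:
        for idx, values in df[col].items():
            # Missing cells are scalar NaN; observed cells are lists.
            if isinstance(values, (list, tuple)):
                for val in values:
                    # Assigning to a new label creates the column, NaN elsewhere.
                    out.loc[idx, val] = 1
    return out


df = pd.DataFrame({
    'Color': [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']],
    'Texture': [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']],
    'IsGlass': [1, 0, 1],
}, index=[1, 2, 3])

result = add_indicator_columns(df, ['Color', 'Texture'])
print(result)
```

Unobserved entries stay NaN here, which matches what the question asked for, at the cost of a Python-level loop over every cell.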

Answer 2

Stacking trick!

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)

   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky IsGlass
1     0      0       0    0      0      1      0      0       1
2     1      0       0    1      0      0      0      0       0
3     1      1       1    0      1      0      1      1       1
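
As far as I can tell, the stack/unstack step above exists only to replace the NaN cells with empty lists, so that every cell is iterable before it is binarized. A more explicit way to get the same intermediate frame is sketched below; `nan_to_empty_list` and `cleaned` are hypothetical names, and the sketch assumes observed cells are plain Python lists.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Color': [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']],
    'Texture': [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']],
    'IsGlass': [1, 0, 1],
}, index=[1, 2, 3])


def nan_to_empty_list(cell):
    # Keep real lists as they are; treat anything else (NaN) as "nothing observed".
    return cell if isinstance(cell, list) else []


cleaned = df.copy()
for col in ['Color', 'Texture']:
    cleaned[col] = cleaned[col].apply(nan_to_empty_list)

print(cleaned)
```

After this step each list column holds only lists, so it can be handed to a binarizer the same way `df[c]` is in the answer.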

Answer 3

I'm using pandas' get_dummies

# sum(level=0) is not available in pandas 2.0+, so group on index level 0 instead
l=[pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).groupby(level=0).sum() for x in ['Color','Texture']]
pd.concat(l,axis=1).assign(IsGlass=df.IsGlass)
Out[662]: 
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
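
A variation on the same get_dummies idea, sketched with `Series.explode` (available from pandas 0.25) instead of `apply(pd.Series).stack()`. The helper name `indicators` is hypothetical, and the final `where` step is an optional addition that turns the zeros back into NaN, since the question asked for NaN where nothing was observed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Color': [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']],
    'Texture': [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']],
    'IsGlass': [1, 0, 1],
}, index=[1, 2, 3])


def indicators(series):
    """One 0/1 column per distinct list element (hypothetical helper)."""
    # One row per list element; the original index label is repeated.
    exploded = series.explode()
    # NaN rows produce all-zero dummy rows because dummy_na defaults to False.
    dummies = pd.get_dummies(exploded, dtype=int)
    # Collapse the repeated index labels back to one row per observation.
    return dummies.groupby(level=0).sum()


parts = [indicators(df[c]) for c in ['Color', 'Texture']]
observed = pd.concat(parts, axis=1)

# 0/1 version, same layout as the output above.
result = observed.assign(IsGlass=df['IsGlass'])

# Optional: keep 1 where observed and NaN everywhere else.
with_nan = observed.where(observed == 1).assign(IsGlass=df['IsGlass'])

print(result)
print(with_nan)
```

Note that `where` converts the indicator columns to float, because integer columns cannot hold NaN.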

dataframe

python-3.x

pandas

pandas-groupby