Split lists in a column into rows and label values by intersection

Task 1

Hypothetical dataset

    Name    B   C
0   James   a   a,b,c,d
1   James   a   NaN
2   Rudy    b   a,f
3   Karl    c   e,c

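In column C the values are lists. I want to split them so that each element gets its own row, and drop the rows where column C is NaN.

Desired output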

    Name    B   C
0   James   a   a
1   James   a   b
2   James   a   c
3   James   a   d
4   Rudy    b   a
5   Rudy    b   f
6   Karl    c   e
7   Karl    c   c

Task 2

I want to label each value according to the relationship between James, Rudy, Karl and column 'C'.

Labeling criteria (∩ denotes intersection)

Label    column 'C' value
 0       James  
 1       Rudy   
 2       Karl   
 3       James ∩ Rudy   
 4       James ∩ Karl       
 5       Rudy ∩ Karl        
 6       James ∩ Rudy ∩ Karl

I want to label each column 'C' value according to which names it belongs to.

The final result I want

    Name    B   C   Label
0   James   a   a   3
1   James   a   b   0
2   James   a   c   4
3   James   a   d   0
4   Rudy    b   a   3
5   Rudy    b   f   1
6   Karl    c   e   2
7   Karl    c   c   4

For example, 'a' in column 'C' is labeled 3 because it appears for both James and Rudy.

This is hard for me. I would appreciate it if you could help.

Thank you for reading.

For Task 1, if the data in column C are lists as you say, you can use explode.

df.explode('C').dropna()

    Name    B   C
0   James   a   a
0   James   a   b
0   James   a   c
0   James   a   d
2   Rudy    b   a
2   Rudy    b   f
3   Karl    c   e
3   Karl    c   c

As for Task 2, I don't quite understand the logic.

For the first part, use DataFrame.explode with DataFrame.dropna and DataFrame.reset_index(drop=True) to get a default index:

# if the values are lists
df1 = df.explode('C').dropna(subset=['C']).reset_index(drop=True)
# if the values are comma-separated strings, split them first
#df1 = df.assign(C = df['C'].str.split(',')).explode('C').dropna(subset=['C']).reset_index(drop=True)
print (df1)
    Name  B  C
0  James  a  a
1  James  a  b
2  James  a  c
3  James  a  d
4   Rudy  b  a
5   Rudy  b  f
6   Karl  c  e
7   Karl  c  c
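
The table in the question prints C as "a,b,c,d", so the column may actually hold comma-separated strings rather than lists. Here is a minimal self-contained sketch of that variant, assuming the strings contain no spaces around the commas (the sample data is retyped from the question):

```python
import numpy as np
import pandas as pd

# Sample data from the question, ASSUMING column C holds comma-separated strings
df = pd.DataFrame({
    "Name": ["James", "James", "Rudy", "Karl"],
    "B": ["a", "a", "b", "c"],
    "C": ["a,b,c,d", np.nan, "a,f", "e,c"],
})

df1 = (
    df.assign(C=df["C"].str.split(","))  # strings become lists, NaN stays NaN
      .explode("C")                      # one row per list element
      .dropna(subset=["C"])              # drop the row whose C was NaN
      .reset_index(drop=True)            # renumber the rows from 0
)
print(df1)
```

Passing subset=['C'] to dropna keeps rows that happen to have NaN in other columns, which a bare dropna() would also remove.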

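Then create a second DataFrame whose keys are frozensets (the hashable version of sets), so the order of the names does not matter:

import pandas as pd
from itertools import chain, combinations

def all_subsets(ss):
    # every non-empty combination of ss, ordered by size: singles, pairs, ...
    return chain.from_iterable(combinations(ss, r) for r in range(1, len(ss) + 1))

# number the subsets in order; a frozenset is hashable and ignores element order
L = [(i, frozenset(x)) for i, x in enumerate(all_subsets(df['Name'].unique()))]
df2 = pd.DataFrame(L, columns=['Label','C'])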
print (df2)
   Label                    C
0      0              (James)
1      1               (Rudy)
2      2               (Karl)
3      3        (Rudy, James)
4      4        (James, Karl)
5      5         (Rudy, Karl)
6      6  (Rudy, James, Karl)
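
To see why frozensets are convenient here, the same lookup can be sketched with a plain dict and no pandas. The name label_of is only for illustration, and the list of names is typed in by hand in order of first appearance:

```python
from itertools import chain, combinations

names = ["James", "Rudy", "Karl"]  # order of first appearance in the data

def all_subsets(items):
    # every non-empty combination: singles first, then pairs, then the triple
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )

# hypothetical helper dict: frozenset of names -> label number
label_of = {frozenset(s): i for i, s in enumerate(all_subsets(names))}

# element order does not affect equality or hashing of a frozenset
print(label_of[frozenset(["Rudy", "James"])])  # 3
print(label_of[frozenset(["Karl", "James"])])  # 4
```

A plain set could not be used as a dict key or as an index value, because it is mutable and therefore unhashable.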

Then use DataFrame.set_index to build a Series for Series.map: first map each 'C' value to the frozenset of names that have it, then map those frozensets to Labels:

s = df2.set_index('C')['Label']
df1["Label"] = df1['C'].map(df1.groupby('C')['Name'].apply(frozenset)).map(s)
print (df1)

    Name  B  C  Label
0  James  a  a      3
1  James  a  b      0
2  James  a  c      4
3  James  a  d      0
4   Rudy  b  a      3
5   Rudy  b  f      1
6   Karl  c  e      2
7   Karl  c  c      4
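
If the double map is hard to follow, it may help to look at the intermediate Series on its own. This is a small sketch on the already exploded data, retyped by hand; names_per_value is just an illustrative name:

```python
import pandas as pd

# Exploded data from Task 1, typed in by hand for a self-contained example
df1 = pd.DataFrame({
    "Name": ["James", "James", "James", "James", "Rudy", "Rudy", "Karl", "Karl"],
    "B": ["a", "a", "a", "a", "b", "b", "c", "c"],
    "C": ["a", "b", "c", "d", "a", "f", "e", "c"],
})

# For each distinct C value, collect the set of names that have it
names_per_value = df1.groupby("C")["Name"].apply(frozenset)
print(names_per_value)
```

Each entry of this Series is a frozenset, so it can be looked up directly in a Series or dict keyed by frozensets, which is what the second map does.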
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name':['James', 'James', 'Rudy','Karl'],
                   'B':['a','a','b','c'],
                   'C':[['a','b','c','d'], np.nan, ['a','f'], ['e','c']]})

# Task 1: one row per list element, drop rows where C is NaN, then renumber
df = df.explode(column='C').dropna(subset=['C']).reset_index(drop=True)


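# Task 2
# keys list the names in alphabetical order, matching sorted() below
labels = {'James'                :0,
          'Rudy'                 :1,
          'Karl'                 :2,
          'James ∩ Rudy'         :3,
          'James ∩ Karl'         :4,
          'Karl ∩ Rudy'          :5,
          'James ∩ Karl ∩ Rudy'  :6}

# for each C value, join the distinct names that have it and look up the label
C_to_labels = df.groupby('C')['Name'].apply(lambda x: labels[' ∩ '.join(sorted(set(x)))])
df['Label'] = df['C'].map(C_to_labels)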

Result:

    Name  B  C  Label
0  James  a  a      3
1  James  a  b      0
2  James  a  c      4
3  James  a  d      0
4   Rudy  b  a      3
5   Rudy  b  f      1
6   Karl  c  e      2
7   Karl  c  c      4
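
Typing the labels dict by hand gets error-prone as the number of names grows, so it could also be generated. A rough sketch, assuming the labels should follow the numbering in the question (single names first, then pairs, then the triple, each in order of first appearance); the variable names are mine:

```python
from itertools import combinations

import pandas as pd

# Exploded data from Task 1, typed in by hand for a self-contained example
df = pd.DataFrame({
    "Name": ["James", "James", "James", "James", "Rudy", "Rudy", "Karl", "Karl"],
    "B": ["a", "a", "a", "a", "b", "b", "c", "c"],
    "C": ["a", "b", "c", "d", "a", "f", "e", "c"],
})

names = list(df["Name"].unique())  # James, Rudy, Karl

# Build the lookup: alphabetically sorted names joined by ' ∩ ' -> running number
labels = {}
for size in range(1, len(names) + 1):
    for combo in combinations(names, size):
        labels[" ∩ ".join(sorted(combo))] = len(labels)

# For each C value, join the distinct names that have it and look up the label
c_to_label = df.groupby("C")["Name"].apply(
    lambda x: labels[" ∩ ".join(sorted(set(x)))]
)
df["Label"] = df["C"].map(c_to_label)
print(df)
```

Sorting the names on both sides keeps the generated keys and the looked-up keys in the same form, so no combination can be missed because of name order.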