Search for each line of a multi-line cell in another column and, if a match is found, insert a row below the match in pandas

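How can I check whether the values in a column "B" cell (which may contain several lines) occur in column "A" and, if they do, insert the whole row (for example, the row whose "B" value is m32\nm83\nm18) below each row of column "A" where a match was found (for example, m32)?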

Here is the DataFrame:

df

  A      B                  C
  m55    m32\nm83\nm18      123
  m56    m12                546
  m68
  m32
  m83
  m65
  m73    m77\nm78           558
  m23
  m98
  m77
  m18
  m4
  m12
  m78
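
For reproducibility, here is a minimal sketch of how this frame could be built. It assumes the empty cells are NaN and that "C" is numeric; neither is stated explicitly above.

```python
import pandas as pd

# Column A as listed in the table above.
df = pd.DataFrame({"A": ["m55", "m56", "m68", "m32", "m83", "m65", "m73",
                         "m23", "m98", "m77", "m18", "m4", "m12", "m78"]})

# Assigning a Series aligns on the index, so rows without a value become NaN.
df["B"] = pd.Series({0: "m32\nm83\nm18", 1: "m12", 6: "m77\nm78"})
df["C"] = pd.Series({0: 123, 1: 546, 6: 558})

print(df)
```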

This is what I want to get:

df

  A      B                  C
  m55    m32\nm83\nm18      123
  m56    m12                546
  m68
  m32
  m55    m32\nm83\nm18      123
  m83
  m55    m32\nm83\nm18      123
  m65
  m73    m77\nm78           558
  m23
  m98
  m77
  m73    m77\nm78           558
  m18
  m55    m32\nm83\nm18      123
  m4
  m12
  m56    m12                546
  m78
  m73    m77\nm78           558

I tried this:

def insert_row(idx, df, df_insert):
    return df.iloc[:idx, ].append(df_insert).append(df.iloc[idx:, ]).reset_index(drop = True)

dfB = dfB[dfB.apply(lambda x: isinstance(x, str))]
dfBidx = dfB.index

j=0
for b in dfBidx:
    try:
        idx = df.index[df["A"].apply(lambda x: isinstance(x, str)).str.contains("|".join(dfB[b].split("\n")))]
        for i in idx:
            i+=j
            df_new = df.loc[i]
            df = insert_row(i+j+1, df, df_new)
            j+= int(df_new.size/len(df_new.columns.values))
    except:
        pass

Is there another way to do this? I have problems with NaN values in column "A", and in general I run into mismatches when using these functions:

str(), contains(), apply()

Edit:

I have a second DataFrame (df2) from which I extract rows and insert them into df. I extract the rows from one "test" up to the next "test" in the "Keyword" column.

df2

  Keyword      B                  C
  test         m32\nm83\nm18      123
  something
  something
  something
  test
  something
  something
  test         m12                546
  something
  test         m77\nm78           558
  test
  something

So, in the end I need this:

df

  A         Keyword      B                  C
  m55                    m32\nm83\nm18      123
  m56                    m12                546
  m68
  m32
            test         m32\nm83\nm18      123
            something
            something
            something
  m83
            test         m32\nm83\nm18      123
            something
            something
            something
  m65
  m73                    m77\nm78           558
  m23
  m98
  m77
            test         m77\nm78           558
  m18
            test         m32\nm83\nm18      123
            something
            something
            something
  m4
  m12
            test         m12                546
            something
  m78
            test         m77\nm78           558

This solution works with the default RangeIndex.

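Create a dictionary (d1) that maps the index of each source row to the indices of the rows after which it should be inserted. Then, in a list comprehension, repeat each source row once per match and add 0.5 to the target indices so that the copies sort directly below the matched rows. Finally, concat everything together, sort with sort_index, and create a default index with reset_index: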

d = df['B'].dropna().to_dict()
print (d)
{0: 'm32\nm83\nm18', 1: 'm12', 6: 'm77\nm78'}

d1 = {k: df.index[df['A'].str.contains("|".join(v.split("\n")))] for k, v in d.items()}
print (d1)
{0: Int64Index([3, 4, 10], dtype='int64'), 
 1: Int64Index([12], dtype='int64'), 
 6: Int64Index([9, 13], dtype='int64')}
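
One caveat on this step: str.contains does regex substring matching, so a pattern such as m12 would also match m123, and a NaN in column "A" produces NaN in the mask unless na=False is passed. If whole-value matches are what is wanted, Series.isin may be a simpler fit. The sketch below compares the two; the frame demo and the names mask_contains, d_demo and d1_exact are invented for illustration and are not part of the solution above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a NaN in column A and a near-miss value ("m123").
demo = pd.DataFrame({
    "A": ["m55", np.nan, "m12", "m123", "m32"],
    "B": ["m32\nm12", np.nan, np.nan, np.nan, np.nan],
})

# Substring matching: na=False turns the NaN row into "no match",
# but "m123" is still selected because it contains "m12".
mask_contains = demo["A"].str.contains("m32|m12", na=False)
print(demo.index[mask_contains].tolist())   # [2, 3, 4]

# Exact matching: isin compares whole values and never matches NaN here.
d_demo = demo["B"].dropna().to_dict()
d1_exact = {k: demo.index[demo["A"].isin(v.split("\n"))] for k, v in d_demo.items()}
print(d1_exact[0].tolist())                 # [2, 4]
```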

L = [pd.concat([df.loc[[k]]] * len(v)).set_index([v + .5]) for k, v in d1.items()]

df = pd.concat([df] + L).sort_index().reset_index(drop=True)
print (df)
      A              B      C
0   m55  m32\nm83\nm18  123.0
1   m56            m12  546.0
2   m68            NaN    NaN
3   m32            NaN    NaN
4   m55  m32\nm83\nm18  123.0
5   m83            NaN    NaN
6   m55  m32\nm83\nm18  123.0
7   m65            NaN    NaN
8   m73       m77\nm78  558.0
9   m23            NaN    NaN
10  m98            NaN    NaN
11  m77            NaN    NaN
12  m73       m77\nm78  558.0
13  m18            NaN    NaN
14  m55  m32\nm83\nm18  123.0
15   m4            NaN    NaN
16  m12            NaN    NaN
17  m56            m12  546.0
18  m78            NaN    NaN
19  m73       m77\nm78  558.0
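
For the edit with df2, one possible approach is to split df2 into blocks that each start at a "test" row, key every block by the "B" value of its first row, and then walk through column "A" and append the matching block after each matched row. This is only a sketch under several assumptions: empty cells are NaN, matching is exact (isin-style) rather than by substring, each "B" value starts at most one block in df2, and the names block_id, blocks, lookup, pieces and result are made up here.

```python
import pandas as pd

# Sample frames transcribed from the question; empty cells become NaN.
df = pd.DataFrame({"A": ["m55", "m56", "m68", "m32", "m83", "m65", "m73",
                         "m23", "m98", "m77", "m18", "m4", "m12", "m78"]})
df["B"] = pd.Series({0: "m32\nm83\nm18", 1: "m12", 6: "m77\nm78"})
df["C"] = pd.Series({0: 123, 1: 546, 6: 558})

df2 = pd.DataFrame({"Keyword": ["test", "something", "something", "something",
                                "test", "something", "something",
                                "test", "something", "test", "test", "something"]})
df2["B"] = pd.Series({0: "m32\nm83\nm18", 7: "m12", 9: "m77\nm78"})
df2["C"] = pd.Series({0: 123, 7: 546, 9: 558})

# A new block starts at every "test" row; key each block by that row's B value.
block_id = df2["Keyword"].eq("test").cumsum()
blocks = {}
for _, block in df2.groupby(block_id):
    key = block["B"].iloc[0]
    if pd.notna(key):          # blocks whose first row has no B value are skipped
        blocks[key] = block

# Map every single value ("m32", "m83", ...) to the block(s) to insert below it.
lookup = {}
for b in df["B"].dropna().unique():
    if b in blocks:
        for token in b.split("\n"):
            lookup.setdefault(token, []).append(blocks[b])

# Rebuild the frame row by row, appending the blocks after each matched row.
pieces = []
for i, a in df["A"].items():
    pieces.append(df.loc[[i]])
    pieces.extend(lookup.get(a, []))

result = pd.concat(pieces, ignore_index=True)[["A", "Keyword", "B", "C"]]
print(result)
```

Concatenating many one-row frames is slow on large data; for bigger inputs the fractional-index trick from the solution above could be adapted instead.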