Search for string from each line of cell in another column and if match is found, insert row below the found match in pandas
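How can I check whether the values in a cell of column "B" (a cell may contain several lines) occur in column "A" and, if so, insert the whole row (for example, the row holding m32\nm83\nm18) below the row in column "A" where the match was found (for example, m32)?
Here is the DataFrame: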
df
A B C
m55 m32\nm83\nm18 123
m56 m12 546
m68
m32
m83
m65
m73 m77\nm78 558
m23
m98
m77
m18
m4
m12
m78
This is what I want to get:
df
A B C
m55 m32\nm83\nm18 123
m56 m12 546
m68
m32
m55 m32\nm83\nm18 123
m83
m55 m32\nm83\nm18 123
m65
m73 m77\nm78 558
m23
m98
m77
m73 m77\nm78 558
m18
m55 m32\nm83\nm18 123
m4
m12
m56 m12 546
m78
m73 m77\nm78 558
I tried this:
def insert_row(idx, df, df_insert):
    return df.iloc[:idx, ].append(df_insert).append(df.iloc[idx:, ]).reset_index(drop = True)

dfB = dfB[dfB.apply(lambda x: isinstance(x, str))]
dfBidx = dfB.index
j=0
for b in dfBidx:
    try:
        idx = df.index[df["A"].apply(lambda x: isinstance(x, str)).str.contains("|".join(dfB[b].split("\n")))]
        for i in idx:
            i+=j
            df_new = df.loc[i]
            df = insert_row(i+j+1, df, df_new)
            j+= int(df_new.size/len(df_new.columns.values))
    except:
        pass
Is there another way to do this? I have trouble with NaN values in column "A", and in general I get mismatches when using these functions:
str(), contains(), apply()
Edit:
I have a second DataFrame (df2) from which I extract rows and insert them into df. I extract the rows from one "test" to the next "test" in the "Keyword" column.
df2
Keyword B C
test m32\nm83\nm18 123
something
something
something
test
something
something
test m12 546
something
test m77\nm78 558
test
something
So, in the end I need this:
df
A    Keyword    B              C
m55             m32\nm83\nm18  123
m56             m12            546
m68
m32
     test       m32\nm83\nm18  123
     something
     something
     something
m83
     test       m32\nm83\nm18  123
     something
     something
     something
m65
m73             m77\nm78       558
m23
m98
m77
     test       m77\nm78       558
m18
     test       m32\nm83\nm18  123
     something
     something
     something
m4
m12
     test       m12            546
     something
m78
     test       m77\nm78       558
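Solution for a default RangeIndex.
Build a dictionary (d1) that maps the index of each source row to the index of the rows in column A where it has to be inserted. Then repeat each source row in a list comprehension and set its index to the match positions plus 0.5, so that every copy sorts directly below its match. Finally concat everything together, call sort_index, and create a default index with reset_index: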
d = df['B'].dropna().to_dict()
print (d)
{0: 'm32\nm83\nm18', 1: 'm12', 6: 'm77\nm78'}
d1 = {k: df.index[df['A'].str.contains("|".join(v.split("\n")))] for k, v in d.items()}
print (d1)
{0: Int64Index([3, 4, 10], dtype='int64'),
1: Int64Index([12], dtype='int64'),
6: Int64Index([9, 13], dtype='int64')}
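A note on the matching step above, since the question mentions NaN values in column A. The sample data has no missing values in A, so the mask is purely boolean. With missing values, str.contains returns NaN for those cells in many pandas versions, and such a mask usually cannot be used for indexing. Passing na=False avoids that. The pattern is also a regex substring search, so a short token can hit longer values; isin compares whole cells instead. The small Series below is made up for illustration and is not part of the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value and look-alike entries.
a = pd.Series(["m32", np.nan, "m1", "m12", "m183"])
tokens = "m83\nm1".split("\n")  # ['m83', 'm1']

# Regex substring search: "m1" also hits "m12" and "m183".
# na=False makes the missing value count as "no match" instead of NaN.
substring_hits = a.str.contains("|".join(tokens), na=False)

# Whole-cell comparison: only cells equal to a token match, NaN never does.
exact_hits = a.isin(tokens)

print(a.index[substring_hits].tolist())  # [2, 3, 4]
print(a.index[exact_hits].tolist())      # [2]
```

Whether substring or exact matching is wanted depends on the data; for identifiers such as m4 and m12, exact matching is probably the safer default.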
L = [pd.concat([df.loc[[k]]] * len(v)).set_index([v + .5]) for k, v in d1.items()]
df = pd.concat([df] + L).sort_index().reset_index(drop=True)
print (df)
A B C
0 m55 m32\nm83\nm18 123.0
1 m56 m12 546.0
2 m68 NaN NaN
3 m32 NaN NaN
4 m55 m32\nm83\nm18 123.0
5 m83 NaN NaN
6 m55 m32\nm83\nm18 123.0
7 m65 NaN NaN
8 m73 m77\nm78 558.0
9 m23 NaN NaN
10 m98 NaN NaN
11 m77 NaN NaN
12 m73 m77\nm78 558.0
13 m18 NaN NaN
14 m55 m32\nm83\nm18 123.0
15 m4 NaN NaN
16 m12 NaN NaN
17 m56 m12 546.0
18 m78 NaN NaN
19 m73 m77\nm78 558.0
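The code above covers the original question but not the edit, where blocks of rows from df2 are inserted instead of the source row. One possible sketch for that case follows. It assumes that every "test" row in Keyword opens a block that runs until the next "test" row, that only blocks whose first row has a value in B are used, and that the values in A must equal a token exactly. The names block_id, token_to_block, pieces, and result are illustrative, and the frames are rebuilt from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["m55", "m56", "m68", "m32", "m83", "m65", "m73",
          "m23", "m98", "m77", "m18", "m4", "m12", "m78"],
    "B": ["m32\nm83\nm18", "m12", None, None, None, None, "m77\nm78",
          None, None, None, None, None, None, None],
    "C": [123, 546, None, None, None, None, 558,
          None, None, None, None, None, None, None],
})

df2 = pd.DataFrame({
    "Keyword": ["test", "something", "something", "something", "test",
                "something", "something", "test", "something", "test",
                "test", "something"],
    "B": ["m32\nm83\nm18", None, None, None, None, None, None,
          "m12", None, "m77\nm78", None, None],
    "C": [123, None, None, None, None, None, None,
          546, None, 558, None, None],
})

# Each "test" row opens a new block; cumsum gives every row its block number.
block_id = df2["Keyword"].eq("test").cumsum()

# Map every token in the B cell of a block's first row to that block.
token_to_block = {}
for _, block in df2.groupby(block_id):
    first_b = block["B"].iloc[0]
    if isinstance(first_b, str):  # skip blocks whose first row has no B value
        for token in first_b.split("\n"):
            token_to_block[token] = block

# Walk df in order and emit each row, followed by its block if A matches.
pieces = []
for i in df.index:
    pieces.append(df.loc[[i]])
    block = token_to_block.get(df.at[i, "A"])  # NaN or unknown values miss
    if block is not None:
        pieces.append(block)

result = pd.concat(pieces, ignore_index=True)[["A", "Keyword", "B", "C"]]
print(result)
```

This version builds the output in order rather than using the +0.5 index trick, because a block has several rows and would otherwise need a distinct fractional offset per row. If one token appears in more than one block, the last block wins in this sketch.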