Joining/merging multiple dataframes without a matching index
I have a full data set as follows (pandas==1.1.5):
import pandas as pd
all_data_set = [
('A','Area1','AA','A B D E'),
('B','Area1','AA','A B D E'),
('C','Area2','BB','C'),
('D','Area1','CC','A B D E'),
('E','Area1','CC','A B D E'),
('F','Area3','BB','F'),
('G','Area4','AA','G H'),
('H','Area4','CC','G H'),
('I','Area5','BB','I'),
('J','Area6','AA','J L'),
('L','Area6','CC','J L'),
('M','Area5','BB','M')
]
all_df = pd.DataFrame(data = all_data_set, columns = ['Name','Area','Type','Group'])
    Name  Area   Type  Group
0   A     Area1  AA    A B D E
1   B     Area1  AA    A B D E
2   C     Area2  BB    C
3   D     Area1  CC    A B D E
4   E     Area1  CC    A B D E
5   F     Area3  BB    F
6   G     Area4  AA    G H
7   H     Area4  CC    G H
8   I     Area5  BB    I
9   J     Area6  AA    J L
10  L     Area6  CC    J L
11  M     Area5  BB    M
From this data set I created 3 dfs, one per Type:
aa_df = all_df.loc[all_df['Type']=='AA']
aa_df = aa_df.rename(columns={'Group':'AA group'})
bb_df = all_df.loc[all_df['Type']=='BB']
bb_df = bb_df.rename(columns={'Group':'BB group'})
cc_df = all_df.loc[all_df['Type']=='CC']
cc_df = cc_df.rename(columns={'Group':'CC group'})
    Name  Area   Type  AA group
0   A     Area1  AA    A B D E
1   B     Area1  AA    A B D E
6   G     Area4  AA    G H
9   J     Area6  AA    J L

    Name  Area   Type  BB group
2   C     Area2  BB    C
5   F     Area3  BB    F
8   I     Area5  BB    I
11  M     Area5  BB    M

    Name  Area   Type  CC group
3   D     Area1  CC    A B D E
4   E     Area1  CC    A B D E
7   H     Area4  CC    G H
10  L     Area6  CC    J L
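My goal is to join them following these rules:
- All members are grouped by matching Area, i.e. the names in Area1 are A B D E
- AA members are only those with Type = AA, i.e. of A B D E only A and B are Type AA
- CC members are only those with Type = CC
- BB members are always single, and count as both AA and CC members
The resulting df should look like this: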
    Name  Area   Type  All Members  AA Members  CC Members
0   A     Area1  AA    A B D E      A B         D E
1   B     Area1  AA    A B D E      A B         D E
2   C     Area2  BB    C            C           C
3   D     Area1  CC    A B D E      A B         D E
4   E     Area1  CC    A B D E      A B         D E
5   F     Area3  BB    F            F           F
6   G     Area4  AA    G H          G           H
7   H     Area4  CC    G H          G           H
8   I     Area5  BB    I            I           I
9   J     Area6  AA    J L          J           L
10  L     Area6  CC    J L          J           L
11  M     Area5  BB    M            M           M
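I don't know how to join the 3 per-type dfs, because there is no shared index between the 3 types. I think I need some kind of isin that looks back at all_df and references the group. But the group, as you can see, is a string of names separated by spaces, so I guess I may need to convert it to a list?
Is there a way to do this with pandas, or do I need a series of loops and lookups?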
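On the list question: the sketch below shows one way the idea described above (split the space-separated Group string, then use isin against all_df) might look in pandas. It is a minimal sketch under assumptions, not a definitive implementation: `members_of` is a hypothetical helper name, and the output column names are taken from the desired result above.

```python
import pandas as pd

all_data_set = [
    ('A', 'Area1', 'AA', 'A B D E'), ('B', 'Area1', 'AA', 'A B D E'),
    ('C', 'Area2', 'BB', 'C'),       ('D', 'Area1', 'CC', 'A B D E'),
    ('E', 'Area1', 'CC', 'A B D E'), ('F', 'Area3', 'BB', 'F'),
    ('G', 'Area4', 'AA', 'G H'),     ('H', 'Area4', 'CC', 'G H'),
    ('I', 'Area5', 'BB', 'I'),       ('J', 'Area6', 'AA', 'J L'),
    ('L', 'Area6', 'CC', 'J L'),     ('M', 'Area5', 'BB', 'M'),
]
all_df = pd.DataFrame(all_data_set, columns=['Name', 'Area', 'Type', 'Group'])

# Split each Group string into a list, then give every member its own row.
# The original row index is repeated once per member.
exploded = all_df['Group'].str.split().explode()

def members_of(typ):
    """Hypothetical helper: space-joined group members whose Type is `typ`."""
    names_of_type = all_df.loc[all_df['Type'] == typ, 'Name']
    # Keep only the group members that have the requested Type.
    kept = exploded[exploded.isin(names_of_type)]
    # Re-join the surviving members per original row.
    joined = kept.groupby(level=0).agg(' '.join)
    # Rows with no member of this Type (the BB rows) fall back to their own Name.
    return joined.reindex(all_df.index).fillna(all_df['Name'])

all_df['AA Members'] = members_of('AA')
all_df['CC Members'] = members_of('CC')
print(all_df)
```

This relies on the Group column rather than on Area, so it only gives the desired result when every Group string lists the right names.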
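I don't think you need the grouped dfs. You can use groupby to compute the members per Area and Type, then use the resulting df to look up the AA and CC members. Finally, fill the NA values with Name: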
import pandas as pd
all_data_set = [
('A','Area1','AA','A B D E'),
('B','Area1','AA','A B D E'),
('C','Area2','BB','C'),
('D','Area1','CC','A B D E'),
('E','Area1','CC','A B D E'),
('F','Area3','BB','F'),
('G','Area4','AA','G H'),
('H','Area4','CC','G H'),
('I','Area5','BB','I'),
('J','Area6','AA','J L'),
('L','Area6','CC','J L'),
('M','Area5','BB','M')
]
all_df = pd.DataFrame(data = all_data_set, columns = ['Name','Area','Type','Group'])
members_df = all_df.groupby(['Area', 'Type']).agg({'Name': list})
#print(members_df)
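def get_members(row, typ):
    # Names of the given Type in this row's Area, joined by spaces;
    # None when the Area has no member of that Type.
    try:
        return " ".join(members_df.loc[(row['Area'], typ), 'Name'])
    except KeyError:
        return None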
all_df['AA members'] = all_df.apply(lambda x: get_members(x, 'AA'), axis=1)
all_df['CC members'] = all_df.apply(lambda x: get_members(x, 'CC'), axis=1)
# filling na values
all_df.loc[all_df['AA members'].isna(), 'AA members'] = all_df['Name']
all_df.loc[all_df['CC members'].isna(), 'CC members'] = all_df['Name']
print(all_df)
Output:
    Name  Area   Type  Group    AA members  CC members
0   A     Area1  AA    A B D E  A B         D E
1   B     Area1  AA    A B D E  A B         D E
2   C     Area2  BB    C        C           C
3   D     Area1  CC    A B D E  A B         D E
4   E     Area1  CC    A B D E  A B         D E
5   F     Area3  BB    F        F           F
6   G     Area4  AA    G H      G           H
7   H     Area4  CC    G H      G           H
8   I     Area5  BB    I        I           I
9   J     Area6  AA    J L      J           L
10  L     Area6  CC    J L      J           L
11  M     Area5  BB    M        M           M
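The same result can probably be reached without the row-wise apply. The following is a minimal sketch of an alternative, not the method used above: it assumes both 'AA' and 'CC' occur in the Type column at least once, and the names `members`, `wide` and `result` are introduced here for illustration only.

```python
import pandas as pd

all_data_set = [
    ('A', 'Area1', 'AA', 'A B D E'), ('B', 'Area1', 'AA', 'A B D E'),
    ('C', 'Area2', 'BB', 'C'),       ('D', 'Area1', 'CC', 'A B D E'),
    ('E', 'Area1', 'CC', 'A B D E'), ('F', 'Area3', 'BB', 'F'),
    ('G', 'Area4', 'AA', 'G H'),     ('H', 'Area4', 'CC', 'G H'),
    ('I', 'Area5', 'BB', 'I'),       ('J', 'Area6', 'AA', 'J L'),
    ('L', 'Area6', 'CC', 'J L'),     ('M', 'Area5', 'BB', 'M'),
]
all_df = pd.DataFrame(all_data_set, columns=['Name', 'Area', 'Type', 'Group'])

# One space-joined string of names per (Area, Type) pair.
members = all_df.groupby(['Area', 'Type'])['Name'].agg(' '.join)

# Spread Type into columns: one row per Area, NaN where an Area lacks that Type.
wide = members.unstack('Type')

result = all_df.copy()
for typ in ['AA', 'CC']:
    # Look up the members of this Type by Area; fall back to the row's own Name.
    result[f'{typ} members'] = result['Area'].map(wide[typ]).fillna(result['Name'])

print(result)
```

Mapping on Area replaces the per-row lookup with one vectorized step per Type, which may matter on larger frames; on a data set this small the difference is negligible.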