How to merge two pandas DataFrames and aggregate one specific column
I have two DataFrames:
city count school
0 New York 1 school_3
1 Washington 1 School_4
2 Washington 1 School_5
3 LA 1 School_1
4 LA 1 School_4
city count school
0 New York 1 School_3
1 Washington 1 School_1
2 LA 1 School_3
3 LA 2 School_4
I want to get this result:
city count school
0 New York 2 school_3
1 Washington 1 School_1
2 Washington 1 School_4
3 Washington 1 School_5
4 LA 1 School_1
5 LA 1 School_3
6 LA 3 School_4
Here is my code:
import pandas as pd

d1 = [{'city':'New York', 'school':'school_3', 'count':1},
      {'city':'Washington', 'school':'School_4', 'count':1},
      {'city':'Washington', 'school':'School_5', 'count':1},
      {'city':'LA', 'school':'School_1', 'count':1},
      {'city':'LA', 'school':'School_4', 'count':1}]
d2 = [{'city':'New York', 'school':'School_3', 'count':1},
      {'city':'Washington', 'school':'School_1', 'count':1},
      {'city':'LA', 'school':'School_3', 'count':1},
      {'city':'LA', 'school':'School_4', 'count':2}]
x1 = pd.DataFrame(d1)
x2 = pd.DataFrame(d2)
# this just gives an empty DataFrame
print(pd.merge(x1, x2))
How can I get the aggregated result?
You can do it like this:
>>> pd.concat([x1, x2]).groupby(["city", "school"], as_index=False)["count"].sum()
city school count
0 LA School_1 1
1 LA School_3 1
2 LA School_4 3
3 New York School_3 1
4 New York school_3 1
5 Washington School_1 1
6 Washington School_4 1
7 Washington School_5 1
Note that New York appears twice because of a typo in the data (school_3 vs School_3).
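If the lowercase school_3 really is a typo, one way to get the single New York row the question asks for is to normalize the school names before grouping. This is only a minimal sketch, assuming every school name should start with a capital letter; the names `combined` and `result` are illustrative and not from the original post.

```python
import pandas as pd

x1 = pd.DataFrame([
    {"city": "New York", "school": "school_3", "count": 1},
    {"city": "Washington", "school": "School_4", "count": 1},
    {"city": "Washington", "school": "School_5", "count": 1},
    {"city": "LA", "school": "School_1", "count": 1},
    {"city": "LA", "school": "School_4", "count": 1},
])
x2 = pd.DataFrame([
    {"city": "New York", "school": "School_3", "count": 1},
    {"city": "Washington", "school": "School_1", "count": 1},
    {"city": "LA", "school": "School_3", "count": 1},
    {"city": "LA", "school": "School_4", "count": 2},
])

# Stack both frames, then fix the casing (assumption: names start with a capital).
combined = pd.concat([x1, x2], ignore_index=True)
combined["school"] = combined["school"].str.capitalize()

# Sum the counts for each (city, school) pair.
result = combined.groupby(["city", "school"], as_index=False)["count"].sum()
print(result)
```

With the names normalized, the two New York rows fall into the same group and their counts are added together.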
Here is a slightly different implementation from @elyase's solution, using pandas.DataFrame.merge(...):
x1.merge(x2,on=['city', 'school', 'count'], how='outer').groupby(['city', 'school'], as_index=False)['count'].sum()
When timed with %timeit in an IPython notebook, this approach is slightly faster (by less than 1 ms) than @elyase's:
100 loops, best of 3: 6.25 ms per loop #using concat(...) with @elyase's solution
100 loops, best of 3: 5.49 ms per loop #using merge(...) in this solution
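To repeat the comparison outside a notebook, something like the following could be used. It is a rough sketch built on the standard-library `timeit` module rather than the `%timeit` magic, and the helper names `with_concat` and `with_merge` are made up for illustration. Absolute numbers will vary by machine and pandas version, so the figures above should not be expected to reproduce exactly.

```python
import timeit

import pandas as pd

x1 = pd.DataFrame([
    {"city": "New York", "school": "school_3", "count": 1},
    {"city": "Washington", "school": "School_4", "count": 1},
    {"city": "Washington", "school": "School_5", "count": 1},
    {"city": "LA", "school": "School_1", "count": 1},
    {"city": "LA", "school": "School_4", "count": 1},
])
x2 = pd.DataFrame([
    {"city": "New York", "school": "School_3", "count": 1},
    {"city": "Washington", "school": "School_1", "count": 1},
    {"city": "LA", "school": "School_3", "count": 1},
    {"city": "LA", "school": "School_4", "count": 2},
])
keys = ["city", "school"]


def with_concat():
    # Stack the frames, then sum counts per (city, school).
    return pd.concat([x1, x2]).groupby(keys, as_index=False)["count"].sum()


def with_merge():
    # Outer-merge on all three columns, then sum counts per (city, school).
    merged = x1.merge(x2, on=["city", "school", "count"], how="outer")
    return merged.groupby(keys, as_index=False)["count"].sum()


runs = 100
for name, fn in [("concat", with_concat), ("merge", with_merge)]:
    seconds = timeit.timeit(fn, number=runs)
    print(f"{name}: {seconds / runs * 1e3:.2f} ms per call")
```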
In addition, if you want to use the pandas aggregate functionality, you can also do:
import numpy
x1.merge(x2,on=['city', 'school', 'count'], how='outer').groupby(['city', 'school'], as_index=False).agg(numpy.sum)
The only caveat is that using agg(...) is the slowest of the three solutions.
All three give the same correct result for this data:
city school count
0 LA School_1 1
1 LA School_3 1
2 LA School_4 3
3 New York School_3 1
4 New York school_3 1
5 Washington School_1 1
6 Washington School_4 1
7 Washington School_5 1
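One thing to check on your own data: the merge(...) variants join on all three columns, so a row that is identical in both frames (same city, school and count) is matched and kept only once, and its count is not doubled. The concat(...) version does not have this problem. A small sketch with made-up single-row frames `a` and `b` shows the difference:

```python
import pandas as pd

# Two frames containing exactly the same row (hypothetical data).
a = pd.DataFrame([{"city": "LA", "school": "School_4", "count": 1}])
b = pd.DataFrame([{"city": "LA", "school": "School_4", "count": 1}])
keys = ["city", "school"]

# concat keeps both rows, so the counts add up to 2.
via_concat = pd.concat([a, b]).groupby(keys, as_index=False)["count"].sum()

# The outer merge treats the identical rows as one match, leaving a single row.
via_merge = (
    a.merge(b, on=["city", "school", "count"], how="outer")
    .groupby(keys, as_index=False)["count"]
    .sum()
)

print(via_concat["count"].tolist())  # [2]
print(via_merge["count"].tolist())   # [1]
```

In the question's data no row appears in both frames with the same count, which is why all three approaches agree there.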