How to use an inner join and groupby in the same Python code
I have the following input from an Excel file (Sheet1 and Sheet2).
Sheet1:
Order ID | Order Date | Segment  | Sales
1001     | 11-11-2016 | Consumer | 100
1001     | 11-11-2016 | Consumer | 200
2001     | 16-06-2016 | Consumer | 300
Sheet2:
Returned | Order ID
Yes      | 1001
I am using the code below in Python, where I use an inner join and groupby to get only the matching records from the two sheets:
import pandas as pd
Sheet1 = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\sample data.xlsx", sheet_name='Sheet1')
Sheet2 = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\sample data.xlsx", sheet_name='Sheet2')
Order_Year = pd.DatetimeIndex(Sheet1['Order Date']).year
Sheet1.merge(Sheet2, on='Order ID', how='inner')
Sheet1.groupby(['Order ID',Order_Year, 'Segment'])['Sales'].sum()
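Output:
As you can see in the output above, it shows all the records instead of only the matching ones. I want output like the following.
Required output:
Can someone help me modify the Python code above to get the required output?
Regards,
Vikas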
Let's try this:
print(
    # Keep only the Sheet1 rows whose Order ID also appears in Sheet2
    Sheet1[Sheet1['Order ID'].isin(Sheet2['Order ID'])]
    .assign(Year=pd.to_datetime(Sheet1['Order Date']).dt.year)
    .groupby(['Order ID', 'Segment', 'Year'])['Sales'].sum()
    .reset_index(name="Sales_Sum")
)
Order ID Segment Year Sales_Sum
0 1001 Consumer 2016 300
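One difference between this isin filter and an inner merge is worth keeping in mind: if an Order ID appears more than once in the second sheet, the merge repeats the matching order rows, while isin keeps each of them once. The sketch below illustrates this with made-up data; the duplicated return row and the names orders and returns are assumptions, not part of the question.

```python
import pandas as pd

# Hypothetical data: Order ID 1001 is listed twice in the returns sheet.
orders = pd.DataFrame({
    "Order ID": [1001, 1001, 2001],
    "Sales": [100, 200, 300],
})
returns = pd.DataFrame({
    "Returned": ["Yes", "Yes"],
    "Order ID": [1001, 1001],
})

# isin keeps each matching row of orders exactly once.
isin_total = orders[orders["Order ID"].isin(returns["Order ID"])]["Sales"].sum()

# An inner merge repeats each order row once per matching returns row.
merge_total = orders.merge(returns, on="Order ID", how="inner")["Sales"].sum()

# Dropping duplicate keys first makes the merge agree with the isin filter.
dedup_total = orders.merge(
    returns.drop_duplicates("Order ID"), on="Order ID", how="inner"
)["Sales"].sum()

print(isin_total, merge_total, dedup_total)
```

With this data the three totals come out as 300, 600 and 300, so the plain merge double-counts the sales for the duplicated key.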
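In your question, you are applying groupby() to Sheet1 rather than to the joined DataFrame. merge() returns a new DataFrame and leaves Sheet1 unchanged, so the result of the join is discarded unless you assign it or chain onto it.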
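import re
import pandas as pd

# Sample data; columns are separated by two spaces so "Order ID" stays one field
s1 = '''Order ID  Order Date  Segment  Sales
1001  11-11-2016  Consumer  100
1001  11-11-2016  Consumer  200
2001  16-06-2016  Consumer  300'''
s2 = '''Returned  Order ID
Yes  1001'''
# Split each line on runs of two or more whitespace characters
s1 = [[t.strip() for t in re.split(r"\s{2,}", l) if t != ""] for l in s1.split("\n")]
s2 = [[t.strip() for t in re.split(r"\s{2,}", l) if t != ""] for l in s2.split("\n")]
Sheet1 = pd.DataFrame(s1[1:], columns=s1[0])
# The dates are day-first (16-06-2016), so give the format explicitly
Sheet1["Year"] = pd.to_datetime(Sheet1["Order Date"], format="%d-%m-%Y").dt.year
Sheet1["Sales"] = pd.to_numeric(Sheet1["Sales"])
Sheet2 = pd.DataFrame(s2[1:], columns=s2[0])
# Chain groupby onto the merged result instead of calling it on Sheet1
result = Sheet1.merge(Sheet2, on='Order ID', how='inner')\
    .groupby(['Order ID', 'Year', 'Segment']).agg(Sales_sum=("Sales", "sum")).reset_index()
print(result)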
Output:
Order ID Year Segment Sales_sum
0 1001 2016 Consumer 300
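For comparison, the smallest change to the code in the question is probably to keep the DataFrame that merge returns and group that instead of Sheet1. The sketch below assumes the two sheets are built inline rather than read from the Excel file, so that it runs on its own; the names merged, order_year and result are illustrative.

```python
import pandas as pd

# Inline stand-ins for the two Excel sheets (assumption: same columns as in the question).
Sheet1 = pd.DataFrame({
    "Order ID": [1001, 1001, 2001],
    "Order Date": pd.to_datetime(
        ["11-11-2016", "11-11-2016", "16-06-2016"], format="%d-%m-%Y"
    ),
    "Segment": ["Consumer"] * 3,
    "Sales": [100, 200, 300],
})
Sheet2 = pd.DataFrame({"Returned": ["Yes"], "Order ID": [1001]})

# merge returns a new DataFrame; Sheet1 itself is left unchanged.
merged = Sheet1.merge(Sheet2, on="Order ID", how="inner")

# Take the year from the merged frame so it lines up with the rows being grouped.
order_year = merged["Order Date"].dt.year.rename("Year")

result = merged.groupby(["Order ID", order_year, "Segment"])["Sales"].sum()
print(result)
```

Computing the year from merged rather than from Sheet1 matters here: after the join the row index no longer corresponds to Sheet1, so a year series built from Sheet1 could be misaligned or have the wrong length.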