Pandas DataFrame Pivot Using Dates and Counts
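I took a large data file and managed to use groupby and value_counts to get the DataFrame below. However, I want to reshape it so that the companies run down the left, the months run across the top, and each number is the count of calls for that month (the third column).
Here is the code I am trying to sort out: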
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
This is my output df=
recvd_dttm CompanyName
1/1/2015 11:42 Company 1 1
1/1/2015 14:29 Company 2 1
1/1/2015 8:12 Company 4 1
1/1/2015 9:53 Company 1 1
1/10/2015 11:38 Company 3 1
1/10/2015 11:31 Company 5 1
1/10/2015 12:04 Company 2 1
I want:
Company Jan Feb Mar Apr May
Company 1 10 4 45 40 34
Company 2 2 5 56 5 57
Company 3 3 7 71 6 53
Company 4 4 4 38 32 2
Company 5 20 3 3 3 29
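I know from this documentation http://pandas.pydata.org/pandas-docs/stable/reshaping.html that pandas DataFrames have a nifty pivot function, so I have been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls').
One problem is that the third column has no name, so I cannot pass it as values='NumberCalls'. The second problem is figuring out how to take the datetime values in my DataFrame and have them display by month only.
Edit:
CompanyName is the first column and recvd_dttm is the 15th column. This is my code after some more attempts: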
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
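It throws this error: KeyError: 'recvd_dttm' and never reaches the result line.
You need to aggregate the data before creating the pivot table. If a column has no name, you can either refer to it by position with df.iloc[:, 1] (the 2nd column) or simply rename the columns of df.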
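As a minimal sketch of those two options, assuming a tiny made-up frame whose count column has an empty name (the rows and the name NumberCalls are illustrative, not taken from the real file):

```python
import pandas as pd

# Hypothetical miniature of the frame in the question:
# the second column holds the counts but has an empty name.
df = pd.DataFrame({
    "CompanyName": ["Company 1", "Company 2", "Company 1"],
    "": [1, 1, 1],
})

# Option 1: refer to the unnamed column by position (the 2nd column).
by_position = df.iloc[:, 1]

# Option 2: give every column an explicit name, then use it by label.
renamed = df.copy()
renamed.columns = ["CompanyName", "NumberCalls"]
per_company = renamed.groupby("CompanyName")["NumberCalls"].sum()

print(by_position.sum())
print(per_company)
```

Renaming is usually the easier route, because later calls such as pivot take column labels rather than positions.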
import pandas as pd
import numpy as np
# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1
df.columns = ['CompanyName', '']
Out[34]:
CompanyName
recvd_dttm
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company1 1
2015-01-01 00:00:00 company2 1
2015-01-01 01:00:00 company4 1
2015-01-01 01:00:00 company2 1
2015-01-01 01:00:00 company5 1
2015-01-01 03:00:00 company3 1
2015-01-01 03:00:00 company2 1
2015-01-01 03:00:00 company3 1
2015-01-01 04:00:00 company4 1
2015-01-01 04:00:00 company1 1
2015-01-01 04:00:00 company3 1
2015-01-01 05:00:00 company2 1
2015-01-01 06:00:00 company5 1
... ... ..
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company5 1
2015-06-29 19:00:00 company5 1
2015-06-29 20:00:00 company1 1
2015-06-29 20:00:00 company4 1
2015-06-29 22:00:00 company1 1
2015-06-29 22:00:00 company2 1
2015-06-29 22:00:00 company4 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company2 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company4 1
[10000 rows x 2 columns]
# First group by month and company name, sum the calls, and reset the index.
# The count column has no name, so refer to it by position (the 2nd column).
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: 'sum'}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']
Out[41]:
Month CompanyName counts
0 1 company1 328
1 1 company2 337
2 1 company3 342
3 1 company4 345
4 1 company5 331
5 2 company1 295
6 2 company2 300
7 2 company3 328
8 2 company4 304
9 2 company5 329
10 3 company1 366
11 3 company2 398
12 3 company3 339
13 3 company4 336
14 3 company5 345
15 4 company1 322
16 4 company2 348
17 4 company3 351
18 4 company4 340
19 4 company5 312
20 5 company1 347
21 5 company2 354
22 5 company3 347
23 5 company4 363
24 5 company5 312
25 6 company1 316
26 6 company2 311
27 6 company3 331
28 6 company4 307
29 6 company5 316
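The same aggregation can also be sketched directly from raw string timestamps like those in the question, which avoids the unnamed column altogether. This is an alternative under assumptions, not the code above: the four rows are made up, and the column name counts is chosen only to match the example.

```python
import pandas as pd

# Made-up rows in the same shape as the question's data.
data = pd.DataFrame({
    "recvd_dttm": ["1/1/2015 11:42", "1/1/2015 14:29", "1/10/2015 11:38", "2/3/2015 9:53"],
    "CompanyName": ["Company 1", "Company 2", "Company 1", "Company 1"],
})

# Parse the strings into real timestamps so the month can be extracted.
data["recvd_dttm"] = pd.to_datetime(data["recvd_dttm"], format="%m/%d/%Y %H:%M")

# size() returns a Series whose index holds the group keys;
# reset_index(name=...) turns the keys back into columns and names the counts.
month = data["recvd_dttm"].dt.month.rename("Month")
result = data.groupby([month, "CompanyName"]).size().reset_index(name="counts")

print(result)
```

Because the group keys live in the index right after size(), calling set_index('recvd_dttm') on that result looks for a column that does not exist, which is one plausible source of the KeyError reported in the question.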
# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')
Out[44]:
Month 1 2 3 4 5 6
CompanyName
company1 326 297 339 337 344 308
company2 310 318 342 328 355 296
company3 347 315 350 343 347 329
company4 339 314 367 353 343 311
company5 370 331 370 320 357 294
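To get month abbreviations across the top, as in the layout the question asks for, one possible sketch uses pd.crosstab and relabels the columns afterwards. The sample rows are made up, and calendar.month_abbr follows the current locale, so the labels are assumed to be English here.

```python
import calendar
import pandas as pd

# Made-up rows; recvd_dttm is already parsed into timestamps.
data = pd.DataFrame({
    "recvd_dttm": pd.to_datetime([
        "2015-01-01 11:42", "2015-01-10 11:38",
        "2015-02-03 09:53", "2015-02-04 10:00",
    ]),
    "CompanyName": ["Company 1", "Company 1", "Company 1", "Company 2"],
})

# Count rows per (company, month number); combinations with no calls become 0.
table = pd.crosstab(data["CompanyName"], data["recvd_dttm"].dt.month)

# Relabel after counting so the columns stay in calendar order
# rather than being sorted alphabetically by month name.
table.columns = [calendar.month_abbr[int(m)] for m in table.columns]
table.index.name = "Company"

print(table)
```

crosstab does the counting and the reshaping in one call, so no intermediate counts column is needed.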