How to categorize data based on column values in pandas?
Let's say I have this dataframe:
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'payout': [.1, .15, .2, .3, 1.2, 1.3, 1.45, 2, 2.04, 3.011, 3.45, 1],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'payout', 'name', 'preTestScore', 'postTestScore'])
Now I would like to build these categories based on the "payout" column:
Cat1 : 0 <= x <= 1
Cat2 : 1 < x <= 2
Cat3 : 2 < x <= 3
Cat4 : 3 < x <= 4
and compute the sum of the postTestScore column for each category.
This is how I did it, using boolean indexing:
df.loc[(df['payout'] > 0) & (df['payout'] <= 1), 'postTestScore'].sum()
df.loc[(df['payout'] > 1) & (df['payout'] <= 2), 'postTestScore'].sum()
etc...
This works fine, but does anyone know a more succinct (Pythonic) solution?
Try pd.cut and groupby:
df.groupby(pd.cut(df.payout, [0, 1, 2, 3, 4])).postTestScore.sum()
payout
(0, 1]    308
(1, 2]    246
(2, 3]     62
(3, 4]    132
Name: postTestScore, dtype: int64
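Note that the intervals produced by pd.cut are open on the left by default, so a payout of exactly 0 would fall outside (0, 1] even though the question defines Cat1 as 0 <= x <= 1. A minimal sketch of one way to handle this, assuming a small made-up frame (the values below are illustrative, not the question's data) and using the include_lowest parameter of pd.cut:

```python
import pandas as pd

# Illustrative data only; the 0.0 payout is added to show the edge case.
df = pd.DataFrame({
    "payout": [0.0, 0.1, 1.0, 1.2, 2.0, 2.04, 3.45],
    "postTestScore": [10, 25, 70, 70, 57, 62, 62],
})
bins = [0, 1, 2, 3, 4]

# Default behaviour: the first bin is (0, 1], so 0.0 is not assigned to any bin.
default_groups = pd.cut(df["payout"], bins=bins)

# include_lowest=True closes the first interval on the left,
# so a payout of exactly 0 lands in the first bin.
groups = pd.cut(df["payout"], bins=bins, include_lowest=True)

# observed=False keeps empty bins in the result.
totals = df.groupby(groups, observed=False)["postTestScore"].sum()
print(totals)
```

With the question's actual data there is no payout of exactly 0, so both variants should give the same sums.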
Create the categories with cut, then groupby and aggregate with sum:
bins = [0,1,2,3,4]
labels=['Cat{}'.format(x) for x in range(1, len(bins))]
binned = pd.cut(df['payout'], bins=bins, labels=labels)
print(binned)
0     Cat1
1     Cat1
2     Cat1
3     Cat1
4     Cat2
5     Cat2
6     Cat2
7     Cat2
8     Cat3
9     Cat4
10    Cat4
11    Cat1
Name: payout, dtype: category
Categories (4, object): [Cat1 < Cat2 < Cat3 < Cat4]
df1 = df.groupby(binned)['postTestScore'].sum().reset_index()
print(df1)
  payout  postTestScore
0   Cat1            308
1   Cat2            246
2   Cat3             62
3   Cat4            132
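If the category is needed for more than one aggregation, it may be convenient to store the labels as a column first. The following is only a sketch under assumptions: the column name "category" and the small sample frame are made up for illustration.

```python
import pandas as pd

# Illustrative data only, not the question's full frame.
df = pd.DataFrame({
    "payout": [0.1, 1.2, 2.04, 3.45, 1.0],
    "postTestScore": [25, 70, 62, 62, 70],
})
bins = [0, 1, 2, 3, 4]
labels = ["Cat1", "Cat2", "Cat3", "Cat4"]

# Store the bin label of each row in a new (hypothetical) column.
df["category"] = pd.cut(df["payout"], bins=bins, labels=labels)

# Group by the stored column and compute several aggregates at once.
summary = df.groupby("category", observed=False)["postTestScore"].agg(["sum", "count"])
print(summary)
```

Keeping the labels in the frame also makes it easy to filter or plot by category later.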
The same as a one-line solution:
df1 = df.groupby(pd.cut(df['payout'],
                        bins=[0,1,2,3,4],
                        labels=['Cat1','Cat2','Cat3','Cat4']))['postTestScore'].sum()
print(df1)
payout
Cat1    308
Cat2    246
Cat3     62
Cat4    132
Name: postTestScore, dtype: int64
Another very fast solution, using numpy:
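import numpy as np
# labs[0] ('Cat0') is only a placeholder, so that index i maps to 'Cat{i}'
labs = ['Cat{}'.format(x) for x in range(len(bins))]
# searchsorted (side='left' by default) returns the first index j with bins[j] >= payout
a = np.array(labs)[np.array(bins).searchsorted(df['payout'].values)]
print(a)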
['Cat1' 'Cat1' 'Cat1' 'Cat1' 'Cat2' 'Cat2' 'Cat2' 'Cat2' 'Cat3' 'Cat4'
'Cat4' 'Cat1']
df1 = df.groupby(a)['postTestScore'].sum().rename_axis('cats').reset_index()
print(df1)
   cats  postTestScore
0  Cat1            308
1  Cat2            246
2  Cat3             62
3  Cat4            132
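To see why the searchsorted approach gives the same right-closed bins as pd.cut, here is a small sketch on a handful of made-up payouts (the sample values are chosen for illustration). It assumes every value lies in (0, 4]; a value of exactly 0 would map to the placeholder label 'Cat0', and a value above 4 would index past the end of the label array.

```python
import numpy as np

bins = np.array([0, 1, 2, 3, 4])
payout = np.array([0.1, 1.0, 1.2, 2.0, 2.04, 3.011])  # illustrative values

# With the default side='left', searchsorted returns for each value the first
# index j such that bins[j] >= value, i.e. the value lies in (bins[j-1], bins[j]].
idx = bins.searchsorted(payout)

# Index 0 is a placeholder label so that index j maps directly to 'Cat{j}'.
labels = np.array(['Cat{}'.format(i) for i in range(len(bins))])
cats = labels[idx]
print(idx)
print(cats)
```

Boundary values such as 1.0 and 2.0 end up in the lower category, which matches the (a, b] intervals used in the pd.cut answers.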