After binning a column of a dataframe, how do I make a new dataframe that counts the number of elements in each bin?
Suppose I have a dataframe, df:
>>> df
Age Score
19 1
20 2
24 3
19 2
24 3
24 1
24 3
20 1
19 1
20 3
22 2
22 1
I want to build a new dataframe that bins Age and stores the total number of elements in each bin in separate Score columns:
Age Score 1 Score 2 Score 3
19-21 2 4 3
22-24 2 2 9
Here is my approach, which feels convoluted (that is, it shouldn't be this hard):
import numpy as np
import pandas as pd
data = pd.DataFrame(columns=['Age', 'Score'])
data['Age'] = [19,20,24,19,24,24,24,20,19,20,22,22]
data['Score'] = [1,2,3,2,3,1,3,1,1,3,2,1]
_, bins = np.histogram(data['Age'], 2)
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])] #dynamically create labels
labels[0] = '{}-{}'.format(bins[0], bins[1])
df = pd.DataFrame(columns=['Score', labels[0], labels[1]])
df['Score'] = data.Score.unique()
for i in labels:
    df[i] = np.zeros(3)
for i in range(len(data)):
    for j in range(len(labels)):
        m1, m2 = labels[j].split('-') # lower & upper bounds of the age interval
        if ((float(data['Age'][i])>float(m1)) & (float(data['Age'][i])<float(m2))): # find the age group in which each age lies
            if data['Score'][i]==1:
                index = 0
            elif data['Score'][i]==2:
                index = 1
            elif data['Score'][i]==3:
                index = 2
            df[labels[j]][index] += 1
df.sort_values('Score', inplace=True)
df.set_index('Score', inplace=True)
print(df)
This produces
19.0-21.5 22.5-24.0
Score
1 2.0 2.0
2 4.0 2.0
3 3.0 9.0
Is there a better, cleaner, more efficient way to do this?
cats = ['1', '2', '3']
bins = [0, 1, 2, 3]
data = data[['Age']].join(pd.get_dummies(pd.cut(data.Score, bins, labels=cats)))
data['bins'] = pd.cut(data['Age'], bins=[19,21,24], include_lowest=True)
data.groupby('bins').sum()
Age 1 2 3
bins
(18.999, 21.0] 117 3 2 1
(21.0, 24.0] 140 2 1 3
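You can remove or rename the bins index and the Age column as needed (the Age column above is just the sum of ages per bin); the bin edges also take a little tweaking to get the inclusion right.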
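As a rough sketch of that cleanup, assuming the same data as in the question: building the indicator columns directly from Score and grouping only those columns avoids the summed Age column entirely. The labels "19-21" and "22-24" and the "Score " prefix are my own choices to match the layout asked for, not something the answer above specifies.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})

# One 0/1 indicator column per Score value, named "Score 1", "Score 2", "Score 3".
# dtype=int keeps the later sums as plain integer counts.
dummies = pd.get_dummies(data["Score"], prefix="Score", prefix_sep=" ", dtype=int)

# Bin Age into [19, 21] and (21, 24]; the label strings are illustrative.
age_bins = pd.cut(data["Age"], bins=[19, 21, 24],
                  include_lowest=True, labels=["19-21", "22-24"])

# Summing the indicators per age bin gives the number of rows per (bin, score).
counts = dummies.groupby(age_bins, observed=False).sum()
counts.index.name = "Age"
print(counts)
```

Because Age never enters the grouped frame, there is nothing to drop afterwards.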
I'm not entirely sure what result you want (are you multiplying the counts by the score...?), but this may help:
>>> data['age_binned'] = pd.cut(data['Age'], [18,21,24])
>>> data.groupby(['age_binned', 'Score'])['Age'].nunique().unstack()
Score 1 2 3
age_binned
(18, 21] 2 2 1
(21, 24] 2 1 1
I assumed you want the number of unique elements; if you just want the total number of elements, use .count() instead of .nunique().
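To make the difference concrete, here is a small side-by-side sketch on the question's data. The variable names unique_ages and total_rows are mine, and observed=False is only passed to keep newer pandas versions from warning about categorical group keys.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})
data["age_binned"] = pd.cut(data["Age"], [18, 21, 24])

grouped = data.groupby(["age_binned", "Score"], observed=False)["Age"]

# Number of distinct ages per (bin, score) pair
unique_ages = grouped.nunique().unstack()

# Number of rows per (bin, score) pair
total_rows = grouped.count().unstack()

print(unique_ages)
print(total_rows)
```

The two tables differ wherever the same age appears more than once in a cell, for example the three rows with Age 24 and Score 3.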
IIUC, I think you can try one of these:
1. If you already know the bins:
df['Age'] = np.where(df['Age']<=21,'19-21','22-24')
df.groupby(['Age'])['Score'].value_counts().unstack()
2. If you know how many bins you need:
df.Age = pd.cut(df.Age, bins=2,include_lowest=True)
df.groupby(['Age'])['Score'].value_counts().unstack()
3. Jon Clements's idea from the comments:
pd.crosstab(pd.cut(df.Age, [19, 21, 24],include_lowest=True), df.Score)
All three produce the same counts; only the index labels differ by method. With the third, the output is:
Score 1 2 3
Age
(18.999, 21.0] 3 2 1
(21.0, 24.0] 2 1 3
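If you want the row and column labels to look like the table in the question, a sketch along these lines should work, starting from the crosstab version. The label strings and the "Score " prefix are assumptions on my part, chosen to mirror the requested layout.

```python
import pandas as pd

df = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})

# Bin Age with explicit, human-readable labels instead of interval notation
age_group = pd.cut(df["Age"], bins=[19, 21, 24],
                   labels=["19-21", "22-24"], include_lowest=True)

# Cross-tabulate bin against score, then rename columns 1, 2, 3 to "Score 1", ...
table = pd.crosstab(age_group, df["Score"]).add_prefix("Score ")
table.columns.name = None  # drop the leftover "Score" header above the columns
print(table)
```

Note that this does not overwrite df["Age"], so the original column stays available for other work.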