Python Pandas: categorize/bin by numeric groupings with zero values
I'm not sure this is the most efficient approach, but I'm trying to group customer spend into bins/buckets.
Here is the df I'm working with:
df.head()
Best_ID_S| Dollar
abc2464 0.00
fdhg357 672.00
hjg5235 250.00
mjhur57 199.00
erew3452 116.25
Here is my code:
bins = [0,250,500,750,1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,6000,6500,7000,8000,1000000000000]
#I didn't know how to create 8000+ so I just added a crazy number in the end, it works
group_names = ['0-250','251-500','501-749','750-999','1000-1499','1500-1999','2000-2499','2500-2999','3000-3499','3500-3999','4000-4499','4500-4999','5000-5499','5500-5999','6000-6499','6500-6999','7000-7499','8000+']
categories = pd.cut(df_2014['Dollar'], bins, labels=group_names)
df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names)
df['Buckets'] = pd.cut(df['Dollar'], bins)
This is what I get when I run df.head():
Best_ID_S| Dollar | Category | Buckets
abc2464 0.00 NaN
fdhg357 672.00 501-749 (500, 750]
hjg5235 250.00 0-250 (0, 250]
mjhur57 199.00 0-250 (0, 250]
erew3452 116.25 0-250 (0, 250]
When the Dollar value is 0, I need it to fall in the 0-250 bucket, but I get NaN instead.
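The right parameter defaults to True, which makes every bin open on the left and closed on the right. In interval notation, ( means the left edge is excluded, so you need [ to include the left value. Change the pd.cut calls to: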
df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names,right=False)
df['Buckets'] = pd.cut(df['Dollar'], bins,right=False)
Best_ID_S| Dollar Category Buckets
0 abc2464 0.00 0-250 [0, 250)
1 fdhg357 672.00 501-749 [500, 750)
2 hjg5235 250.00 251-500 [250, 500)
3 mjhur57 199.00 0-250 [0, 250)
4 erew3452 116.25 0-250 [0, 250)
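For anyone who wants to reproduce this, here is a minimal sketch using the five sample rows from the question. The shortened bin list and the label names in demo_labels are my own assumptions for illustration, not the asker's full set.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Best_ID_S": ["abc2464", "fdhg357", "hjg5235", "mjhur57", "erew3452"],
    "Dollar": [0.00, 672.00, 250.00, 199.00, 116.25],
})

# Shortened, hypothetical edges and labels (assumed for this demo only).
bins = [0, 250, 500, 750, 1000]
demo_labels = ["0-249", "250-499", "500-749", "750-999"]

# Default right=True: bins are (0, 250], (250, 500], ... so 0 falls outside.
default_cut = pd.cut(df["Dollar"], bins, labels=demo_labels)

# right=False: bins are [0, 250), [250, 500), ... so 0 is included.
left_closed = pd.cut(df["Dollar"], bins, labels=demo_labels, right=False)

print(default_cut.isna().tolist())
print(left_closed.astype(str).tolist())
```

Note that with right=False a value of exactly 250 moves into the second bin, so the labels have to be written to match left-closed intervals.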
If you only need the first bin to include its left edge, you can also keep right=True and set include_lowest to True.
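The two options differ at interior bin edges, which may matter for a value like 250. A small sketch of that difference, with a made-up two-bin example (the edges and values here are assumptions chosen for the demo):

```python
import pandas as pd

values = pd.Series([0, 250])
edges = [0, 250, 500]

# right=False makes every bin left-closed: [0, 250), [250, 500).
codes_left_closed = pd.cut(values, edges, right=False).cat.codes.tolist()

# include_lowest=True keeps right-closed bins and only closes the first
# bin on the left, so 250 stays in the first bin.
codes_include_lowest = pd.cut(values, edges, include_lowest=True).cat.codes.tolist()

print(codes_left_closed)     # bin index per value
print(codes_include_lowest)  # bin index per value
```

So right=False shifts boundary values up a bin, while include_lowest=True leaves them where the default puts them and only rescues the lowest edge.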
To create the 8000+ bin, you can use np.inf as the last bin edge:
bins = [0,250,500,750,1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,6000,6500,7000,8000,np.inf]
To include the lower bound, you can use the parameter include_lowest=True:
df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names, include_lowest=True)
df['Buckets'] = pd.cut(df['Dollar'], bins, include_lowest=True)
You get:
Best_ID_S Dollar Category Buckets
0 abc2464 0.00 0-250 [0, 250]
1 fdhg357 672.00 501-749 (500, 750]
2 hjg5235 250.00 0-250 [0, 250]
3 mjhur57 199.00 0-250 [0, 250]
4 erew3452 116.25 0-250 [0, 250]
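A self-contained sketch of the open-ended last bin, in case it helps. The three-bin edges, the labels, and the test values are assumptions picked to keep the example short; the same pattern should carry over to the full list of edges.

```python
import numpy as np
import pandas as pd

# Hypothetical short version of the bins: the last edge is np.inf
# instead of an arbitrarily large number.
edges = [0, 250, 500, np.inf]
demo_labels = ["0-250", "251-500", "500+"]

spend = pd.Series([0, 250, 500, 500.01, 1e9])

# include_lowest=True puts 0 in the first bin; np.inf catches everything
# above the last finite edge.
bucket = pd.cut(spend, edges, labels=demo_labels, include_lowest=True)

print(bucket.astype(str).tolist())
```

Because the bins stay right-closed, a value exactly on the last finite edge (500 here, 8000 in the question) lands in the bin below the open-ended one, so a label like '8000+' really means "above 8000" with these settings.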