Trying to create a grouped variable in Python
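I have a column of age values that I need to convert into the age ranges 18-29, 30-39, 40-49, 50-59, 60-69, and 70+:
Taking some data from the df 'file' as an example, I have:
and would like to get to:
I tried the following: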
file['agerange'] = file[['age']].apply(lambda x: "18-29" if (x[0] > 16
or x[0] < 30) else "other")
I'd rather not just do a groupby, since the bucket sizes aren't uniform, but I'm open to that as a solution if it works.
Thanks in advance!
You can use itertools.groupby with // 10 as the key function.
In [10]: ages = [random.randint(18, 99) for _ in range(100)]
In [11]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=lambda x: x // 10)]
Out[11]:
[(1, [18]),
(2, [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
(3, [30, 30, 32, 32, 34, 35, 36, 37, 37]),
(4, [41, 42, 42, 43, 43, 44, 45, 47, 48]),
(5, [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
(6, [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
(7, [71, 71, 72, 72, 73, 75, 75, 77, 77, 78]),
(8, [83, 83, 83, 83, 84, 84, 85, 86, 86, 87, 87, 88, 89, 89, 89]),
(9, [91, 91, 92, 92, 93, 94, 97, 97, 98, 98, 99, 99, 99])]
Keep in mind that groupby needs sorted data, so sort first. Or do it by hand with a dictionary and a loop.
In [14]: groups = collections.defaultdict(list)
In [15]: for x in ages:
   ....:     groups[x//10].append(x)
In [16]: groups
Out[16]: defaultdict(<type 'list'>, {1: [18],
2: [26, 28, 21, 20, 26, 24, 21, 27, 25, 23, 27, 26, 22],
3: [37, 30, 32, 32, 35, 30, 36, 37, 34],
4: [45, 42, 43, 41, 47, 43, 48, 44, 42],
5: [52, 56, 58, 55, 58, 51, 58, 58, 57, 56, 53, 56, 50, 54, 56],
6: [69, 65, 62, 61, 65, 66, 66, 62, 69, 66, 67, 66, 60, 62, 69],
7: [71, 77, 71, 72, 77, 73, 78, 72, 75, 75],
8: [87, 83, 84, 86, 86, 83, 83, 87, 85, 83, 89, 88, 84, 89, 89],
9: [99, 92, 99, 98, 91, 94, 97, 92, 98, 97, 91, 93, 99]})
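To illustrate the sorting caveat mentioned above, here is a minimal sketch with a short, deliberately unsorted list. The sample values and the decade helper are made up for illustration. itertools.groupby only merges adjacent items that share a key, so unsorted input produces the same key several times:

```python
import itertools

# Hypothetical sample, deliberately left unsorted.
ages = [25, 31, 27, 38, 22]

def decade(x):
    """Key function: 25 -> 2, 31 -> 3, and so on."""
    return x // 10

# Without sorting, each run of adjacent equal keys becomes its own group.
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(ages, key=decade)]

# Sorting first puts equal keys next to each other, giving one group per key.
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(ages), key=decade)]

print(unsorted_groups)  # [(2, [25]), (3, [31]), (2, [27]), (3, [38]), (2, [22])]
print(sorted_groups)    # [(2, [22, 25, 27]), (3, [31, 38])]
```

The dictionary approach does not have this restriction, because it looks each key up instead of relying on adjacency.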
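For more complicated groupings, you can make the key function as complex as you like. For example, to put everyone aged 70 and over into one group, use lambda x: min(x // 10, 7). This works with both approaches. If you like, you can even turn the keys into strings: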
In [23]: keyfunc = lambda x: "{0}0-{0}9".format(x//10) if x < 70 else "70+"
In [24]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=keyfunc)]
Out[24]:
[('10-19', [18]),
('20-29', [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
('30-39', [30, 30, 32, 32, 34, 35, 36, 37, 37]),
('40-49', [41, 42, 42, 43, 43, 44, 45, 47, 48]),
('50-59', [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
('60-69', [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
('70+', [all the rest]]
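Since the capped key is said to work with both approaches, here is a rough sketch of it with the dictionary method. The sample ages and the bucket_key name are hypothetical. The max(2, ...) clamp is an assumption added here, not part of the answer: it folds 18 and 19 into the 20s so the first bucket matches the 18-29 range from the question.

```python
import collections

# Hypothetical sample covering every bucket and the 70+ boundary.
ages = [18, 23, 35, 42, 55, 69, 70, 81, 99]

def bucket_key(age):
    # min(..., 7) caps everything from 70 upward at key 7 (from the answer).
    # max(2, ...) is an added assumption that merges 18-19 into the 20s bucket.
    return max(2, min(age // 10, 7))

groups = collections.defaultdict(list)
for x in ages:
    groups[bucket_key(x)].append(x)

print(dict(groups))
# {2: [18, 23], 3: [35], 4: [42], 5: [55], 6: [69], 7: [70, 81, 99]}
```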
Wouldn't a nested loop be the simplest solution?
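import random

ages = [random.randint(18, 100) for _ in range(100)]
# Inclusive (low, high) bounds; the last range has no upper limit.
age_ranges = [(18, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70,)]
for a in ages:
    for r in age_ranges:
        # <= keeps the upper bound inclusive, so 29 lands in (18, 29).
        if a >= r[0] and (len(r) == 1 or a <= r[1]):
            print(a, r)
            break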
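As a variation on the same idea (not part of the answer above), the standard library's bisect module can find the bucket from a sorted list of lower bounds without an inner loop. This is only a sketch: the names lower_bounds, labels and age_label are made up, and labelling ages under 18 as 'other' is an assumption.

```python
import bisect

# Lower bound of each bucket, in ascending order, with a matching label.
lower_bounds = [18, 30, 40, 50, 60, 70]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

def age_label(age):
    # bisect_right counts how many lower bounds are <= age;
    # subtracting 1 gives the index of the bucket the age falls into.
    idx = bisect.bisect_right(lower_bounds, age) - 1
    # Assumption: anything below the first bound is labelled 'other'.
    return labels[idx] if idx >= 0 else 'other'

print([age_label(a) for a in (17, 18, 29, 30, 69, 70, 100)])
# ['other', '18-29', '18-29', '30-39', '60-69', '70+', '70+']
```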
A friend came up with this offline reply, which works:
def age_buckets(x):
    if x < 30:
        return '18-29'
    elif x < 40:
        return '30-39'
    elif x < 50:
        return '40-49'
    elif x < 60:
        return '50-59'
    elif x < 70:
        return '60-69'
    elif x >= 70:
        return '70+'
    else:
        return 'other'
file['agerange'] = file.age.apply(age_buckets)
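For reference, a small self-contained run of this function on boundary values. The DataFrame contents are made up for the example. One detail worth knowing: every comparison against NaN is False, so a missing age skips all the branches and ends up as 'other', while ages under 18 are labelled '18-29'.

```python
import pandas as pd

def age_buckets(x):
    if x < 30:
        return '18-29'
    elif x < 40:
        return '30-39'
    elif x < 50:
        return '40-49'
    elif x < 60:
        return '50-59'
    elif x < 70:
        return '60-69'
    elif x >= 70:
        return '70+'
    else:
        # Only reached when every comparison is False, e.g. for NaN.
        return 'other'

# Hypothetical data chosen to hit the edges of each bucket.
file = pd.DataFrame({'age': [18, 29, 30, 45, 69, 70, 93]})
file['agerange'] = file['age'].apply(age_buckets)

result = file['agerange'].tolist()
print(result)
# ['18-29', '18-29', '30-39', '40-49', '60-69', '70+', '70+']
print(age_buckets(float('nan')))  # other
```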
Thanks to everyone who chimed in!
It looks like you're using the Pandas library. It includes a function for doing this: http://pandas.pydata.org/pandas-docs/version/0.16.0/generated/pandas.cut.html
Here is my attempt:
import pandas as pd
ages = pd.DataFrame([81, 42, 18, 55, 23, 35], columns=['age'])
bins = [18, 30, 40, 50, 60, 70, 120]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
# right=False makes each bin left-closed, e.g. [18, 30), so an age of 30 is labelled '30-39'
ages['agerange'] = pd.cut(ages.age, bins, labels=labels, right=False)
print(ages)
   age agerange
0   81      70+
1   42    40-49
2   18    18-29
3   55    50-59
4   23    18-29
5   35    30-39
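One thing to watch with pd.cut is which side of each bin is closed. By default the bins are closed on the right, so with edges at 30, 40, ... an age of exactly 30 would be counted in the bin that ends at 30. The sketch below passes right=False to make the bins left-closed and, as an assumption of mine, uses np.inf as the last edge so that '70+' has no upper limit. The sample ages are made up to test the boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical ages sitting on the bucket boundaries.
ages = pd.DataFrame({'age': [18, 29, 30, 69, 70, 121]})

# With right=False the bins are [18, 30), [30, 40), ..., [70, inf).
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

ages['agerange'] = pd.cut(ages['age'], bins=bins, labels=labels, right=False)

result = ages['agerange'].astype(str).tolist()
print(result)
# ['18-29', '18-29', '30-39', '60-69', '70+', '70+']
```

Ages below the first edge fall outside every bin and come back as NaN, which makes out-of-range values easy to spot.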