Calculate Grouped Median if you have a numeric interval
This is my DataFrame, which contains numeric class intervals.
import pandas as pd

df = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11],
                   'Class Interval': ['16.25-18.75', '18.75-21.25', '21.25-23.75',
                                      '23.75-26.25', '26.25-28.75', '28.75-31.25',
                                      '31.25-33.75', '33.75-36.25', '36.25-38.75',
                                      '38.75-41.25', '41.25-43.75'],
                   '' : [2,7,7,14,17,24,11,11,3,3,1],
                   'Cumulative ': [2,9,16,30,47,71,82,93,96,99,100],
                   '/n' : [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                   'Cumulative /n' : [.02, .09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df
    Class  Class Interval      Cumulative     /n  Cumulative /n
0       1     16.25-18.75   2            2  0.02           0.02
1       2     18.75-21.25   7            9  0.07           0.09
2       3     21.25-23.75   7           16  0.07           0.16
3       4     23.75-26.25  14           30  0.14           0.30
4       5     26.25-28.75  17           47  0.17           0.47
5       6     28.75-31.25  24           71  0.24           0.71
6       7     31.25-33.75  11           82  0.11           0.82
7       8     33.75-36.25  11           93  0.11           0.93
8       9     36.25-38.75   3           96  0.03           0.96
9      10     38.75-41.25   3           99  0.03           0.99
10     11     41.25-43.75   1          100  0.01           1.00
Question: How can I calculate the grouped median of this DataFrame with Python?
It can be done by hand, and the result is 29.06.
I tried 'median_grouped':
# importing median_grouped from the statistics module
from statistics import median_grouped
# printing median_grouped for the set
print("Grouped Median is %s" %(median_grouped(df['Class Interval'])))
But I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-491000133032> in <module>
4
5 # printing median_grouped for the set
----> 6 print("Grouped Median is %s" %(median_grouped(df['Class Interval'])))
~\Anaconda3\ANACONDA\lib\statistics.py in median_grouped(data, interval)
463 for obj in (x, interval):
464 if isinstance(obj, (str, bytes)):
--> 465 raise TypeError('expected number but got %r' % obj)
466 try:
467 L = x - interval/2 # The lower limit of the median interval.
TypeError: expected number but got '28.75-31.25'
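Then I tried making two columns (one for the lower bound, one for the upper bound), but that only gave me the lower bound (28.75) or the upper bound (31.25) as the median. I also tried with only the lower bound, and of course that gave me 28.75 as well.
I don't have the values inside the intervals, so I can't recreate a list of values to bin with pd.cut and try it properly (I don't want to guess). I also tried manually putting the class intervals into bins (e.g. 16.25-18.25 as (16.25,18.25]), but I got the error message: TypeError: unorderable types: Interval() < float()
Is it possible to make a column holding the intervals as numbers rather than strings, so that the grouped median can be calculated automatically with Python?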
I would first convert your intervals into two separate columns, a lower bound (lb) and an upper bound (ub):
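# Split 'a-b' strings into numeric lb/ub columns, then drop the string column.
# drop(columns=...) is used because the positional axis argument was removed in pandas 2.0.
df = (df.join(df['Class Interval'].str.split('-', expand=True)
                                  .apply(pd.to_numeric)
                                  .rename(columns={0: 'lb', 1: 'ub'}))
        .drop(columns='Class Interval'))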
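Then it looks like you can write the formula out directly:
# m is the row of the median class. For this data the middle row happens to be the
# median class; in general it is the first row whose cumulative count reaches N/2.
m = len(df)//2
gmedian = df.loc[m, 'lb'] + ((df[''].sum()/2 - df.loc[m - 1, 'Cumulative '])/(df.loc[m, '']))*(df['ub'] - df['lb']).loc[m]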
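Or, more didactically:
L = df.loc[m, 'lb']                 # lower bound of the median class
N = df[''].sum()                    # total number of observations
F = df.loc[m - 1, 'Cumulative ']    # cumulative count below the median class
f = df.loc[m, '']                   # count inside the median class
C = (df['ub'] - df['lb']).loc[m]    # width of the median class
gmedian = L + ((N/2 - F)/(f))*C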
Output:
29.0625
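The formula above takes the median class from the middle row, which only holds when the counts are roughly symmetric. As a minimal sketch of a more general variant, the function below picks the first class whose cumulative count reaches N/2. The name grouped_median and its arguments are hypothetical, not part of pandas or the statistics module, and the classes are assumed to be sorted in ascending order.

```python
from itertools import accumulate


def grouped_median(lower, upper, freq):
    """Grouped median from class bounds and counts (classes sorted ascending)."""
    lower, upper, freq = list(lower), list(upper), list(freq)
    half = sum(freq) / 2                   # N / 2
    cumulative = list(accumulate(freq))    # running total of the counts
    # Median class: first class whose cumulative count reaches N / 2.
    m = next(i for i, c in enumerate(cumulative) if c >= half)
    below = cumulative[m - 1] if m > 0 else 0   # F: count below the median class
    width = upper[m] - lower[m]                 # C: width of the median class
    return lower[m] + (half - below) / freq[m] * width


# Same classes and counts as in the question.
lower = [16.25 + 2.5 * i for i in range(11)]
upper = [b + 2.5 for b in lower]
freq = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]

result = grouped_median(lower, upper, freq)
print(result)  # 29.0625

# Skewed counts: the median class is the first row, not the middle one.
skewed = grouped_median([0, 10, 20], [10, 20, 30], [8, 1, 1])
print(skewed)  # 6.25
```

With the DataFrame from this answer, the call would be grouped_median(df['lb'], df['ub'], df['']).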
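You can recreate an artificial list of data points that carries the same statistical information (the midpoint of each interval, repeated fi times, where fi is that interval's frequency) and run the median_grouped function on it: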
from statistics import median_grouped

# Obtaining lower, upper and middle interval value
df['lower'] = df['Class Interval'].str.split('-', expand=True)[0].astype(float)
df['upper'] = df['Class Interval'].str.split('-', expand=True)[1].astype(float)
df['middle'] = (df['lower'] + df['upper'] ) / 2
# Generating an artificial list of values with the same statistical info
artificial_data_list = []
for index, row in df.iterrows():
    # Repeat each midpoint as many times as its frequency (cast to int for list repetition)
    artificial_data_list.append([row['middle']]*int(row['']))
flat_list = [item for sublist in artificial_data_list for item in sublist]
# Calculating the right median with the statistics.median_grouped function
median_grouped(flat_list,interval=2.5) # Attention to the interval size!
# => 29.0625
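The same expansion can probably be written without an explicit loop. The sketch below uses numpy.repeat to repeat each midpoint by its count and then hands the result to statistics.median_grouped. The midpoints and counts are typed in directly here so the snippet runs on its own, and it assumes every class has the same width (2.5), because median_grouped accepts only a single interval value.

```python
from statistics import median_grouped

import numpy as np

# Midpoints of the 11 classes (17.5, 20.0, ..., 42.5) and their counts.
midpoints = 17.5 + 2.5 * np.arange(11)
freq = np.array([2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1])

# Repeat each midpoint freq[i] times, then convert to plain Python floats.
expanded = np.repeat(midpoints, freq).tolist()

result = median_grouped(expanded, interval=2.5)
print(result)  # 29.0625
```

With the DataFrame from this answer, the expanded list would be np.repeat(df['middle'], df['']).tolist().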