How to find a quantile from frequency data?
Suppose I have a table of data on customers purchasing a commodity:
Customer | Price | Quantity Sold
a        | 200   | 3.3
b        | 120   | 4.1
c        | 40    | 12.0
d        | 30    | 16.76
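This is meant as a rough representation of a table holding the customer, price, and quantity sold of the same product.
I want to work out how to calculate the median purchase price from this information.
I'm a bit confused about the method. I know that getting quantiles in pandas is easy with data[row].quantile(x), but since each row here actually represents more than one observation, I'm not sure how to get the quantile.
Edit: On top of that, the main problem is that the quantity sold is not discrete. It is a continuous variable. (We are talking about things like meters, kilograms, and so on, so creating more rows is not an option.)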
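You can loop over the quantities sold and append each item's price to one big list_of_all_sold (there are other ways to do this; this is one example):
c = ['a', 'b', 'c']
p = [200, 120, 40]
qs = [3, 4, 12]
list_of_all_sold = []
for i in range(len(qs)):
    # Add the price once for every unit sold at that price
    for x in range(qs[i]):
        list_of_all_sold.append(p[i])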
Then, Python 3.4+ has a statistics module that can be used to find the median:
from statistics import median
median(list_of_all_sold)
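As an alternative sketch for the integer case, NumPy's np.repeat can build the expanded list without an explicit loop. This assumes NumPy is available and that every quantity is a non-negative integer; the variable names simply mirror the example above.

```python
import numpy as np

p = [200, 120, 40]   # prices
qs = [3, 4, 12]      # integer quantities sold at each price

# Repeat each price once per unit sold, then take the ordinary median.
all_sold = np.repeat(p, qs)
print(np.median(all_sold))
```

Like the loop version, this only works while the quantities are whole numbers.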
Edit, to find the median for continuous quantities sold:
You can build a pandas DataFrame, sort it by price, compute the halfway point of the total quantity, and then walk down the sorted rows, subtracting the quantity sold at each price point until you reach that midpoint. Like this:
c = ['a', 'b', 'c', 'd']
p = [200, 120, 40, 30]
qs = [3.3, 4.1, 12.0, 16.76]
# Create a pandas dataframe
import pandas as pd
df = pd.DataFrame({'price': p, 'qs': qs}, index=c)
# Find the position of the median: half of the total quantity sold
median_num_idx = sum(qs) / 2
# Go down the dataframe sorted by price
for index, row in df.sort_values('price').iterrows():
    # Subtract the quantity sold at that price point from the median position
    median_num_idx = median_num_idx - row['qs']
    # Check whether you have reached the median position
    if median_num_idx <= 0:
        print(row['price'])
        break
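For a set of discrete values, the median is found by sorting and taking the center value. However, since your values of Quantity are continuous, it seems you are really looking for the median of a probability distribution, where the relative frequency of the Price distribution is given by Quantity. By sorting the data and taking the cumulative Quantity, we can derive a graphical representation of your problem:
From this plot you can see that the median is 40 (the y value at the midpoint of X). This is to be expected, since the quantities sold at the two lowest prices are very large. The median can be computed from your DataFrame as follows: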
df = df.sort_values('Price')
cumul = df['Quantity Sold'].cumsum()
# Get the row index where the cumulative quantity reaches half the total.
total = df['Quantity Sold'].sum()
index = sum(cumul < 0.5 * total)
# Get the price at that index
result = df['Price'].iloc[index]
Any other quantile of the same data can be computed by using a different fraction of the total.
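The same cumulative-sum idea can be wrapped in a small helper for arbitrary quantiles. This is only a sketch: weighted_quantile is a hypothetical name, not a pandas function, and it assumes the column names Price and Quantity Sold from the question and a fraction q strictly between 0 and 1.

```python
import pandas as pd

def weighted_quantile(df, q, value_col='Price', weight_col='Quantity Sold'):
    """Return the value at which the cumulative weight first reaches q of the total."""
    ordered = df.sort_values(value_col)
    cumul = ordered[weight_col].cumsum()
    threshold = q * ordered[weight_col].sum()
    # Count the rows whose cumulative weight is still below the threshold;
    # that count is the position of the row that crosses it.
    pos = int((cumul < threshold).sum())
    return ordered[value_col].iloc[pos]

df = pd.DataFrame({'Customer': ['a', 'b', 'c', 'd'],
                   'Price': [200, 120, 40, 30],
                   'Quantity Sold': [3.3, 4.1, 12.0, 16.76]}).set_index('Customer')

print(weighted_quantile(df, 0.5))   # median price
print(weighted_quantile(df, 0.25))  # lower quartile
print(weighted_quantile(df, 0.9))   # 90th percentile
```

For this data the three calls land on the rows priced 40, 30, and 120 respectively, because the cumulative quantities are roughly 16.76, 28.76, 32.86, and 36.16.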
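I was searching for "calculate the median of frequency data" and ended up here. I was disappointed that the various variations of this question yield essentially the same result: treat the problem as a given list of values and compute the median. While that may be perfectly correct, in most practical cases frequency data reside (as in this case) in an ordered list of categories and, in the non-trivial case, have a range of values within each category. Given this form, the question is not which interval contains the median, but what a good estimate is of where the median lies within that interval. The U.S. Census Bureau commonly uses the technique of linear interpolation within the interval. The initial basis is the same: find the interval that contains the median. Then create a linear interpolation (you could get fancy with spline interpolation and so on). The code looks like this: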
import numpy as np

def calc_quantile(freqs, bnds, aquantile):
    """
    Calculate an interpolated quantile from a distribution of
    frequency counts (or percents) and their boundary
    definitions. If there are n intervals, the arrays
    must be of length n+1.
    freqs: length = n+1. A distribution of numbers >= 0
        representing counts, weights or percents. For
        consistency in indexing, the first value, freqs[0],
        must be present as a placeholder (pass 0); it helps
        in visualizing what is going on.
    bnds: an array of n+1 numbers which provides the
        definition of the boundary levels. The assumed
        relationship is that bnds[i] < bnds[i+1]. bnds[0]
        represents the lower bound of freqs[1] and bnds[n]
        is the upper bound for interval n. These should
        represent reasonable values. For example, the lower
        bound (bnds[0]) for a first interval representing
        adults under 20 years of age would be 18. For a top
        interval for adults 75 and older, it might be 95. When
        all the population lies within one interval, the
        returned estimate for the median would be the average
        of the top and bottom interval values. In this example,
        if all values were in the top interval the result
        would be 85, an ok general guess.
    aquantile: the value of the quantile; must be > 0 and < 1.
        median = 0.5
    """
    # Create the cumulative fractional distribution
    cume = np.cumsum(freqs) / np.sum(freqs)
    # Find the interval that contains the quantile
    i = np.argmax(cume >= aquantile)
    # Interpolate a value:
    # calculate the fraction of the interval to cover.
    # The width of the frequency interval is:
    #   (cume[i] - cume[i-1])
    # The amount under the quantile is:
    #   (aquantile - cume[i-1])
    f1 = (aquantile - cume[i-1]) / (cume[i] - cume[i-1])
    # The width of the bounds interval is:
    #   wb = bnds[i] - bnds[i-1]
    # bnds[i] is the upper bound of the interval, thus the quantile
    # is the lower bound plus the desired fraction of the width
    # of the interval.
    return bnds[i-1] + f1 * (bnds[i] - bnds[i-1])
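Given the case provided, the following code yields the answer 31.0999, which is a more sensible estimate than 40 if the data are spread across the interval: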
calc_quantile([0, 16.76, 12.0, 4.1, 3.3], [0, 30, 40, 120, 200], 0.5)
Or, using a pandas DataFrame:
import pandas as pd
df = pd.DataFrame.from_dict({'Customer': ['a', 'b', 'c', 'd'],
                             'Price': [200, 120, 40, 30],
                             'Quantity Sold': [3.3, 4.1, 12.0, 16.76]}
                            ).set_index('Customer')
df = df.sort_values('Price')
calc_quantile(np.insert(df['Quantity Sold'].values, 0, 0), np.insert(df.Price.values, 0, 0), 0.5)
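To see where the 31.0999 figure comes from, the interpolation can be worked by hand. The sketch below simply restates the arithmetic for this data set; the variable names are illustrative and not part of calc_quantile.

```python
# Quantities listed in order of ascending price: 30, 40, 120, 200
quantities = [16.76, 12.0, 4.1, 3.3]
total = sum(quantities)               # about 36.16
half = 0.5 * total                    # about 18.08

# The first interval (prices 0 to 30) holds 16.76 units, less than half,
# so the median falls in the second interval (prices 30 to 40).
remaining = half - quantities[0]      # about 1.32 units into the second interval
fraction = remaining / quantities[1]  # about 0.11 of that interval

median_estimate = 30 + fraction * (40 - 30)
print(round(median_estimate, 4))      # 31.1
```

In other words, the median sits about 11% of the way from 30 to 40, assuming sales are spread evenly across that price interval.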