Pandas median over grouped by binned data
I have a dataframe of user, score, times, listing the different scores each user received and the number of times they received each score:
user1, 1, 4
user1, 7, 2
user2, 3, 1
user2, 10, 2
And so on.
I want to calculate the median score for each user.
To do that, I figured I should create a df with the rows repeated, e.g. -
user1,1
user1,1
user1,1
user1,1
user1,7
user1,7
user2,3
user2,10
user2,10
and then use groupby and apply to somehow calculate the median?
My questions -
- Is this the right approach? My df is very large, so the solution has to be time-efficient.
- If this is indeed the way to go - could you show me how? Whatever I try keeps failing.
I believe you need a weighted median. I used the weighted_median function from here; you could also try wquantiles' weighted.median (a rough call sketch follows the demo below), but note it interpolates a bit differently, so you may get unexpected results:
import numpy as np
import pandas as pd

# from here: CC BY-SA by Afshin @ SE
def weighted_median(values, weights):
    '''Compute the weighted median of a list of values. The
    weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming the weights are probabilities:
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on
    the weights corresponds to the first item, which is 0, so the weighted median is 0.'''
    # Convert the weights into probabilities.
    sum_weights = sum(weights)
    weights = np.array([(w * 1.0) / sum_weights for w in weights])
    # Sort values and weights based on values.
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    # Walk the sorted weights until the cumulative probability reaches 0.5.
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            # The 0.5 point falls exactly between two values: average them.
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()
    return values_sorted[median_index]
# end from
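# Quick sanity check against the docstring example above:
# weighted_median([1, 3, 0], [0.1, 0.3, 0.6]) returns 0.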
def wmed(group):
    # Apply the weighted median to each user's (score, times) pairs.
    return weighted_median(group['score'], group['times'])
df = pd.DataFrame([
    ['user1', 1, 4],
    ['user1', 7, 2],
    ['user2', 3, 1],
    ['user2', 10, 2]
], columns=['user', 'score', 'times'])

groups = df.groupby('user')
groups.apply(wmed)
# user
# user1 1
# user2 10
# dtype: int64
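If you prefer the wquantiles route mentioned above, the call would look roughly like this. This is a minimal sketch, assuming the package's documented interface (installed via pip install wquantiles and imported as weighted); keep in mind its interpolation can give different answers than the step-function version above:

import weighted  # from the wquantiles package (import name assumed per its docs)

def wmed_wq(group):
    # Interpolating weighted median; may differ from weighted_median above.
    return weighted.median(group['score'], group['times'])

groups.apply(wmed_wq)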
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Create a dictionary of empty lists keyed on user.
scores = {user: [] for user in df.user.unique()}

# Expand the list of scores for each user using a list comprehension.
_ = [scores[row.user].extend([row.score] * row.times) for row in df.itertuples()]

>>> scores
{'user1': [1, 1, 1, 1, 7, 7], 'user2': [3, 10, 10]}

# Now you can use a dictionary comprehension to calculate the median score for each user.
>>> {user: np.median(scores[user]) for user in scores}
{'user1': 1.0, 'user2': 10.0}
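If you would rather stay entirely inside pandas (and avoid materializing Python lists), the row-repetition idea from the question can be done vectorized with Index.repeat. A minimal sketch, which should also be the more time-efficient route on a large df:

import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Repeat each row 'times' times, then take an ordinary groupby median.
expanded = df.loc[df.index.repeat(df['times'])]
expanded.groupby('user')['score'].median()
# user
# user1     1.0
# user2    10.0
# Name: score, dtype: float64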