Pandas median over grouped by binned data
I have a dataframe of user, score, times, listing the different scores each user received and the number of times they received each score:
user1, 1, 4
user1, 7, 2
user2, 3, 1
user2, 10, 2
And so on.
I want to calculate the median score for each user.
To do that, I figured I should create a df with the rows repeated, e.g. -
user1,1
user1,1
user1,1
user1,1
user1,7
user1,7
user2,3
user2,10
user2,10
and then use groupby and apply to somehow calculate the median?
My questions -
- Is this the right approach? My df is very large, so the solution has to be time-efficient.
- If this is indeed the way to go - could you show me how? Whatever I try keeps failing.
I believe you need a weighted median. I used the weighted_median function from here; you could also try wquantiles' weighted.median (a rough call sketch follows the demo below), but note it interpolates a bit differently, so you may get unexpected results:
import numpy as np
import pandas as pd

# from here: CC BY-SA by Afshin @ SE
def weighted_median(values, weights):
    '''Compute the weighted median of a list of values. The
    weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming the weights are probabilities:
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on
    the weights corresponds to the first item, which is 0, so the weighted median is 0.'''
    # Convert the weights into probabilities.
    sum_weights = sum(weights)
    weights = np.array([(w * 1.0) / sum_weights for w in weights])
    # Sort values and weights based on values.
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    # Walk the sorted weights until the cumulative probability reaches 0.5.
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            # The 0.5 point falls exactly between two values: average them.
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()
    return values_sorted[median_index]
# end from
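# Quick sanity check against the docstring example above:
# weighted_median([1, 3, 0], [0.1, 0.3, 0.6]) returns 0.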
def wmed(group):
    # Apply the weighted median to each user's (score, times) pairs.
    return weighted_median(group['score'], group['times'])
df = pd.DataFrame([
    ['user1', 1, 4],
    ['user1', 7, 2],
    ['user2', 3, 1],
    ['user2', 10, 2]
], columns=['user', 'score', 'times'])

groups = df.groupby('user')
groups.apply(wmed)
# user
# user1 1
# user2 10
# dtype: int64
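If you prefer the wquantiles route mentioned above, the call would look roughly like this. This is a minimal sketch, assuming the package's documented interface (installed via pip install wquantiles and imported as weighted); keep in mind its interpolation can give different answers than the step-function version above:

import weighted  # from the wquantiles package (import name assumed per its docs)

def wmed_wq(group):
    # Interpolating weighted median; may differ from weighted_median above.
    return weighted.median(group['score'], group['times'])

groups.apply(wmed_wq)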
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Create a dictionary of empty lists keyed on user.
scores = {user: [] for user in df.user.unique()}

# Expand the list of scores for each user using a list comprehension.
_ = [scores[row.user].extend([row.score] * row.times) for row in df.itertuples()]

>>> scores
{'user1': [1, 1, 1, 1, 7, 7], 'user2': [3, 10, 10]}

# Now you can use a dictionary comprehension to calculate the median score for each user.
>>> {user: np.median(scores[user]) for user in scores}
{'user1': 1.0, 'user2': 10.0}
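If you would rather stay entirely inside pandas (and avoid materializing Python lists), the row-repetition idea from the question can be done vectorized with Index.repeat. A minimal sketch, which should also be the more time-efficient route on a large df:

import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Repeat each row 'times' times, then take an ordinary groupby median.
expanded = df.loc[df.index.repeat(df['times'])]
expanded.groupby('user')['score'].median()
# user
# user1     1.0
# user2    10.0
# Name: score, dtype: float64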