Create a probability table based on the time of the event in Python

I have a dataset for a university project, obtained after some manipulation of the data:

import pandas as pd

df = pd.DataFrame({
    'duplicates': [
        [('007', "us1", "us2", "time1", 'time2', 4)],
        [('008', "us1", "us2", "time1", 'time2', 5)],
        [('009', "us1", "us2", "time1", 'time2', 6)],
        [('007', 'us2', "us3", "time1", 'time2', 4)],
        [('008', 'us2', "us3", "time1", 'time2', 7)],
        [('009', 'us2', "us3", "time1", 'time2', 11)],
        [('001', 'us5', 'us1', "time1", 'time2', 0)],
        [('008', 'us5', 'us1', "time1", 'time2', 19)],
        [('007', "us3", "us2", "time1", 'time2', 2)],
        [('007', "us3", "us2", "time1", 'time2', 34)],
        [('009', "us3", "us2", "time1", 'time2', 67)]],
    'numberOfInteractions': [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11]
})

'duplicates' is a tuple: (ID, USER1, USER2, TIME USER1, TIME USER2, DELAY BETWEEN TIMES)

Now I have to create a user x user probability table by counting interactions, so for the column us2 we get (1 + 2 + 3)/19, NaN/19, (11 + 1 + 1)/19. Here 1 + 2 + 3 is the numberOfInteractions between us1 and us2 (df[us1, us2], rows 0 to 2 of the first picture).

Here is the code:

import numpy as np

# Prepend the interaction count to every tuple so it survives the explode
df['duplicates'] = df.apply(
    lambda x: [(x['numberOfInteractions'], a, b, c, d, e, f)
               for a, b, c, d, e, f in x.duplicates], axis=1)

# One row per tuple, then sum the interactions per (USER1, USER2) pair
df = (pd.DataFrame(df["duplicates"].explode().tolist(),
                   columns=["numberOfInteractions", "ID", "USER1", "USER2",
                            "TAU1", "TAU2", "DELAY"])
      .groupby(["USER1", "USER2"])["numberOfInteractions"]
      .agg("sum").to_frame().unstack())

# Normalise each column so it sums to 1
df.columns = df.columns.get_level_values(1)
combined = df.index.union(df.columns)
for col in combined:
    if col not in df.columns:
        df[col] = np.nan
    df[col] = df[col] / df[col].sum(skipna=True)
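To make the arithmetic concrete, here is a minimal, self-contained check of the column-wise normalisation; the flattened rows are typed in by hand (they mirror the dataset above) rather than produced by the pipeline itself:

```python
import pandas as pd

# Hand-flattened (USER1, USER2, numberOfInteractions) rows from the dataset
rows = [
    ("us1", "us2", 1), ("us1", "us2", 2), ("us1", "us2", 3),
    ("us2", "us3", 4), ("us2", "us3", 5), ("us2", "us3", 6),
    ("us5", "us1", 7), ("us5", "us1", 8),
    ("us3", "us2", 1), ("us3", "us2", 1), ("us3", "us2", 11),
]
flat = pd.DataFrame(rows, columns=["USER1", "USER2", "numberOfInteractions"])

# Total interactions per (USER1, USER2) pair, one column per USER2
totals = flat.groupby(["USER1", "USER2"])["numberOfInteractions"].sum().unstack()

# Divide each column by its total: column us2 sums to 19, column us1 to 15
probs = totals / totals.sum(skipna=True)
```

For column us2 this reproduces (1+2+3)/19 and (1+1+11)/19, and for us5 -> us1 it gives the single value 1 (= (7+8)/15) that the question wants to split by delay.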

The problem here is that I want a probability based on the last part of the tuple (DELAY BETWEEN TIMES).

So, for example, 'us5', 'us1' has two interactions, one with delay 19 and one with delay 0 (rows 6 and 7 of the first picture), so I would like this probability as a tuple (less than 5, less than 19, less than 60, less than 80, less than 98). In that case df['us5','us1'] would be (7/15, 8/15, 0/15, 0/15, 0/15) instead of the 1 it is today (since my algorithm adds (8+7)/15, which comes out to 1).

That is the idea, but I don't even know how to start.

I think you have two ways to go.

Either use a new column based on the delay and numberOfInteractions (what I would do):

def mapToNbOfInteractionsPerDelay(group):
    # Put the row's interaction count in the bucket matching its delay
    nbOfInteractions = group['numberOfInteractions']
    delay = group['DELAY']

    if delay <= 5:
        return (nbOfInteractions, 0, 0, 0, 0)
    elif delay <= 19:
        return (0, nbOfInteractions, 0, 0, 0)
    elif delay <= 60:
        return (0, 0, nbOfInteractions, 0, 0)
    elif delay <= 80:
        return (0, 0, 0, nbOfInteractions, 0)
    else:
        return (0, 0, 0, 0, nbOfInteractions)


df["nbOfInteractionsPerDelay"] = df[["DELAY", "numberOfInteractions"]].apply(
    mapToNbOfInteractionsPerDelay, axis=1)

Then you can do:

df = (df.groupby(["USER1","USER2"])["nbOfInteractionsPerDelay"]
        .agg(lambda l : tuple([sum(x) for x in zip(*l)])).to_frame().unstack())

Which will give you this:

      nbOfInteractionsPerDelay                                    
USER2                      us1               us2               us3
USER1                                                            
us1                        NaN   (3, 3, 0, 0, 0)               NaN
us2                        NaN               NaN  (4, 11, 0, 0, 0)
us3                        NaN  (1, 0, 1, 11, 0)               NaN
us5            (7, 8, 0, 0, 0)               NaN               NaN

From there you can easily get what you expect.
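That last step can be sketched as follows; this is a minimal sketch that rebuilds the tuple table above by hand, and `normalise_column` is a hypothetical helper name, not part of any library:

```python
import pandas as pd

# The per-delay-bucket counts from the answer above, typed in by hand
table = pd.DataFrame(
    {"us1": [None, None, None, (7, 8, 0, 0, 0)],
     "us2": [(3, 3, 0, 0, 0), None, (1, 0, 1, 11, 0), None],
     "us3": [None, (4, 11, 0, 0, 0), None, None]},
    index=["us1", "us2", "us3", "us5"],
)

def normalise_column(col):
    # Sum every bucket of every tuple in the column, then divide element-wise
    total = sum(sum(t) for t in col.dropna())
    return col.map(lambda t: tuple(v / total for v in t), na_action="ignore")

probs = table.apply(normalise_column)
```

With this, `probs.loc["us5", "us1"]` comes out as the requested (7/15, 8/15, 0, 0, 0), because the us1 column's tuples sum to 15 in total.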

Or you split the dataframe into 5 other dataframes, each holding the values for a specific delay subset, and then merge them.
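A minimal sketch of that second route, assuming the same bucket edges as above and using `pd.cut` to label each row; the two 'us5'/'us1' rows stand in for the full flattened data:

```python
import pandas as pd

# Two sample flattened rows: the us5 -> us1 interactions with delays 0 and 19
flat = pd.DataFrame(
    [("us5", "us1", 0, 7), ("us5", "us1", 19, 8)],
    columns=["USER1", "USER2", "DELAY", "numberOfInteractions"],
)

# Bucket edges taken from the question's thresholds
bins = [-1, 5, 19, 60, 80, float("inf")]
labels = ["<=5", "<=19", "<=60", "<=80", ">80"]
flat["bucket"] = pd.cut(flat["DELAY"], bins=bins, labels=labels)

# One aggregated sub-frame per observed bucket, concatenated side by side
parts = {
    label: g.groupby(["USER1", "USER2"])["numberOfInteractions"].sum()
    for label, g in flat.groupby("bucket", observed=True)
}
merged = pd.concat(parts, axis=1).fillna(0)
```

Normalising `merged` column-group by column-group then gives the same per-bucket probabilities as the first route, without the intermediate tuple column.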