How to write a function to average every nth number in each array in a column of a dataframe?

I have a pandas dataframe like the following:

FileName    Num     Onsets          Offsets          Durations         
FileName1   3       [19, ..., 1023] [188, ..., 1252] [169, ..., 229] 
FileName2   5       [52, ..., 2104] [472, ..., 2457] [420, ..., 353] 
FileName3   4       [18, ..., 1532] [356, ..., 2018] [338, ..., 486] 

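This shows the onset and offset times of events in a time series, along with the duration of each event and the time between events. Each time series is actually a repetition of a small set of events, and the number of events in each set is given in the Num column. For example, in the first row, Onsets, Offsets and Durations might each have 12 values, meaning that the base set of events repeats 4 times. In other words, in each column the pattern looks like [a,b,c,a,b,c,a,b,c,a,b,c].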

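For the Durations column, I need to find the average for each position in the base set. That means the average of every a duration, every b duration, every c duration, and so on. I then need to append these averages to the dataframe in a new column, so the length of the array in the Averages column will equal the value in the Num column.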

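I assume the thing to do is to write a function containing a for loop that goes through each row, indexes each number in Durations, averages every nth number according to the value in the Num column, creates and appends to a new dataframe to store these averages, and then attaches the new dataframe to the original one as a column.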

I imagine it could look something like the following, but I am new to Python and coding, so I'm not sure:

Duration = np.empty(len(Onsets))

def averages(data): 
    
    for ionset,onset in enumerate(onsets):

        Duration[ionset] = #what I described above

How can I do this?

This can be achieved with the apply function in pandas. Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

import pandas as pd

df = pd.DataFrame({
    "FileName": ["FileName1"],
    "Num": 3,
    "Onsets": [[10, 11, 12, 34, 53, 22, 56, 24, 63, 24, 35, 1]],
    "Offsets": [[13, 25, 2, 35, 63, 23, 63, 23, 765, 24, 6, 1]],
    "Durations": [[1, 356, 6, 2, 6, 2, 2 , 2, 6, 65, 23, 2]]
})

def calculate_average(num, values):
    # number of complete repetitions of the base set
    n_values = len(values) // num
    averages = list()
    for i in range(0, num):
        summation = 0
        for j in range(0, n_values):
            # i-th position within the j-th repetition
            summation += values[j*num + i]
        averages.append(summation/n_values)
    return averages
df["Onsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Onsets"]), axis=1)
df["Offsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Offsets"]), axis=1)
df["Durations_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Durations"]), axis=1)
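For comparison, the same per-position averages can be sketched with NumPy by reshaping each list into rows of length Num and averaging down the columns. This is only a minimal sketch, not part of the answer above: the name calculate_average_np is hypothetical, and it assumes the length of each list is an exact multiple of Num (reshape raises a ValueError otherwise).

```python
import numpy as np
import pandas as pd

def calculate_average_np(num, values):
    # hypothetical helper; assumes len(values) is an exact multiple of num
    arr = np.asarray(values, dtype=float).reshape(-1, num)  # one row per repetition
    return arr.mean(axis=0).tolist()  # average each position across repetitions

df = pd.DataFrame({
    "FileName": ["FileName1"],
    "Num": [3],
    "Durations": [[1, 356, 6, 2, 6, 2, 2, 2, 6, 65, 23, 2]],
})
df["Durations_avg"] = df.apply(
    lambda row: calculate_average_np(row["Num"], row["Durations"]), axis=1
)
print(df["Durations_avg"].iloc[0])
# [17.5, 96.75, 4.0]
```

Here position a averages 1, 2, 2 and 65, position b averages 356, 6, 2 and 23, and position c averages 6, 2, 6 and 2.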

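This is one of the reasons why you should avoid storing iterables as values in a dataframe. Most pandas functionality does not work well with this structure.

That said, you can get a solution using df.explode, df.transform and a few other tricks.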

import pandas as pd
import numpy as np

# sample data
# please always provide a runnable line of code with your data
# you can get it with df.head(10).to_dict('split')
np.random.seed(1)
df = pd.DataFrame({
    'FileName': ['FileName1', 'FileName2', 'FileName3'],
    'Num': [3, 5, 4],
    'Onsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Offsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Durations': [np.random.randint(1, 1000, 12) for _ in range(3)]
})
print(df)

    FileName  Num                                             Onsets  \
0  FileName1    3  [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1  FileName2    5  [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2  FileName3    4  [908, 253, 491, 669, 926, 399, 563, 581, 216, ...

                                             Offsets  \
0  [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1  [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2  [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...

                                           Durations
0  [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1  [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2  [934, 314, 596, 167, 649, 289, 419, 779, 280, ...

Code

# work with a temp dataframe
df2 = df[['FileName', 'Num', 'Durations']].explode('Durations')
df2.Durations = df2.Durations.astype(int) # needed only because of how the sample was created
# should not be necessary with your dataframe

df2['tag'] = ( # add cyclic tags to each row, within each FileName
    df2.groupby('FileName').Durations.transform('cumcount') # similar to range(len(group))
    % df2.Num # get the modulo of the row number within the group
)

# get averages and collect into lists
df2 = df2.groupby(['FileName', 'tag']).Durations.mean() # get average
df2.rename('Durations_avgs', inplace=True)

# collect in a list by Filename and merge with original df
df = df.merge(df2.groupby('FileName').agg(list), on='FileName')

Output

    FileName  Num                                             Onsets  \
0  FileName1    3  [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1  FileName2    5  [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2  FileName3    4  [908, 253, 491, 669, 926, 399, 563, 581, 216, ...

                                             Offsets  \
0  [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1  [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2  [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...

                                           Durations  \
0  [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1  [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2  [934, 314, 596, 167, 649, 289, 419, 779, 280, ...

                                      Durations_avgs
0                             [267.0, 506.25, 349.5]
1  [679.6666666666666, 382.3333333333333, 396.5, ...
2  [621.0, 419.6666666666667, 589.0, 344.66666666...
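To see what the cyclic tag does, here is a toy sketch on a hand-made frame small enough to check by eye. The frame toy and the names long_df and avgs are made up for illustration. It calls cumcount directly on the groupby, which should be equivalent to the transform('cumcount') step above.

```python
import pandas as pd

toy = pd.DataFrame({
    "FileName": ["A", "B"],
    "Num": [2, 3],
    "Durations": [[10, 20, 30, 40], [1, 2, 3, 4, 5, 6]],
})

# one row per list element; the other columns are repeated
long_df = toy.explode("Durations", ignore_index=True)
long_df["Durations"] = long_df["Durations"].astype(float)  # explode leaves object dtype

# position of each element inside its base set: 0..Num-1, repeating
long_df["tag"] = long_df.groupby("FileName").cumcount() % long_df["Num"]
print(long_df["tag"].tolist())
# [0, 1, 0, 1, 0, 1, 2, 0, 1, 2]

# mean per (file, position), then collect the means of each file into a list
avgs = (
    long_df.groupby(["FileName", "tag"])["Durations"].mean()
    .groupby(level="FileName").agg(list)
)
print(avgs.to_dict())
# {'A': [20.0, 30.0], 'B': [2.5, 3.5, 4.5]}
```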

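Update

Kshitij had the good idea of defining a function for this (in case you want to do it for multiple columns). However, it is best to avoid apply when the same thing can be done with native pandas functions.

Here is a function that does this dynamically for any column: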

def get_averages(df: pd.DataFrame, column: str) -> pd.DataFrame:
    '''
    Return a copy of `df` with a new `{column}_avgs` column holding
    the averages of each `Num` cyclical item for each row
    '''
    # work with a new dataframe
    df2 = (
        df[['FileName', 'Num', column]]
        .explode(column, ignore_index=True)
    )
    
    # needed only because of how the sample was created
    # should not be necessary with your dataframe
    df2[column] = df2[column].astype(int)
    
    df2['tag'] = ( # add cyclic tags to each row, within each FileName
        df2.groupby('FileName')[column]
            .transform('cumcount') # similar to range(len(group))
        % df2.Num # get the modulo of the row number within the group
    )
    
    # get averages and collect into lists
    df2 = df2.groupby(['FileName', 'tag'])[column].mean() # get average
    df2.rename(f'{column}_avgs', inplace=True)
    
    # collect in a list by Filename and merge with original df
    df2 = df2.groupby('FileName').agg(list)
    df = df.merge(df2, on='FileName')
    
    return df


df = get_averages(df, 'Durations')
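One caveat: because the function groups and merges on FileName, it relies on FileName being unique per row. If that cannot be guaranteed, a variant keyed on the row index might look like the sketch below. The name add_cycle_averages and the num_col parameter are hypothetical, and the sketch assumes the dataframe index itself is unique.

```python
import pandas as pd

def add_cycle_averages(df: pd.DataFrame, column: str, num_col: str = "Num") -> pd.DataFrame:
    # hypothetical variant: groups on the row index instead of FileName
    exploded = df[[num_col, column]].explode(column)  # index repeats per element
    long_df = pd.DataFrame({
        "row": exploded.index.to_numpy(),
        # position within the base set: running count per row, modulo Num
        "pos": exploded.groupby(level=0).cumcount().to_numpy()
               % exploded[num_col].to_numpy(),
        "value": exploded[column].astype(float).to_numpy(),
    })
    avgs = (
        long_df.groupby(["row", "pos"])["value"].mean()
        .groupby(level="row").agg(list)
    )
    # assign aligns on the index, so each list lands on its original row
    return df.assign(**{f"{column}_avgs": avgs})

demo = pd.DataFrame({
    "FileName": ["same", "same"],  # duplicate names on purpose
    "Num": [2, 3],
    "Durations": [[10, 20, 30, 40], [1, 2, 3, 4, 5, 6]],
})
out = add_cycle_averages(demo, "Durations")
print(out["Durations_avgs"].tolist())
# [[20.0, 30.0], [2.5, 3.5, 4.5]]
```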