How to write a function to average every nth number in each array in a column of a dataframe?
I have a pandas dataframe that looks like the following:
FileName Num Onsets Offsets Durations
FileName1 3 [19, ..., 1023] [188, ..., 1252] [169, ..., 229]
FileName2 5 [52, ..., 2104] [472, ..., 2457] [420, ..., 353]
FileName3 4 [18, ..., 1532] [356, ..., 2018] [338, ..., 486]
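This shows the onset and offset times of events in a time series, the duration of each event, and the time between events. Each time series is actually a repetition of a small set of events, and the number of events per set is in the Num column. For example, in the first row, Onsets, Offsets, and Durations might each have 12 values, which means the base set of events was repeated 4 times. In other words, in each column the pattern looks like [a,b,c,a,b,c,a,b,c,a,b,c].
For the Durations column, I need to find the average for each position in the base set. That means the average of every a duration, every b duration, every c duration, and so on. I then need to append those averages to the dataframe in a new column. This means the length of each array in the Averages column will be equal to the value in the Num column.
I assume the way to do this is to create a function containing a for loop that goes through each row, indexes every number in Durations, averages every nth number based on the value in the Num column, creates and appends to a new dataframe to store the averages, and then attaches the new dataframe to the original one as a column.
I imagine it might look something like the following, but I am new to Python and coding, so I'm not sure: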
Duration = np.empty(len(Onsets))
def averages(data):
    for ionset, onset in enumerate(onsets):
        Duration[ionset] = # what I described above
How can I accomplish this?
This can be achieved with the apply function in pandas.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
import pandas as pd
df = pd.DataFrame({
    "FileName": ["FileName1"],
    "Num": 3,
    "Onsets": [[10, 11, 12, 34, 53, 22, 56, 24, 63, 24, 35, 1]],
    "Offsets": [[13, 25, 2, 35, 63, 23, 63, 23, 765, 24, 6, 1]],
    "Durations": [[1, 356, 6, 2, 6, 2, 2, 2, 6, 65, 23, 2]]
})
def calculate_average(num, values):
    # number of complete repetitions of the base set
    n_values = len(values) // num
    averages = list()
    for i in range(0, num):
        summation = 0
        for j in range(0, n_values):
            # value at position i within the j-th repetition
            summation += values[j*num + i]
        averages.append(summation/n_values)
    return averages
df["Onsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Onsets"]), axis=1)
df["Offsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Offsets"]), axis=1)
df["Durations_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Durations"]), axis=1)
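The per-position averaging can also be sketched with extended slicing, where `values[i::num]` picks every `num`-th value starting at offset `i`. This is only an illustrative variant: `average_by_position` is a hypothetical name, and it assumes `num` is at least 1 and no larger than the length of the list.

```python
import pandas as pd

def average_by_position(num, values):
    """Average the values found at each position of a repeating cycle of length `num`."""
    averages = []
    for i in range(num):
        group = values[i::num]  # every num-th value, starting at offset i
        averages.append(sum(group) / len(group))
    return averages

df = pd.DataFrame({
    "FileName": ["FileName1"],
    "Num": [3],
    "Durations": [[1, 356, 6, 2, 6, 2, 2, 2, 6, 65, 23, 2]],
})
df["Durations_avg"] = df.apply(
    lambda row: average_by_position(row["Num"], row["Durations"]), axis=1
)
print(df["Durations_avg"].iloc[0])  # [17.5, 96.75, 4.0]
```

Because each slice is averaged over its own length, this variant also tolerates a list whose length is not an exact multiple of `num`.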
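This is one of the reasons why you should avoid keeping iterables as values in a dataframe. Most pandas functions do not work well with this kind of structure.
That said, you can get a solution with df.explode and df.transform, along with a few other tricks.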
import pandas as pd
import numpy as np
# sample data
# please always provide a callable line of code with your data
# you can get it with df.head(10).to_dict('split')
# read more about this in
# and
np.random.seed(1)
df = pd.DataFrame({
    'FileName': ['FileName1', 'FileName2', 'FileName3'],
    'Num': [3, 5, 4],
    'Onsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Offsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Durations': [np.random.randint(1, 1000, 12) for _ in range(3)]
})
print(df)
FileName Num Onsets \
0 FileName1 3 [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1 FileName2 5 [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2 FileName3 4 [908, 253, 491, 669, 926, 399, 563, 581, 216, ...
Offsets \
0 [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1 [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2 [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...
Durations
0 [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1 [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2 [934, 314, 596, 167, 649, 289, 419, 779, 280, ...
Code
# work with a temp dataframe
df2 = df[['FileName', 'Num', 'Durations']].explode('Durations')
df2.Durations = df2.Durations.astype(int) # needed only because of how the sample was created
                                          # should not be necessary with your dataframe
df2['tag'] = ( # add cyclic tags to each row, within each FileName
    df2.groupby('FileName').Durations.transform('cumcount') # similar to range(len(group))
    % df2.Num # get the modulo of the row number within the group
)
# get averages and collect into lists
df2 = df2.groupby(['FileName', 'tag']).Durations.mean() # get average
df2.rename('Durations_avgs', inplace=True) # name of the new column
# collect in a list by Filename and merge with original df
df = df.merge(df2.groupby('FileName').agg(list), on='FileName')
Output
FileName Num Onsets \
0 FileName1 3 [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1 FileName2 5 [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2 FileName3 4 [908, 253, 491, 669, 926, 399, 563, 581, 216, ...
Offsets \
0 [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1 [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2 [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...
Durations \
0 [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1 [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2 [934, 314, 596, 167, 649, 289, 419, 779, 280, ...
Durations_avgs
0 [267.0, 506.25, 349.5]
1 [679.6666666666666, 382.3333333333333, 396.5, ...
2 [621.0, 419.6666666666667, 589.0, 344.66666666...
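To make the tagging step more concrete, here is a minimal sketch on a tiny hypothetical frame (`small`, with made-up values chosen so the means are easy to check by hand). It calls `groupby(...).cumcount()` directly, which is a small variation on the approach described in this answer, not the exact code from it.

```python
import pandas as pd

# hypothetical data: "A" repeats a set of 2 events, "B" a set of 3
small = pd.DataFrame({
    "FileName": ["A", "B"],
    "Num": [2, 3],
    "Durations": [[10, 20, 30, 40], [1, 2, 3, 4, 5, 6]],
})

# one row per duration; Num is repeated alongside each value
long = small.explode("Durations", ignore_index=True)
long["Durations"] = long["Durations"].astype(int)

# running counter within each file, wrapped by Num: 0, 1, 0, 1, ... or 0, 1, 2, 0, 1, 2, ...
long["tag"] = long.groupby("FileName").cumcount() % long["Num"]

# mean duration for each (file, position) pair
position_means = long.groupby(["FileName", "tag"])["Durations"].mean()
print(position_means.tolist())  # [20.0, 30.0, 2.5, 3.5, 4.5]
```

The first two values belong to file "A" (positions 0 and 1) and the last three to file "B" (positions 0, 1 and 2).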
Update
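Kshitij shows a good idea: defining a function for this (in case you want to do it for multiple columns). However, it is best to avoid apply if the same thing can be done with native pandas functions.
Here is a function that does this dynamically for any column: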
def get_averages(df: pd.DataFrame, column: str) -> pd.DataFrame:
    '''
    Return `df` merged with a new `<column>_avgs` column holding
    the averages of each `Num` cyclical item for each row
    '''
    # work with a new dataframe
    df2 = (
        df[['FileName', 'Num', column]]
        .explode(column, ignore_index=True)
    )
    # needed only because of how the sample was created
    # should not be necessary with your dataframe
    df2[column] = df2[column].astype(int)
    df2['tag'] = ( # add cyclic tags to each row, within each FileName
        df2.groupby('FileName')[column]
        .transform('cumcount') # similar to range(len(group))
        % df2.Num # get the modulo of the row number within the group
    )
    # get averages and collect into lists
    df2 = df2.groupby(['FileName', 'tag'])[column].mean() # get average
    df2.rename(f'{column}_avgs', inplace=True)
    # collect in a list by Filename and merge with original df
    df2 = df2.groupby('FileName').agg(list)
    df = df.merge(df2, on='FileName')
    return df
df = get_averages(df, 'Durations')
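To sanity-check the averages for a single row, one option is a NumPy reshape. This is a different technique from the ones in the answers, and it assumes the array length is an exact multiple of Num, as in the question's 12-value example. `cycle_means` is a hypothetical helper name.

```python
import numpy as np

def cycle_means(values, num):
    """Mean of each position in a cycle of length `num` (length must be a multiple of `num`)."""
    arr = np.asarray(values, dtype=float)
    if arr.size % num != 0:
        raise ValueError("length of values must be an exact multiple of num")
    # each row of the reshaped array is one repetition of the base set,
    # so the column means are the per-position averages
    return arr.reshape(-1, num).mean(axis=0).tolist()

print(cycle_means([1, 2, 3, 4, 5, 6], 3))  # [2.5, 3.5, 4.5]
```

If the lengths are not exact multiples, the reshape cannot be used as written and the grouping approach is the safer choice.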