Partitioning a dataset into chunks and computing the means of these chunks automatically

I need to generate chunks from my dataset, then compute the mean of each chunk, and finally create a list or array that stores all these means. My goal is to automate the process. For example, my data is [2,5,1,5,3,8,4,2,33,65,34,11,42]. If the chunk size is 3, I want:

part0 = mydata[0 : 3]  => 2, 5, 1      => mean0 = 2.66
part1 = mydata[3 : 6]  => 5, 3, 8      => mean1 = 5.33
part2 = mydata[6 : 9]  => 4, 2, 33     => mean2 = 13.0
part3 = mydata[9 : 12] => 65, 34, 11   => mean3 = 36.66
part4 = mydata[12 : ]  => 42           => mean4 = 42.0

list_of_means = {mean0, mean1, mean2, mean3, mean4}

I just don't know how to create the list of means.

Below is my attempt using a for loop and eval().

import numpy as np
mydata = [2,5,1,5,3,8,4,2,33,65,34,11,42]
chunk_size = 3
index_ref = [n for n in range(len(mydata*1000))]

for i in range(0, len(mydata)):

    globals()[f"part{i}"] ='mydata['+str(index_ref[i*chunk_size])+' : '+str(index_ref[(i*chunk_size)+chunk_size])+']' #This works
    
    globals()[f"mean{i}"] = eval(np.mean(eval('part'+str(i)))) #This brings an error

Try this... then tell me whether it does what you need.

# external parameters
data = [2,5,1,5,3,8,4,2,33,65,34,11,42]
chunk_size = 3

# program - chunker
remainder = len(data) % chunk_size
chunk_amounts = len(data) // chunk_size

if remainder != 0:
    chunk_amounts = len(data)//chunk_size + 1

means = []
for i in range(chunk_amounts):
    chunk = data[chunk_size * i: chunk_size * (i + 1)]
    mean = sum(chunk) / len(chunk)  # float mean; use // instead of / for an integer value
    means.append(mean)

    print(i, chunk, mean)
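The loop above can also be wrapped in a small reusable function; this is just a sketch, assuming a float mean is wanted (the name `chunk_means` is illustrative, not from the original post):

```python
def chunk_means(data, chunk_size):
    """Split data into consecutive chunks of chunk_size and return the mean of each chunk.

    The last chunk may be shorter when len(data) is not a multiple of chunk_size.
    """
    means = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]          # the last slice may be shorter
        means.append(sum(chunk) / len(chunk))   # float mean of this chunk
    return means

print(chunk_means([2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42], 3))
```

Dividing by `len(chunk)` rather than a hard-coded chunk size keeps the mean of the final, shorter chunk correct.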

Edit: exec

for i in range(0, len(mydata)):

    data_slice = 'mydata[' + str(index_ref[i * chunk_size]) + ' : ' + str(
        index_ref[(i * chunk_size) + chunk_size]) + ']'

    exec(f"part{i} = {data_slice}")
    exec(f'mean{i} = sum(part{i})//3')

print(part2)
print(mean2)

Output

[4, 2, 33]
13

Edit: eval

mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
chunk_size = 3
index_ref = [n for n in range(len(mydata*5))]

for i in range(0, len(mydata)):

    data_slice_pattern = 'mydata[' + str(index_ref[i * chunk_size]) + ' : ' + str(
        index_ref[(i * chunk_size) + chunk_size]) + ']'

    globals()[f"part{i}"] = data_slice_pattern
    data_slice = eval(data_slice_pattern)
    globals()[f'mean{i}'] = sum(data_slice)//3

means = [m for m in globals() if 'mean' in m]

print('part2' in globals())
print(eval('part2'))
print('mean2' in globals())
print(eval('mean3'))
print(means)

Output

True
mydata[6 : 9]
True
36
['mean0', 'mean1', 'mean2', 'mean3', 'mean4', 'mean5', 'mean6', 'mean7', 'mean8', 'mean9', 'mean10', 'mean11', 'mean12']

range() can take start, stop, and step arguments, so one of the canonical ways to get the slices is to do something like this:

# verbose
chunks = []

for i in range(0, len(mydata), chunk_size):
    chunks.append(mydata[i:i+chunk_size])

# or as comprehension
chunks = [mydata[i:i+chunk_size] for i in range(0, len(mydata), chunk_size)]

Then compute the mean of each chunk, e.g.,

from statistics import mean

list_of_means = [mean(c) for c in chunks]
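Putting the two snippets together on the question's sample data gives the expected five means (a quick check, nothing beyond the code above):

```python
from statistics import mean

mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
chunk_size = 3

# chunk, then take the mean of each chunk
chunks = [mydata[i:i + chunk_size] for i in range(0, len(mydata), chunk_size)]
list_of_means = [mean(c) for c in chunks]
print(list_of_means)
```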

A general note: instantiating variables by mixing globals and eval is not only very verbose and infeasible for larger datasets, it is also hacky and dangerous (since eval can execute arbitrary, potentially harmful code), and it is not the preferred way of doing things. If you have a bunch of values, use a proper data structure such as a list, a dict, or whatever fits the situation, and process the items with a for loop. If you run into index errors, try to understand what causes them instead of applying ad-hoc patches that don't address the root cause. Just something to keep in mind for the future; we're all here to learn, keep it up!
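To make that concrete, here is a minimal sketch of the dict-based alternative to the globals()/eval approach; the names `parts` and `means` are illustrative, not from the original post:

```python
mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
chunk_size = 3

# a dict maps a key ("part0", "part1", ...) to its chunk; no globals() needed
parts = {f"part{n}": mydata[i:i + chunk_size]
         for n, i in enumerate(range(0, len(mydata), chunk_size))}
means = {key: sum(chunk) / len(chunk) for key, chunk in parts.items()}

print(parts["part2"])   # the third chunk
print(means["part2"])   # its mean
```

Everything stays in two ordinary dicts, so iterating over all means is just `means.values()`, with no string building or eval involved.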

A rather long working solution:

import numpy as np

mydata = [2,5,1,5,3,8,4,2,33,65,34,11,42,76,12,76,31]

# program - chunker
chunk_size = 3 
remainder = len(mydata) % chunk_size 
chunk_amounts = len(mydata) // chunk_size
if remainder != 0:
    chunk_amounts = len(mydata)//chunk_size + 1
    
# prepare indexing
index_ref = [n for n in range(len(mydata*chunk_amounts))] #to avoid index out of range

# partitioning and mean calculation
for i in range(chunk_amounts):

    globals()[f"part{i}"] ='mydata['+str(index_ref[i*chunk_size])+' : '+str(index_ref[(i*chunk_size)+chunk_size])+']'
      
    globals()[f"mean{i}"] = np.mean(eval(eval('part'+str(i))))

# build mean list
mean_list = []
for i in range(0, chunk_amounts):
    mean_list.append('mean'+str(i))

# remove quotes in list elements for easier export to numpy array
mean_list = ", ".join(mean_list)
print(mean_list)

Output:

mean0, mean1, mean2, mean3, mean4, mean5

Post-processing:

mean_array = np.array([mean0, mean1, mean2, mean3, mean4, mean5])
mean_array = mean_array.round(2)
mean_array

Output:

[ 2.67,  5.33, 13.  , 36.67, 43.33, 53.5 ]

Numpy can help. If you know the number of chunks you want, you can use Numpy to create an array of n chunks; Numpy works out the chunk sizes automatically.

import numpy as np
mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
mydata = np.array(mydata) #convert list to Numpy array
number_chuncks = 6 #wanted number of chunks
chunks = np.array_split(mydata,number_chuncks)
print('chunks: ', chunks)

Result:

chunks:  [array([2, 5, 1]), array([5, 3]), array([8, 4]), array([ 2, 33]), array([65, 34]), array([11, 42])]

The expected chunk size can be estimated like this:

data_length = mydata.size
spacing = np.linspace(0, data_length, number_chuncks)
chunk_size = spacing[1]
print('Expected Chunk size: ',chunk_size)

Result:

Expected Chunk size:  2.6

That is why some chunks in the chunks array have size 3 while others have size 2.
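The size distribution is easy to verify directly; per NumPy's documentation, np.array_split puts the larger chunks first (for length l split into n parts: l % n chunks of size l // n + 1, the rest of size l // n):

```python
import numpy as np

mydata = np.array([2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42])
chunks = np.array_split(mydata, 6)   # 13 elements = 1 chunk of 3 + 5 chunks of 2
print([len(c) for c in chunks])      # -> [3, 2, 2, 2, 2, 2]
```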

Finally, the means can be computed as:

list_of_means = [np.mean(c) for c in chunks]
print('Means: ',list_of_means)

Result:

Means:  [ 2.67  4.    6.   17.5  49.5  26.5 ]