Partitioning a dataset into chunks and computing the means of these chunks automatically
I need to generate small chunks from the dataset, then compute the mean of each chunk, and finally create a list or array to store all these means. My goal is to automate this process.
For example:
My data is [2,5,1,5,3,8,4,2,33,65,34,11,42]. If the chunk size is 3, then I would like to have:
part0 = mydata[0 : 3] => 2, 5, 1 => mean0 = 2.66
part1 = mydata[3 : 6] => 5, 3, 8 => mean1 = 5.33
part2 = mydata[6 : 9] => 4, 2, 33 => mean2 = 13.0
part3 = mydata[9 : 12] => 65, 34, 11 => mean3 = 36.66
part4 = mydata[12 : ] => 42 => mean4 = 42.0
list_of_means = {mean0, mean1, mean2, mean3, mean4}
I just don't know how to create the list of means.
Below is my attempt using a for loop and eval().
import numpy as np

mydata = [2,5,1,5,3,8,4,2,33,65,34,11,42]
chunk_size = 3
index_ref = [n for n in range(len(mydata*1000))]
for i in range(0, len(mydata)):
    globals()[f"part{i}"] = 'mydata['+str(index_ref[i*chunk_size])+' : '+str(index_ref[(i*chunk_size)+chunk_size])+']'  # This works
    globals()[f"mean{i}"] = eval(np.mean(eval('part'+str(i))))  # This raises an error
Try this... and let me know whether it does what you need.
# external parameters
data = [2,5,1,5,3,8,4,2,33,65,34,11,42]
chunk_size = 3

# program - chunker
remainder = len(data) % chunk_size
chunk_amounts = len(data) // chunk_size
if remainder != 0:
    chunk_amounts = len(data) // chunk_size + 1

means = []
for i in range(chunk_amounts):
    chunk = data[chunk_size * i : chunk_size * (i + 1)]
    mean = sum(chunk) / len(chunk)  # float mean; use // for an integer value
    means += [mean]
    print(i, chunk, mean)
Edit: with exec
for i in range(0, len(mydata)):
    data_slice = 'mydata[' + str(index_ref[i * chunk_size]) + ' : ' + str(
        index_ref[(i * chunk_size) + chunk_size]) + ']'
    exec(f"part{i} = {data_slice}")
    exec(f'mean{i} = sum(part{i})//3')

print(part2)
print(mean2)
Output:
[4, 2, 33]
13
Edit: eval
mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
chunk_size = 3
index_ref = [n for n in range(len(mydata*5))]

for i in range(0, len(mydata)):
    data_slice_pattern = 'mydata[' + str(index_ref[i * chunk_size]) + ' : ' + str(
        index_ref[(i * chunk_size) + chunk_size]) + ']'
    globals()[f"part{i}"] = data_slice_pattern
    data_slice = eval(data_slice_pattern)
    globals()[f'mean{i}'] = sum(data_slice)//3
means = [m for m in globals() if 'mean' in m]
print('part2' in globals())
print(eval('part2'))
print('mean2' in globals())
print(eval('mean3'))
print(means)
Output:
True
mydata[6 : 9]
True
36
['mean0', 'mean1', 'mean2', 'mean3', 'mean4', 'mean5', 'mean6', 'mean7', 'mean8', 'mean9', 'mean10', 'mean11', 'mean12']
range() can take start, end, and step arguments. So one of the canonical ways to get the slices is to do something like this:
# verbose
chunks = []
for i in range(0, len(mydata), chunk_size):
    chunks.append(mydata[i:i+chunk_size])
# or as comprehension
chunks = [mydata[i:i+chunk_size] for i in range(0, len(mydata), chunk_size)]
Then compute the mean of each chunk, e.g.:
from statistics import mean
list_of_means = [mean(c) for c in chunks]
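Putting the slicing and the averaging together, a minimal sketch of the whole pipeline as a reusable function (the name chunk_means is just illustrative):

```python
from statistics import mean

def chunk_means(data, chunk_size):
    """Split data into consecutive chunks of chunk_size and return the mean of each chunk."""
    return [mean(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
print(chunk_means(mydata, 3))
```

The last chunk may be shorter than chunk_size; mean() handles that correctly because it divides by the chunk's own length.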
General note: mixing globals and eval to instantiate variables is not only very verbose and infeasible for larger datasets, it is also hacky and dangerous (eval can execute arbitrary, potentially harmful code), and it is not the preferred way of doing things. If you have a bunch of values, use a proper data structure such as a list, a dict, or whatever fits the situation, and process the items with a for loop. If you run into index errors, try to understand what causes them instead of applying ad-hoc patches that don't address the root cause. Just something to keep in mind for the future; we're all here to learn, keep it up!
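As a concrete sketch of that advice, the dynamically created part0/mean0 variables can be replaced by plain dicts keyed by chunk index (the names parts and means here are just illustrative):

```python
mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
chunk_size = 3

parts = {}   # chunk index -> chunk
means = {}   # chunk index -> mean of that chunk
for idx, start in enumerate(range(0, len(mydata), chunk_size)):
    chunk = mydata[start:start + chunk_size]
    parts[idx] = chunk
    means[idx] = sum(chunk) / len(chunk)

print(parts[2], means[2])  # [4, 2, 33] 13.0
```

Looking up parts[2] then replaces the eval('part2') dance, and means.values() gives all the means at once.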
A rather lengthy working solution:
import numpy as np

mydata = [2,5,1,5,3,8,4,2,33,65,34,11,42,76,12,76,31]

# program - chunker
chunk_size = 3
remainder = len(mydata) % chunk_size
chunk_amounts = len(mydata) // chunk_size
if remainder != 0:
    chunk_amounts = len(mydata)//chunk_size + 1

# prepare indexing
index_ref = [n for n in range(len(mydata*chunk_amounts))]  # to avoid index out of range

# partitioning and mean calculation
for i in range(0, chunk_amounts):
    globals()[f"part{i}"] = 'mydata['+str(index_ref[i*chunk_size])+' : '+str(index_ref[(i*chunk_size)+chunk_size])+']'
    globals()[f"mean{i}"] = np.mean(eval(eval('part'+str(i))))

# build mean list
mean_list = []
for i in range(0, chunk_amounts):
    mean_list.append('mean'+str(i))

# remove quotes in list elements for easier export to numpy array
mean_list = ", ".join(mean_list)
print(mean_list)
Output:
mean0, mean1, mean2, mean3, mean4, mean5
Post-processing:
mean_array = np.array([mean0, mean1, mean2, mean3, mean4, mean5])
mean_array = mean_array.round(2)
mean_array
Output:
[ 2.67, 5.33, 13. , 36.67, 43.33, 53.5 ]
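For comparison, the same rounded array can be built without any string juggling, directly from list-comprehension chunks (a sketch using the same 17-element data):

```python
import numpy as np

mydata = [2,5,1,5,3,8,4,2,33,65,34,11,42,76,12,76,31]
chunk_size = 3

# slice into chunks, average each, round to two decimals
chunks = [mydata[i:i + chunk_size] for i in range(0, len(mydata), chunk_size)]
mean_array = np.array([np.mean(c) for c in chunks]).round(2)
print(mean_array)
```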
Numpy can help.
If you know the number of chunks, you can use Numpy to split the data into n chunks; Numpy works out the chunk sizes automatically.
import numpy as np
mydata = [2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42]
mydata = np.array(mydata) #convert list to Numpy array
number_chuncks = 6 #wanted number of chunks
chunks = np.array_split(mydata,number_chuncks)
print('chunks: ', chunks)
Result:
chunks: [array([2, 5, 1]), array([5, 3]), array([8, 4]), array([ 2, 33]), array([65, 34]), array([11, 42])]
The expected (average) chunk size can be computed like this:
data_length = mydata.size
# number_chuncks + 1 boundaries -> spacing[1] is the average chunk width
spacing = np.linspace(0, data_length, number_chuncks + 1)
chunk_size = spacing[1]
print('Expected Chunk size: ', chunk_size)
Result:
Expected Chunk size:  2.1666666666666665
That is why some chunks in the chunks array have size 3 while others have size 2.
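The sizes np.array_split actually chose can be inspected directly (a small sketch):

```python
import numpy as np

mydata = np.array([2, 5, 1, 5, 3, 8, 4, 2, 33, 65, 34, 11, 42])
chunks = np.array_split(mydata, 6)
print([len(c) for c in chunks])  # the first chunk absorbs the remainder
```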
Finally, the means can be computed as:
list_of_means = [np.mean(c) for c in chunks]
print('Means: ',list_of_means)
Result:
Means: [ 2.67 4. 6. 17.5 49.5 26.5 ]