How to drop data above a certain frequency in a histogram/dataset?

Question

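To be clear, I don't want to remove entire bins from the histogram; I only want to drop some of the data so that each bin falls below a desired frequency. The line in the figure shows the maximum frequency I want.

For context, I have a dataset containing many angles. In terms of the data involved, my problem is very similar to the one asked here, but unlike the question in the link, I don't want to remove the data, only reduce it.

Can I do this directly on the histogram, or do I need to remove some of the data from the dataset itself?

Edit (sorry, I'm new to coding and to formatting here): this is the solution I tried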

bns = 30
hist, bins  = np.histogram(dataset['Steering'], bins= bns)
removeddata = []

spb = 700
for j in range(bns):
    rdata = []
    for i in range(len(dataset['Steering'])):
        if dataset['Steering'][i] >= bins[j] and dataset['Steering'][i] <= bins[j+1]:
            rdata.append(i)
    rdata = shuffle(rdata)
    rdata = rdata[spb:]
    removeddata.extend(rdata)


print('removed:', len(removeddata))
dataset.drop(dataset.index[removeddata], inplace = True)
print ('remaining:', len(dataset))



center = (bins[:-1] + bins[1:])*0.5
plt.bar(center,hist,width=0.05)
plt.show()

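This is someone else's solution, and it seemed to work for them. Even when copied directly, it still throws an error for me. The error I get is "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". I tried changing 'and' to & and got "TypeError: cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]". I'm not sure exactly what this refers to, but it points to the line with the if statement. I checked all the data types and they are all float64, so I'm not sure what to try next.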

Answer 1

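This answer does not re-bin or re-center the data, but I believe it broadly achieves what you asked for. Starting from the example in the accepted answer of the post you linked, I edit the hist array so that the original input data is left unchanged, which you indicated is your preferred solution: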

import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1,2)
ax1.set_title("Some data")
ax2.set_title("Gated data < threshold")

np.random.seed(10)
data = np.random.randn(1000)

num_bins = 23
avg_samples_per_bin = 200

hist, bins = np.histogram(data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)

threshold = 80

gated = np.empty([len(hist)], dtype=np.int64)
for i in range(len(hist)):
    if hist[i] > threshold:
        gated[i] = threshold
    else:
        gated[i] = hist[i]

ax2.bar(center, gated, align="center", width=width)

plt.show()

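This gives two bar charts side by side: the original histogram on the left and, on the right, the same histogram with every bin capped at the threshold.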
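As a side note, the capping loop can probably be collapsed into a single NumPy call. The following is only a sketch under the same assumptions as the example above (random normal data, 23 bins, a threshold of 80); `np.minimum` takes the element-wise minimum of the bin counts and the threshold, so it should produce the same `gated` array as the loop.

```python
import numpy as np

# Stand-in data; any 1-D array of values works the same way
rng = np.random.default_rng(10)
data = rng.standard_normal(1000)

hist, bins = np.histogram(data, 23)
threshold = 80

# Element-wise minimum: every count above the threshold becomes the threshold
gated = np.minimum(hist, threshold)
```

Like the loop version, this only changes what gets plotted; the underlying data is untouched.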

Answer 2

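This solution addresses the explicit requirement of discarding raw input data that exceeds the frequency threshold. I have left my other answer in place because it is simpler and different, and may be useful to other users.

To clarify, this answer produces a new one-dimensional data array containing fewer elements and then plots a histogram from that new data. The data is shuffled before elements are removed (in case the input data was pre-sorted) to avoid bias from dropping data on the low or high side of each bin.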

import numpy as np
import matplotlib.pyplot as plt
from random import shuffle


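def remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst):
    """Delete values from bin idx, one per call, until its excess count is zero."""
    if to_gate_lst[idx] == 0:
        # Nothing (more) to remove from this bin
        return data_lst
    else:
        bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
        # Find the first value that falls in this bin and delete it in place
        for i in range(len(data_lst)):
            if bin_min <= data_lst[i] < bin_max:
                del data_lst[i]
                to_gate_lst[idx] -= 1
                break
        # Recurse until the bin is down to the threshold
        return remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst)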

    
threshold = 80

fig, ax1 = plt.subplots()
ax1.set_title("Some data")

np.random.seed(30)
data = np.random.randn(1000)

num_bins = 23

raw_hist, raw_bins = np.histogram(data, num_bins)

to_gate = []
for i in range(len(raw_hist)):
    if raw_hist[i] > threshold:
        to_gate.append(raw_hist[i] - threshold)
    else:
        to_gate.append(0)

data_lst = list(data)
shuffle(data_lst)

for idx in range(len(raw_hist)):
    remove_gated_val_recursive(idx, to_gate, raw_bins, data_lst)
    
new_data = np.array(data_lst)
hist, bins = np.histogram(new_data, num_bins)

width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)

plt.show()

This gives the following histogram, plotted from the new_data array.
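For larger datasets, the same idea (randomly discard the excess values in each over-full bin) can be sketched without recursion by building a boolean keep-mask. This is a rough sketch, not a tested drop-in: `cap_bin_mask` and its parameters are names made up for this example, and it assumes the bin edges come from `np.histogram` on the same data. Values outside the edges would simply be lumped into the first or last bin.

```python
import numpy as np


def cap_bin_mask(values, bin_edges, max_per_bin, seed=None):
    """Return a boolean mask that keeps at most max_per_bin values per bin."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    n_bins = len(bin_edges) - 1
    # Digitize against the interior edges only, so indices run 0..n_bins-1
    # and a value equal to the right-most edge lands in the last bin,
    # which is how np.histogram treats it
    bin_idx = np.digitize(values, bin_edges[1:-1])
    keep = np.ones(len(values), dtype=bool)
    for b in range(n_bins):
        members = np.flatnonzero(bin_idx == b)
        excess = len(members) - max_per_bin
        if excess > 0:
            # Pick the excess members at random so no side of the bin is favoured
            drop = rng.choice(members, size=excess, replace=False)
            keep[drop] = False
    return keep


threshold = 80
data = np.random.default_rng(30).standard_normal(1000)
raw_hist, raw_bins = np.histogram(data, 23)

mask = cap_bin_mask(data, raw_bins, threshold, seed=0)
new_data = data[mask]
# Reuse the original edges so the bins line up with the first histogram
new_hist, _ = np.histogram(new_data, raw_bins)
```

For the DataFrame in the question, the mask could presumably be computed from `dataset['Steering'].to_numpy()` and applied as `dataset = dataset[mask]`. Working on the whole column at once also sidesteps the element-by-element `dataset['Steering'][i]` lookups, which is where the "truth value of a Series is ambiguous" error is raised; one plausible cause, which I can't confirm from the post, is a non-unique index making that lookup return a Series instead of a single float.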


Tags: python, histogram