Is there a way to cut only the first gap from a histogram and keep all the remaining values in Python?

Question

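I have a DataFrame with the following fields: 'unique years' and 'counts'. I plotted this DataFrame and got the following histogram: histogram - example. I need to define a start year variable, but if the histogram has empty bins near its beginning, I need to skip them and move the start year forward. I would like to know whether there is a Pythonic way to do this. In the histogram - example plot there is a non-empty bin at the very start, followed by a large gap of empty bins. So I need to find the point from which the bins are consecutively non-empty and define that point as the start year (for the example above, the start year should be 1935). The n numpy.ndarray tells me whether each bin is empty, but I need an efficient way to solve this. Thanks :)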

An example of my DataFrame:

import pandas as pd

data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts'      : [11, 14, 438, 85, 8]}

df = pd.DataFrame(data, columns = ['unique_years', 'counts'])

Histogram code:

import matplotlib.pyplot as plt
# n holds the count in each bin; bins holds the bin edges
(n, bins, patches) = plt.hist(df.unique_years, bins=25, label='hst')
plt.show()

Answer 1

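This should eliminate every year that has no adjacent year in the data. It works directly on df, and you can use the result to plot the histogram.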

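df = pd.DataFrame(data, columns = ['unique_years', 'counts'])
# True where a year is exactly one more than the year in the previous row
yd = df.unique_years.diff().eq(1)
# Keep those rows plus the row that starts each run of consecutive years
df[yd | yd.shift(-1, fill_value=False)]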

This is the result you will get:
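Run on the sample data from the question, the filter keeps 1938, 1939 and 1940. The self-contained sketch below reproduces that; the variable name `result` is introduced here for illustration, and `fill_value=False` is an addition that keeps the shifted mask boolean.

```python
import pandas as pd

data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts'      : [11, 14, 438, 85, 8]}
df = pd.DataFrame(data, columns=['unique_years', 'counts'])

# True where a year is exactly one more than the year in the previous row
yd = df.unique_years.diff().eq(1)

# Keep those rows plus the row that starts each run of consecutive years
result = df[yd | yd.shift(-1, fill_value=False)]
print(result.unique_years.tolist())  # [1938, 1939, 1940]
```

On this data 1935 is dropped as well, because 1936 is missing, so this approach fits the strict reading in which every single year must be present.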

Answer 2

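The problem with your question is that 'continuous' is not well defined here. Do you mean that every year should have a non-empty count (which is easy to do, since you can filter your data before building the histogram), or that every consecutive bucket should be non-empty? If the latter, it means you have to: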

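Build the histogram
Filter the data using the resulting bins
Either use the filtered histogram or re-bin the remaining data; the bin size is not guaranteed to stay the same (so you may run into the same problem with the new bins!)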

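Since it is hard to know exactly what is relevant to your situation, I think the best answer is to give you a set of tools that you can adapt to your needs to solve the specific problem you have: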

I want to filter my data starting from a certain date

filtered = df.unique_years[df.unique_years > 1930]
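If the cutoff should come from the data rather than being hard-coded, one possible approach is to start right after the first large jump between consecutive years. This is only a sketch: `max_gap` is a hypothetical threshold chosen for illustration, and the years are assumed to be sorted in ascending order.

```python
import pandas as pd

df = pd.DataFrame({'unique_years': [1907, 1935, 1938, 1939, 1940],
                   'counts': [11, 14, 438, 85, 8]})

max_gap = 5  # hypothetical: largest jump between years still treated as "no gap"

years = df.unique_years
# Difference to the previous year; the first row has no predecessor (NaN)
jumps = years.diff()

# Years that come directly after a jump larger than max_gap
after_gap = years[jumps > max_gap]

# Start after the first such gap, or at the first year when there is no gap
start_year = int(after_gap.iloc[0]) if not after_gap.empty else int(years.iloc[0])

filtered = df[df.unique_years >= start_year]
print(start_year)  # 1935
```

With the sample data this gives 1935, the start year asked for in the question.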

I want to find the second non-empty bin

import numpy as np
(n, x) = np.histogram(df.unique_years, bins=25)  # n: counts per bin, x: bin edges
second_nonempty = np.where(n > 0)[0][1]  # index of the second non-empty bin
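To turn that bin index into a start year, one option is to take the left edge of the bin and pick the first year at or after it. The sketch below assumes the sample data from the question; `nonempty`, `left_edge` and `start_year` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'unique_years': [1907, 1935, 1938, 1939, 1940],
                   'counts': [11, 14, 438, 85, 8]})

# n: counts per bin (25 values), x: bin edges (26 values)
n, x = np.histogram(df.unique_years, bins=25)

nonempty = np.where(n > 0)[0]      # indices of the bins that contain data
second_nonempty = nonempty[1]      # first non-empty bin after the leading one

left_edge = x[second_nonempty]     # left edge of that bin
start_year = int(df.unique_years[df.unique_years >= left_edge].min())
print(second_nonempty, start_year)  # 21 1935
```

This relies on the histogram having at least two non-empty bins; otherwise the `[1]` lookup raises an `IndexError`.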

From there you can:

Re-bin the filtered data:

(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Re-binning on the filtered data: keep years at or after the left edge of that bin
plt.hist(df.unique_years[df.unique_years >= x[second_nonempty]], bins=25)

Plot the histogram directly on the filtered bins:

(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Forcing the bins to take the provided values
plt.hist(df.unique_years, bins=x[second_nonempty:])

The 'second_nonempty' above can of course be replaced with whatever estimator you want to start from, for example:

# Last empty bin + 1
all_bins_full_after = np.where(n == 0)[0][-1] + 1
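One caveat with this estimator is that `np.where(n == 0)[0][-1]` raises an `IndexError` when the histogram has no empty bins. A guarded variant might look like the following; the helper name `first_bin_after_last_gap` is hypothetical and not part of NumPy.

```python
import numpy as np

def first_bin_after_last_gap(n):
    """Return the index of the first bin after the last empty one (0 if no bin is empty)."""
    empty = np.where(np.asarray(n) == 0)[0]
    return int(empty[-1]) + 1 if empty.size else 0

years = np.array([1907, 1935, 1938, 1939, 1940])
n, x = np.histogram(years, bins=25)

start_bin = first_bin_after_last_gap(n)
print(start_bin)                            # 23
print(first_bin_after_last_gap([3, 1, 4]))  # 0, since no bin is empty
```

On the sample data the left edge `x[23]` is about 1937.36, so the first year kept with this estimator is 1938 rather than 1935.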

Or anything else, really.
