How to split a series by the longest repetition of a number in python?

Question

import pandas as pd

df = pd.DataFrame({
    'label':[f"subj_{i}" for i in range(28)],
    'data':[i for i in range(1, 14)] + [1,0,0,0,2] + [0,0,0,0,0,0,0,0,0,0]
})

I have a dataset similar to this one.

I want to cut it where the longest run of repeated 0s begins, so I want to cut at index 18, but I want to leave indices 14-16 intact. So far I have tried things like:

Counters

cad_recorder = 0
new_index = []
for i,row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)

* But obviously that won't work, since the indices will be rewritten at each occurrence of zero.

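I also tried dictionaries, but I'm not sure how to compare the previous and next values when using iterrows.
I also took a rolling mean over X rows at a time and recorded the index whenever it was zero. But then I got stuck on actually inferring the index range, or on finding the longest sequence of zeros.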
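For the previous-value comparison mentioned above, `Series.shift` can stand in for `iterrows`. The following is only a minimal sketch on the sample data; the variable names are illustrative and not part of the original attempt:

```python
import pandas as pd

# Same values as the 'data' column of the sample frame
data = pd.Series(list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10)

is_zero = data.eq(0)                            # True where the value is zero
prev_is_zero = is_zero.shift(fill_value=False)  # is_zero of the row one step earlier
run_starts = is_zero & ~prev_is_zero            # True only on the first row of each zero run

# Index labels at which a run of zeros begins
start_positions = run_starts[run_starts].index.tolist()
print(start_positions)
```

This only marks where each run of zeros starts; picking the longest run still requires counting the length of each run.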

Edit: A friend of mine suggested the following logic, which gives the same result as @shubham-sharma's answer. The answerer's solution is more pythonic and elegant.

def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of values <= 1 begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0


    for i in range(len(df['data'])):
        if df['data'].iloc[i] <= 1:  # look up the 'data' column by name rather than by position
            if current_length == 0:
                start_idx = i
            current_length += 1

            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx

The code I used, following @shubham-sharma's solution:

cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()

for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1 # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        og_df_sof = pd.concat([og_df_sof, temp_df])
        cut_df_sof = pd.concat([cut_df_sof, temp_df.iloc[:idx, :]])

Answer 1

We can use a boolean mask and cumsum to identify the blocks of zeros, then groupby and transform these blocks using count, followed by idxmax to get the starting index of the block with the most consecutive zeros.

m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()

print(idx)

18
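To see why this works, the one-liner can be unpacked into its intermediate steps. The sketch below rebuilds the sample frame and uses illustrative names (`block_id`, `run_len`, `head`) that are not in the original answer. The final slice assumes the default `RangeIndex`, so that the label returned by `idxmax` is also a valid position for `iloc`:

```python
import pandas as pd

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10,
})

m = df['data'].eq(0)      # True where the value is zero
block_id = (~m).cumsum()  # increases at every non-zero row, so each zero run shares one id

# For every zero row, the length of the run it belongs to
run_len = m[m].groupby(block_id).transform('count')

# idxmax returns the first label holding the maximum, i.e. the start of the longest run
idx = run_len.idxmax()

# Assumption: default RangeIndex, so the label idx can be used as a position
head = df.iloc[:idx]
print(idx, len(head))
```

On the sample data the short run of zeros at indices 14-16 stays in `head`, and only the rows from the start of the longest run onward are dropped.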
