How to split a series by the longest repetition of a number in Python?
import pandas as pd

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1, 0, 0, 0, 2] + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})
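I have a dataset similar to the one above.
I want to cut it where the longest repetition of 0 begins, so here I want to cut at index 18 while leaving indices 14-16 intact. So far I have tried the following:
- A counter: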
cad_recorder = 0
new_index = []
for i, row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
* But obviously that won't work, since the indices will be rewritten at each occurrence of zero.
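- I also tried a dictionary, but I'm not sure how to compare the previous and next values while using iterrows.
- I also took a rolling mean over X rows at a time and recorded an index whenever it was zero. But then I got stuck on actually inferring the index range, or on finding the longest sequence of zeros.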
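For the step where the rolling-mean attempt stalls (recovering the index range of each run), here is a minimal sketch using NumPy edge detection. The helper name `longest_run_bounds` and its `threshold` parameter are made up for illustration and are not part of any library; the sample data is the frame from the top of the question.

```python
import numpy as np
import pandas as pd

def longest_run_bounds(values, threshold=0):
    """Return (start, stop) of the longest run where values <= threshold.

    stop is exclusive. Returns None when no element satisfies the condition.
    Hypothetical helper, written only for this illustration.
    """
    is_low = np.asarray(values) <= threshold
    # Pad with False on both sides so every run has a detectable start and end.
    padded = np.concatenate(([False], is_low, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # first index of each run
    stops = np.flatnonzero(edges == -1)    # one past the last index of each run
    if starts.size == 0:
        return None
    longest = np.argmax(stops - starts)    # the earliest run wins on ties
    return int(starts[longest]), int(stops[longest])

data = list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10
df = pd.DataFrame({"label": [f"subj_{i}" for i in range(28)], "data": data})

start, stop = longest_run_bounds(df["data"])
trimmed = df.iloc[:start]  # keeps rows 0-17, including the short zero run at 14-16
```

On the sample data the two zero runs are 14-16 and 18-27, so the cut lands at index 18 and the shorter run survives.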
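Edit: A friend of mine came up with the following logic, which gives the same result as @shubham-sharma's answer. @shubham-sharma's solution is the more Pythonic and elegant of the two.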
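def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of values <= 1 begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    # The original read df.iloc[i, 9] (column position in the real dataset);
    # the 'data' column is used here so the function runs on the sample frame.
    values = df['data'].to_numpy()
    for i in range(len(values)):
        if values[i] <= 1:
            if current_length == 0:
                start_idx = i  # a new run starts here
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0  # run broken by a value above the threshold
    return max_idx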
The code I used, following @shubham-sharma's solution:
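cut_us_sof = {}                # label -> index where the longest run starts
og_parts, cut_parts = [], []   # DataFrame.append was removed in pandas 2.0, so collect and concat

for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1  # some values in actual dataset were 0.0000001
    # (~mask).cumsum() stays constant within a run of masked rows, so it works as a run id
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    if counts.empty:  # no values at or below the threshold for this label
        continue
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if idx > 2000 and counts.loc[idx] > 500:
        cut_us_sof[lab] = idx
        og_parts.append(temp_df)
        cut_parts.append(temp_df.iloc[:idx, :])

og_df_sof = pd.concat(og_parts) if og_parts else pd.DataFrame()
cut_df_sof = pd.concat(cut_parts) if cut_parts else pd.DataFrame()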
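To make the mask / cumsum / transform chain easier to follow, here is a small walkthrough on the sample data from the top of the question, without the per-label loop or the length thresholds. The intermediate names `run_id` and `run_len` are mine, chosen for illustration only.

```python
import pandas as pd

data = list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10
s = pd.Series(data, name="data")

mask = s <= 1              # True for the (near-)zero rows
run_id = (~mask).cumsum()  # only increases on rows above the threshold,
                           # so it is constant within each masked run
# Every masked row receives the length of the run it belongs to.
run_len = s[mask].groupby(run_id[mask]).transform("count")
idx = run_len.idxmax()     # idxmax returns the first maximum, i.e. the run's first row
trimmed = s.iloc[:idx]     # idx is a label; it equals the position only with a default RangeIndex
```

Because the mask is `<= 1`, the 1 at index 13 joins the short zero run, giving it length 4; the trailing run has length 10, so `idx` is 18. The label-versus-position caveat is presumably why the loop above calls `reset_index(drop=True)` before slicing with `iloc`.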