Data frame segmentation and dropping

I have the following DataFrame in pandas:

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90], 
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

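I want to create a new column whose values come from column A, depending on a condition on column B. The condition is: if there is no 'txt' between two consecutive 'BW', I keep those values in column C. But if there is a 'txt' between two consecutive 'BW', I want to drop all of those values. So the expected output should look like this: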

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90], 
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
C = [1,10,23, BW, 24,24,55, BW, nan, nan, nan, nan, nan, nan, BW, 43,BW]

I don't know how to do this. Any help is much appreciated.

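I don't know if this is the most efficient way, but you can create a new column named mask by mapping the values in column B as follows: 'BW' to True, 'txt' to False, and all other values to np.nan.

Then, if you forward fill the NaNs in mask, backward fill the NaNs in mask, and logically combine the results (a row is flagged whenever either the forward fill or the backward fill is False), you can create a column named final_mask in which every value between consecutive BWs that enclose a txt is set to True.

You can then use .apply to select the value of column A only when final_mask is False and column B isn't 'BW', the value of column B when final_mask is False and column B is 'BW', and np.nan otherwise.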

import numpy as np
import pandas as pd

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, 'BW',49,59,72, 'BW',9,183,17, 'txt',2,49,'BW',479,'BW']
df = pd.DataFrame({'A':A,'B':B})

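# Marker column: True for 'BW', False for 'txt', NaN for everything else.
df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)
# Propagate the nearest marker forwards and backwards
# (fillna(method=...) is deprecated in recent pandas versions).
df["ffill"] = df["mask"].ffill()
df["bfill"] = df["mask"].bfill()
# A row is dropped when the nearest marker on either side is 'txt'.
df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)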

df["C"] = df.apply(lambda x: x['A'] if (
    (x['final_mask'] == False) & (x['B'] != 'BW')) 
    else x['B'] if ((x['final_mask'] == False) & (x['B'] == 'BW')) 
    else np.nan, axis=1
)

>>> df
     A    B   mask  ffill  bfill  final_mask    C
0    1   24    NaN    NaN   True       False    1
1   10   23    NaN    NaN   True       False   10
2   23   29    NaN    NaN   True       False   23
3   45   BW   True   True   True       False   BW
4   24   49    NaN   True   True       False   24
5   24   59    NaN   True   True       False   24
6   55   72    NaN   True   True       False   55
7   67   BW   True   True   True       False   BW
8   73    9    NaN   True  False        True  NaN
9   26  183    NaN   True  False        True  NaN
10  13   17    NaN   True  False        True  NaN
11  96  txt  False  False  False        True  NaN
12  53    2    NaN  False   True        True  NaN
13  23   49    NaN  False   True        True  NaN
14  24   BW   True   True   True       False   BW
15  43  479    NaN   True   True       False   43
16  90   BW   True   True   True       False   BW

Drop the columns we created along the way:

df.drop(columns=['mask','ffill','bfill','final_mask'])

     A    B    C
0    1   24    1
1   10   23   10
2   23   29   23
3   45   BW   BW
4   24   49   24
5   24   59   24
6   55   72   55
7   67   BW   BW
8   73    9  NaN
9   26  183  NaN
10  13   17  NaN
11  96  txt  NaN
12  53    2  NaN
13  23   49  NaN
14  24   BW   BW
15  43  479   43
16  90   BW   BW
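
The row-wise .apply above can also be written with column-level operations. The following is only a sketch of the same mask idea, not the answer's own code: the helper names marker, drop, is_bw and is_txt are made up for illustration, and the marker uses 1.0/0.0 instead of True/False so that the forward and backward fills stay on a float column.

```python
import numpy as np
import pandas as pd

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, 'BW', 49, 59, 72, 'BW', 9, 183, 17, 'txt', 2, 49, 'BW', 479, 'BW']
df = pd.DataFrame({'A': A, 'B': B})

is_bw = df['B'].eq('BW')
is_txt = df['B'].eq('txt')

# Marker (hypothetical name): 1.0 at 'BW', 0.0 at 'txt', NaN everywhere else.
marker = pd.Series(np.select([is_bw, is_txt], [1.0, 0.0], default=np.nan),
                   index=df.index)

# A row is dropped when the nearest marker before it or after it is 'txt'.
drop = marker.ffill().eq(0) | marker.bfill().eq(0)

# Start from A, write 'BW' on the separator rows, then blank the dropped rows.
df['C'] = df['A'].astype(object).where(~is_bw, 'BW').mask(drop)
```

On the sample data this should give the same column C as the .apply version; whether it is noticeably faster depends on the size of the frame.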

Edit:

Updated the answer, which was missing the BW values in the final df.

import pandas as pd
import numpy as np

BW = 999
txt = -999
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

df = pd.DataFrame({'A': A, 'B': B})
df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
df['C'] = df['C'].astype('Int64')
df = df.drop('group', axis=1)
In [435]: df
Out[435]: 
     A    B     C
0    1   24     1
1   10   23    10
2   23   29    23
3   45  999   999 <-- BW
4   24   49    24
5   24   59    24
6   55   72    55
7   67  999   999 <-- BW
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA> <-- txt is in the middle of BW
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999 <-- BW
15  43  479    43
16  90  999   999 <-- BW

You can do it like this. Assuming BW and txt are specific values, I just filled them with arbitrary numbers to tell them apart.

In [277]: BW = 999

In [278]: txt = -999

In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
     ...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

In [300]: df = pd.DataFrame({'A': A, 'B': B})

In [301]: df
Out[301]: 
     A    B
0    1   24
1   10   23
2   23   29
3   45  999
4   24   49
5   24   59
6   55   72
7   67  999
8   73    9
9   26  183
10  13   17
11  96 -999
12  53    2
13  23   49
14  24  999
15  43  479
16  90  999

First, let's split the values into groups. Here I split them into unique groups, where each group contains the values of B between one BW and the next BW.

In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())

In [322]: df
Out[322]: 
     A    B      group
0    1   24 0.00000000
1   10   23 0.00000000
2   23   29 0.00000000
3   45  999        NaN
4   24   49 1.00000000
5   24   59 1.00000000
6   55   72 1.00000000
7   67  999        NaN
8   73    9 2.00000000
9   26  183 2.00000000
10  13   17 2.00000000
11  96 -999 2.00000000
12  53    2 2.00000000
13  23   49 2.00000000
14  24  999        NaN
15  43  479 3.00000000
16  90  999        NaN
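
The one-line group expression packs three steps together. As a reading aid, here is a sketch that spells them out on the same sample data; the intermediate names non_bw_index, starts_new_segment and alt are hypothetical and only serve the explanation.

```python
import pandas as pd

BW, txt = 999, -999
A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, BW, 49, 59, 72, BW, 9, 183, 17, txt, 2, 49, BW, 479, BW]
df = pd.DataFrame({'A': A, 'B': B})

# Step 1: index labels of the rows that are not BW separators.
non_bw_index = df.index[df['B'] != BW].to_series()

# Step 2: a jump larger than 1 between neighbouring labels means a BW row
# was skipped, so a new segment starts there.
starts_new_segment = non_bw_index.diff() > 1

# Step 3: the running count of segment starts is the segment id.
group = starts_new_segment.cumsum()

# An alternative that gives the same ids on this data: count the BW rows
# seen so far and keep only the non-BW rows.
alt = (df['B'] == BW).cumsum()[df['B'] != BW]
```

Because group only has entries for the non-BW rows, assigning it back to df leaves NaN on the BW rows, which is why the column shows up as float in the output above.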

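Next, using np.where() we can replace the values according to the condition you set. Note that this step fills C from column B; to take the values from column A, as the question asks, pass df.A instead of df.B.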

In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.B)

In [432]: df
Out[432]: 
     A    B      group            C
0    1   24 0.00000000  24.00000000
1   10   23 0.00000000  23.00000000
2   23   29 0.00000000  29.00000000
3   45  999        NaN 999.00000000
4   24   49 1.00000000  49.00000000
5   24   59 1.00000000  59.00000000
6   55   72 1.00000000  72.00000000
7   67  999        NaN 999.00000000
8   73    9 2.00000000          NaN
9   26  183 2.00000000          NaN
10  13   17 2.00000000          NaN
11  96 -999 2.00000000          NaN
12  53    2 2.00000000          NaN
13  23   49 2.00000000          NaN
14  24  999        NaN 999.00000000
15  43  479 3.00000000 479.00000000
16  90  999        NaN 999.00000000
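
The expression df[df.B == txt].group.values[0] only looks at the group of the first txt, and it raises an IndexError when no txt is present. If txt may appear in several segments, which is an assumption that goes beyond the sample in the question, one possible variant collects all affected groups and tests membership with isin. The second txt in B below is a made-up variation of the sample data, and bad_groups is a hypothetical name.

```python
import numpy as np
import pandas as pd

BW, txt = 999, -999
A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
# Hypothetical variation of the sample: an extra txt at position 1.
B = [24, txt, 29, BW, 49, 59, 72, BW, 9, 183, 17, txt, 2, 49, BW, 479, BW]
df = pd.DataFrame({'A': A, 'B': B})

is_bw = df['B'] == BW
# Same grouping expression as in the walkthrough.
df = df.assign(group=(df[~is_bw].index.to_series().diff() > 1).cumsum())

# Every group that contains at least one txt (empty if there is none).
bad_groups = df.loc[df['B'] == txt, 'group'].unique()

# Blank all rows of those groups, take A elsewhere, then restore the BW rows.
df['C'] = np.where(df['group'].isin(bad_groups), np.nan, df['A'])
df['C'] = np.where(is_bw, df['B'], df['C'])
```

Here both the first segment and the segment between the second and third BW are blanked, while the BW rows themselves keep their value.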

Here we need to set C back to the value of B wherever B equals BW.

In [488]: df['C'] = np.where(df['B'] == BW, df['B'], df['C'])

In [489]: df
Out[489]: 
     A    B      group            C
0    1   24 0.00000000  24.00000000
1   10   23 0.00000000  23.00000000
2   23   29 0.00000000  29.00000000
3   45  999        NaN 999.00000000
4   24   49 1.00000000  49.00000000
5   24   59 1.00000000  59.00000000
6   55   72 1.00000000  72.00000000
7   67  999        NaN 999.00000000
8   73    9 2.00000000          NaN
9   26  183 2.00000000          NaN
10  13   17 2.00000000          NaN
11  96 -999 2.00000000          NaN
12  53    2 2.00000000          NaN
13  23   49 2.00000000          NaN
14  24  999        NaN 999.00000000
15  43  479 3.00000000 479.00000000
16  90  999        NaN 999.00000000

Finally, just convert the float column to int and drop the group column that we no longer need. If you want to keep the NaN values as np.nan, skip the conversion to Int64.

In [396]: df.C = df.C.astype('Int64')

In [397]: df
Out[397]: 
     A    B      group     C
0    1   24 0.00000000    24
1   10   23 0.00000000    23
2   23   29 0.00000000    29
3   45  999        NaN   999
4   24   49 1.00000000    49
5   24   59 1.00000000    59
6   55   72 1.00000000    72
7   67  999        NaN   999
8   73    9 2.00000000  <NA>
9   26  183 2.00000000  <NA>
10  13   17 2.00000000  <NA>
11  96 -999 2.00000000  <NA>
12  53    2 2.00000000  <NA>
13  23   49 2.00000000  <NA>
14  24  999        NaN   999
15  43  479 3.00000000   479
16  90  999        NaN   999
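
To see what the conversion does in isolation, here is a tiny sketch on a made-up Series (the values are illustrative, not taken from df). The capitalised 'Int64' is the pandas nullable integer dtype, which can hold missing values; the plain NumPy int64 dtype cannot.

```python
import numpy as np
import pandas as pd

# Illustrative float column with one missing value.
s = pd.Series([24.0, np.nan, 999.0])

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>.
converted = s.astype('Int64')
```

The conversion only works like this when the non-missing floats are whole numbers, which is the case for column C here.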

In [398]: df = df.drop('group', axis=1)

In [435]: df
Out[435]: 
     A    B     C
0    1   24    24
1   10   23    23
2   23   29    29
3   45  999   999
4   24   49    49
5   24   59    59
6   55   72    72
7   67  999   999
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA>
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999
15  43  479   479
16  90  999   999
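
If you would rather keep the original string labels than swap in 999 and -999, a similar result can be sketched with groupby and transform. This is a different technique from the np.where steps above, shown under the assumption that B holds the literal strings 'BW' and 'txt'; the names segment and has_txt are hypothetical.

```python
import pandas as pd

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, 'BW', 49, 59, 72, 'BW', 9, 183, 17, 'txt', 2, 49, 'BW', 479, 'BW']
df = pd.DataFrame({'A': A, 'B': B})

is_bw = df['B'].eq('BW')
is_txt = df['B'].eq('txt')

# Segment id that increases by one at every BW row, so each BW row is
# counted with the segment that follows it.
segment = is_bw.cumsum()

# True for every row of a segment that contains at least one txt.
has_txt = is_txt.groupby(segment).transform('any')

# Take A, blank the segments with a txt, then write 'BW' on the separator
# rows last so they survive the blanking.
df['C'] = df['A'].astype(object).mask(has_txt).mask(is_bw, 'BW')
```

This handles any number of txt occurrences, and the rows before the first BW and after the last BW are treated as segments of their own, as in the grouping above.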