Adding NaN where regex doesn't match

Question

import pandas as pd
df = pd.DataFrame({'Date': ['nothing ',
                            'This 1A1619 A124 person BL171111 the A-1-24 and ',
                            'dont Z112 but NOT 12-24-1981',
                            'nada here either',
                            'mix: 1A25629Q88 or A13B ok A1 the A16'],
                   'IDs': ['A11', 'B22', 'C33', 'D44', 'E55'],
                   })

This is a follow-up to, and a variation on, "pulling mixed letters and numbers". Using this code

pat = r'((?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S))'
df['Date'].str.extractall(pat)

gives me

        0
   match    
1   0   1A1619
    1   A124
    2   BL171111
2   0   Z112
4   0   1A25629Q88
    1   A13B
    2   A1
    3   A16

I would like a NaN to be added for every row where the regex doesn't match. So I would like something like this

        0
   match    
0   NaN
1   0   1A1619
    1   A124
    2   BL171111
2   0   Z112
3   NaN
4   0   1A25629Q88
    1   A13B
    2   A1
    3   A16

How would I modify my code to do this?

Answer 1

Given that s is the return value of df['Date'].str.extractall(pat), we can:

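import numpy as np
# row labels of df that produced no match at all
i = df.index.difference(s.index.get_level_values(0))
# one NaN row per unmatched label, at match number 0
o = pd.DataFrame({0: np.nan}, index=[i, [0]*len(i)])
adjust = lambda s,o: pd.concat([s, o]).sort_index()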

Then

>>> adjust(s,o)

                  0
  match            
0 0             NaN
1 0          1A1619
  1            A124
  2        BL171111
2 0            Z112
3 0             NaN
4 0      1A25629Q88
  1            A13B
  2              A1
  3             A16
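
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea (index difference, then concat and sort), using the data and pattern from the question. The helper name extractall_with_nan is hypothetical and not part of pandas. It also builds the filler MultiIndex explicitly and reuses the level names from the extractall result, which is an assumption on my part about what is wanted for the 'match' label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['nothing ',
             'This 1A1619 A124 person BL171111 the A-1-24 and ',
             'dont Z112 but NOT 12-24-1981',
             'nada here either',
             'mix: 1A25629Q88 or A13B ok A1 the A16'],
    'IDs': ['A11', 'B22', 'C33', 'D44', 'E55'],
})

pat = r'((?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S))'


def extractall_with_nan(series, pattern):
    """Hypothetical helper: extractall plus one NaN row per unmatched label."""
    s = series.str.extractall(pattern)
    # Row labels of `series` that produced no match at all.
    missing = series.index.difference(s.index.get_level_values(0))
    # One placeholder row per missing label, at match number 0,
    # reusing the level names of the extractall result.
    filler_index = pd.MultiIndex.from_arrays(
        [missing, np.zeros(len(missing), dtype=int)],
        names=s.index.names,
    )
    filler = pd.DataFrame(np.nan, index=filler_index,
                          columns=s.columns, dtype=object)
    # Append the placeholders and restore the original row order.
    return pd.concat([s, filler]).sort_index()


result = extractall_with_nan(df['Date'], pat)
print(result)
```

The printed frame should have the same shape as the output above: rows 0 and 3 each get a single NaN entry at match 0, and the matched rows are unchanged.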

