How to add an index that increments on groups of rows of text not including NaT

Question

I have a dataframe with a column of codes that contains consecutive rows of text followed by consecutive rows of missing values (NaN).

  codes
  FKW
  FCJ
  XQ8
  1L9
  NaN
  NaN
  PNU
  LIT
  NaN
  422

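A run of alphanumeric codes followed by a run of missing values (NaN) forms a cycle. I want to add a cycle index column (index) that increments when the next cycle starts. The next cycle starts when a missing value (NaN) is followed by a code (an alphanumeric value).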

codes   index
FKW     1
FCJ     1
XQ8     1
1L9     1
NaN     1
NaN     1
PNU     2   next group starts here
LIT     2
NaN     2
422     3   next group starts here

Here is the code that generates the example above:

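    import random
    import string

    import numpy as np
    import pandas as pd

    def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
        # Build a random alphanumeric code of the given length
        return ''.join(random.choice(chars) for _ in range(size))

    num_rows = 10
    data = np.array([id_generator() for i in range(num_rows)])
    df = pd.DataFrame(data, columns=['codes'])
    df.loc[[4, 5, 8], 'codes'] = np.nan  # blank out the rows that end each cycle
    print('what I have')
    print(df)
    print('what I want')
    df['index'] = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]
    print(df)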

How can I generate the index column?

Answer 1

The simplest approach I can think of is to iterate over the rows of the dataframe and keep track of whether the last value was NaN.

index = []
index_counter = 1
last_was_NaN = False
for row in df.itertuples():
    if pd.isna(row[1]):  # row[0] is the pandas index, so row[1] is the first data column
        last_was_NaN = True
    elif last_was_NaN:  # if we have text now, we can store that and increase the counter
        last_was_NaN = False
        index_counter += 1
    index.append(index_counter)  # don't forget to add the calculated index
df['index'] = index
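
The same loop can also be wrapped in a small function, which makes it easy to check against the expected output from the question. This is only a minimal sketch: the function name `cycle_index` is hypothetical, `pd.isna` is used for the missing-value test, and the example column is rebuilt by hand with the values shown in the question.

```python
import numpy as np
import pandas as pd


def cycle_index(values):
    """Return one cycle number per value (hypothetical helper, not from the answer)."""
    result = []
    counter = 1
    last_was_nan = False
    for value in values:
        if pd.isna(value):
            last_was_nan = True       # we are in the NaN tail of the current cycle
        elif last_was_nan:
            last_was_nan = False      # a code right after NaN starts the next cycle
            counter += 1
        result.append(counter)
    return result


# Example column from the question, rebuilt by hand.
codes = pd.Series(["FKW", "FCJ", "XQ8", "1L9", np.nan, np.nan,
                   "PNU", "LIT", np.nan, "422"])
print(cycle_index(codes))
```

Assigning the returned list to `df['index']` gives the same column as the loop above.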

Answer 2

Try this:

s = df.codes.notna()
df['index'] = (s & ~(s.shift(fill_value=False))).cumsum()

Out[718]:
  codes  index
0   FKW      1
1   FCJ      1
2   XQ8      1
3   1L9      1
4   NaN      1
5   NaN      1
6   PNU      2
7   LIT      2
8   NaN      2
9   422      3
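
To see why this works, it can help to look at the intermediate Series one step at a time. The sketch below rebuilds the example column by hand and splits the one-liner into named steps; the names `has_code`, `prev_has_code` and `starts_cycle` are only illustrative and are not part of the answer. A row starts a new cycle exactly when it holds a code and the row before it does not, and the cumulative sum of those start flags is the cycle index.

```python
import numpy as np
import pandas as pd

# Example column from the question, rebuilt by hand.
df = pd.DataFrame({"codes": ["FKW", "FCJ", "XQ8", "1L9", np.nan, np.nan,
                             "PNU", "LIT", np.nan, "422"]})

has_code = df["codes"].notna()                     # True where the row holds a code
prev_has_code = has_code.shift(fill_value=False)   # the same flag for the previous row
starts_cycle = has_code & ~prev_has_code           # code preceded by NaN, or by nothing

df["index"] = starts_cycle.cumsum()                # running count of cycle starts
print(df.assign(starts_cycle=starts_cycle))
```

One thing to be aware of: if the column begins with NaN rather than a code, those leading rows get index 0, because no cycle has started yet when the cumulative sum reaches them.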


increment

pandas