Reset a group of identifiers to a sequence of consecutive serial numbers in a Pandas dataframe column

Question

I have generated three outputs from a dataframe, and I am trying to reset the sentence identifiers (Sentence_ID) so that they start from 1 in each output.

Example output:

Sentence_ID  Mention Tag
6388    Chailland   B-LOCATION
6388    ,   O
6388    Mayenne B-LOCATION

6389    poste   O
6389    de  O
6389    Goumois B-LOCATION
6389    (   I-LOCATION
6389    Doubs   I-LOCATION
6389    )   I-LOCATION
6389    .   O
        
6390    Pichet  B-PERSON
6390    (   O
6390    veuve   O
6390    )   O
6390    ,   O
6390    de  O
6390    Paris   B-LOCATION
6390    .   O
... continue

Expected output:

Sentence_ID  Mention Tag
1   Chailland   B-LOCATION
1   ,   O
1   Mayenne B-LOCATION

2   poste   O
2   de  O
2   Goumois B-LOCATION
2   (   I-LOCATION
2   Doubs   I-LOCATION
2   )   I-LOCATION
2   .   O
        
3   Pichet  B-PERSON
3   (   O
3   veuve   O
3   )   O
3   ,   O
3   de  O
3   Paris   B-LOCATION
3   .   O
... continue

I must be missing something, but I am not sure whether I should apply a counter on the Sentence_ID column (via groupby()) or use reset_index to accomplish this.

If anyone has a clue, thanks in advance.

Answer 1

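You can create a dictionary whose keys are the old IDs and whose values are the new IDs, then use it to map the Sentence_ID column to its new values: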

mapping = dict(zip(df["Sentence_ID"].unique(), range(1, df["Sentence_ID"].nunique() +1)))
df["Sentence_ID"] = df["Sentence_ID"].map(mapping)
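
As a self-contained sketch of the mapping approach, the snippet below rebuilds a small part of the question's sample data (the frame `df` here is typed in by hand for illustration and is not the asker's real data) and uses `enumerate` as an equivalent way to build the dictionary.

```python
import pandas as pd

# Small hand-built sample mirroring the question's data (an assumption for illustration).
df = pd.DataFrame({
    "Sentence_ID": [6388, 6388, 6388, 6389, 6389, 6390],
    "Mention": ["Chailland", ",", "Mayenne", "poste", "de", "Pichet"],
    "Tag": ["B-LOCATION", "O", "B-LOCATION", "O", "O", "B-PERSON"],
})

# unique() returns the IDs in order of first appearance,
# so enumerate(..., start=1) numbers the sentences 1, 2, 3, ...
mapping = {old: new for new, old in enumerate(df["Sentence_ID"].unique(), start=1)}
df["Sentence_ID"] = df["Sentence_ID"].map(mapping)

print(df["Sentence_ID"].tolist())
```

Because the numbering follows the order of first appearance, the original row order is preserved and only the ID values change.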

Answer 2

You can use pd.factorize to generate a new set of sequence numbers, as follows:

df['Sentence_ID'] = pd.factorize(df['Sentence_ID'])[0] + 1

Or use Series.factorize:

df['Sentence_ID'] = df['Sentence_ID'].factorize()[0] + 1

Result:

print(df)


    Sentence_ID    Mention         Tag
0             1  Chailland  B-LOCATION
1             1          ,           O
2             1    Mayenne  B-LOCATION
3             2      poste           O
4             2         de           O
5             2    Goumois  B-LOCATION
6             2          (  I-LOCATION
7             2      Doubs  I-LOCATION
8             2          )  I-LOCATION
9             2          .           O
10            3     Pichet    B-PERSON
11            3          (           O
12            3      veuve           O
13            3          )           O
14            3          ,           O
15            3         de           O
16            3      Paris  B-LOCATION
17            3          .           O
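
Since the question asks about `groupby()`, one possible alternative is `GroupBy.ngroup()`, which numbers each group starting from 0. The sketch below wraps it in a hypothetical helper, `renumber_sentences` (not part of pandas), so it can be applied to each output separately. The two small frames are made-up stand-ins for those outputs. Passing `sort=False` keeps the groups numbered in order of first appearance, which matches what `factorize` does.

```python
import pandas as pd

def renumber_sentences(frame, column="Sentence_ID"):
    """Return a copy of frame with `column` renumbered 1, 2, 3, ... (hypothetical helper)."""
    out = frame.copy()
    # sort=False numbers the groups in order of first appearance, as factorize does.
    out[column] = out.groupby(column, sort=False).ngroup() + 1
    return out

# Made-up stand-ins for two of the outputs mentioned in the question.
first = pd.DataFrame({"Sentence_ID": [6388, 6388, 6389], "Mention": ["Chailland", ",", "poste"]})
second = pd.DataFrame({"Sentence_ID": [7001, 7002, 7002], "Mention": ["Pichet", "de", "Paris"]})

# Each frame restarts at 1 because the helper is applied to it independently.
first, second = (renumber_sentences(f) for f in (first, second))
print(first["Sentence_ID"].tolist(), second["Sentence_ID"].tolist())
```

The helper returns a copy rather than modifying its argument, so the original outputs stay available for comparison.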

python

nlp

python-3.x

pandas