Flag strings based on previous values in pandas

Question

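I want to label sentences held in a pandas DataFrame. As you can see in the example, some sentences are split across several rows (these are subtitles from an srt file that I eventually want to translate into another language, but first I need to get each sentence into a single cell). The end of a sentence is marked by a period. I want to create a column like the sentence_number column below, in which each sentence gets a number (it doesn't have to be a string; a plain number is fine too).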

import pandas as pd

values=[
        ['This is an example of subtitle.','sentence_1'],
        ['I want to group by sentences, which','sentence_2'],
        ['the end is determined by a period.','sentence_2'],
        ['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
        ['should have sentence_2. and this','sentence_3'],
        ['last row should have sentence_3.','sentence_4']
        ]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
# True where the row contains a period (raw string, so \. matches a literal dot)
df['presence_of_period']=df.subtitle.str.contains(r'\.')
df

Output:

    subtitle                                         sentence_number    presence_of_period
0   This is an example of subtitle.                  sentence_1         True
1   I want to group by sentences, which              sentence_2         False
2   the end is determined by a period.               sentence_2         True
3   row 0 should have sentece_1, rows 1 and 2        sentence_3         False
4   should have sentence_2. and this                 sentence_3         True
5   last row should have sentence_3.                 sentence_4         True

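How can I create the sentence_number column, given that it has to read the previous cells of the subtitle column? I was thinking of a window function or shift(), but couldn't figure out how to make it work. I added a column showing whether a cell contains a period, which marks the end of a sentence. Also, if possible, I would like to move "and this" from row 4 to the beginning of row 5, since it starts a new sentence (not sure whether that needs a separate question).

Any ideas?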

Answer 1

Here is one option for working out the sentence numbers.

import pandas as pd

values=[
        ['This is an example of subtitle.','sentence_1'],
        ['I want to group by sentences, which','sentence_2'],
        ['the end is determined by a period.','sentence_2'],
        ['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
        ['should have sentence_2.','sentence_3'],
        ['and this last row should have sentence_3.','sentence_4']
        ]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])

# Number of periods in each row (raw string, so \. matches a literal dot)
df['presence_of_period']=df.subtitle.str.count(r'\.')
# 1 if the row ends with a period, 0 otherwise
df['end'] = df.subtitle.str.endswith('.').astype(int)
# The running total of periods counts the sentences closed so far, this row included.
# Subtracting 'end' keeps a row that closes a sentence inside that sentence,
# and adding 1 makes the numbering start at 1.
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)

# Drop the helper columns once the numbering is done
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])

The output is as follows:

                                     subtitle  sentence_#
0             This is an example of subtitle.  sentence_1
1         I want to group by sentences, which  sentence_2
2          the end is determined by a period.  sentence_2
3  row 0 should have sentece_1, rows 1 and 2   sentence_3
4                     should have sentence_2.  sentence_3
5   and this last row should have sentence_3.  sentence_4
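
The question also mentions shift(). As a minimal sketch of that idea, assuming every sentence ends at the end of a row (no period in the middle of a row), a row starts a new sentence exactly when the previous row ended with a period. The names ends_sentence, sentence_id and sentences are illustrative and do not come from the original post. The last step joins the rows of each sentence into one cell, which is what the question says is needed before translation.

```python
import pandas as pd

df = pd.DataFrame({
    'subtitle': [
        'This is an example of subtitle.',
        'I want to group by sentences, which',
        'the end is determined by a period.',
        'row 0 should have sentece_1, rows 1 and 2 ',
        'should have sentence_2.',
        'and this last row should have sentence_3.',
    ]
})

# True where a row closes a sentence (trailing whitespace is ignored).
ends_sentence = df['subtitle'].str.rstrip().str.endswith('.')

# A row starts a new sentence when the previous row closed one, so shift the
# flags down by one row and take a running total. Adding 1 starts numbering at 1.
df['sentence_id'] = ends_sentence.shift(fill_value=False).cumsum() + 1

# Join the rows that belong to the same sentence into a single cell.
sentences = (
    df.groupby('sentence_id')['subtitle']
      .agg(lambda parts: ' '.join(p.strip() for p in parts))
      .reset_index()
)
print(sentences)
```

This sketch does not handle a row that contains a period in the middle, such as 'should have sentence_2. and this'; such rows have to be split first.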

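If you need to move part of a sentence to the next row, I need more details.

What do you want to do when a single row holds more than two sentences? For example, 'This is first sentence. This second. This is'.

Should the first sentence go on one row, the second on another, and the third fragment be moved to the next row of data?

Once I understand that, we can use df.explode() to solve it.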
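
One possible reading of that idea, offered only as a sketch and not necessarily what the answer had in mind: join all subtitle rows into one text, split it after every period that is followed by whitespace, and use explode() to get one sentence per row. It assumes that a period always ends a sentence, so abbreviations or ellipses would be split incorrectly, and it discards the original row boundaries (and therefore any subtitle timing attached to them). The names lines, full_text and out are illustrative.

```python
import pandas as pd

# Data as shown in the question's output, with a period in the middle of a row.
lines = pd.Series([
    'This is an example of subtitle.',
    'I want to group by sentences, which',
    'the end is determined by a period.',
    'row 0 should have sentece_1, rows 1 and 2 ',
    'should have sentence_2. and this',
    'last row should have sentence_3.',
])

# Join every row into one text, trimming stray whitespace first.
full_text = ' '.join(lines.str.strip())

# Split after each period followed by whitespace (assumption: '.' ends a sentence).
# The lookbehind keeps the period attached to the sentence it closes.
out = pd.DataFrame({'subtitle': [full_text]})
out['subtitle'] = out['subtitle'].str.split(r'(?<=\.)\s+')

# explode() turns the list in the single cell into one row per sentence.
out = out.explode('subtitle').reset_index(drop=True)
out['sentence_number'] = range(1, len(out) + 1)
print(out)
```

With this layout every sentence already sits in one cell, so "and this" ends up at the start of the last sentence without a separate step.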
