Flag strings based on previous values in pandas
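I want to label the sentences in a pandas DataFrame. As you can see in the example, some sentences are split across several rows (these are subtitles from an srt file that I eventually want to translate into another language, but first I need to put each sentence in a single cell). The end of a sentence is marked by a period. I want to create a column like sentence_number, in which I number each sentence (it does not have to be a string; a number is fine too).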
import pandas as pd

values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2. and this','sentence_3'],
['last row should have sentence_3.','sentence_4']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
# True if the row contains a period anywhere (raw string avoids an invalid-escape warning)
df['presence_of_period']=df.subtitle.str.contains(r'\.')
df
Output:
   subtitle                                   sentence_number  presence_of_period
0  This is an example of subtitle.            sentence_1       True
1  I want to group by sentences, which        sentence_2       False
2  the end is determined by a period.         sentence_2       True
3  row 0 should have sentece_1, rows 1 and 2  sentence_3       False
4  should have sentence_2. and this           sentence_3       True
5  last row should have sentence_3.           sentence_4       True
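How can I create the sentence_number column, given that it has to read previous cells of the subtitle column? I was thinking of a window function or shift() but could not figure out how to make it work. I added a column showing whether a cell contains a period, which marks the end of a sentence. Also, if possible, I would like to move "and this" from row 4 to the start of row 5, since it begins a new sentence (not sure whether that needs a separate question).
Any ideas?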
To handle the periods, here is one option.
import pandas as pd
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
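# number of periods in each row (raw string avoids an invalid-escape warning)
df['presence_of_period']=df.subtitle.str.count(r'\.')
# 1 if the row ends with a period, i.e. the row closes a sentence
df['end'] = df.subtitle.str.endswith('.').astype(int)
# periods seen so far, not counting one that closes this row, plus one
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
#print (df['subtitle'])
#print (df[['sentence_number','presence_of_period','end','sentence_#']])
# drop the helper columns
df.drop(['presence_of_period','end'],axis=1, inplace=True)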
print (df[['subtitle','sentence_#']])
The output is as follows:
   subtitle                                   sentence_#
0  This is an example of subtitle.            sentence_1
1  I want to group by sentences, which        sentence_2
2  the end is determined by a period.         sentence_2
3  row 0 should have sentece_1, rows 1 and 2  sentence_3
4  should have sentence_2.                    sentence_3
5  and this last row should have sentence_3.  sentence_4
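Since the question mentions shift(), the same numbering can also be sketched that way: a row opens a new sentence when the previous row closed one. This is only a sketch and assumes every sentence boundary falls at the end of a row; the names sentence_id and joined are made up for illustration. It then merges the rows of each sentence into one cell, which is the stated end goal.

```python
import pandas as pd

df = pd.DataFrame({'subtitle': [
    'This is an example of subtitle.',
    'I want to group by sentences, which',
    'the end is determined by a period.',
    'row 0 should have sentece_1, rows 1 and 2 ',
    'should have sentence_2.',
    'and this last row should have sentence_3.',
]})

# True where the row closes a sentence (trailing whitespace ignored).
ends = df['subtitle'].str.rstrip().str.endswith('.')

# A row opens a new sentence when the previous row closed one;
# the first row always opens one, hence fill_value=True.
starts = ends.shift(fill_value=True)

# Running count of sentence starts = sentence number (hypothetical column name).
df['sentence_id'] = starts.astype(int).cumsum()

# Merge the rows of each sentence into a single cell.
joined = df.groupby('sentence_id')['subtitle'].agg(
    lambda parts: ' '.join(p.strip() for p in parts)
)
print(joined)
```

Unlike the count-based version, this one ignores periods in the middle of a row, so it would not cope with the "and this" case from the question.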
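If you need to move part of a sentence to the next row, I need more details.
What do you want to do when a single row contains more than two sentences? For example, 'This is first sentence. This second. This is'.
What should happen in that case? Split the first sentence into one row, the second into another, and move the third into the next row of data?
Once I understand that, we can use df.explode() to solve it.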
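Until that is clarified, one possible reading is that every sentence fragment should get its own row. Here is a rough sketch under that assumption. It also assumes that a period followed by whitespace always marks a sentence boundary (so an abbreviation such as 'e.g. this' would be split wrongly). The sample strings and the column names fragment and sentence_id are illustrative, not taken from the question.

```python
import re

import pandas as pd

df = pd.DataFrame({'subtitle': [
    'This is first sentence. This second. This is',
    'the third one.',
]})

# Split after every period that is followed by whitespace.
# The lookbehind keeps the period attached to its sentence.
df['fragment'] = df['subtitle'].map(lambda s: re.split(r'(?<=\.)\s+', s.strip()))

# One row per fragment.
out = df.explode('fragment').reset_index(drop=True)

# Number the sentences: a fragment opens a new sentence
# when the previous fragment ended with a period.
ends = out['fragment'].str.endswith('.')
out['sentence_id'] = ends.shift(fill_value=True).astype(int).cumsum()

print(out[['fragment', 'sentence_id']])
```

After this step each fragment belongs to exactly one sentence, so grouping by sentence_id and joining the fragments would give one full sentence per cell.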