How do I merge strings/sentences from the same user using regex in Python?

I am trying to merge sentences in Python using a regex pattern so that my data ends up in an organized format, but I am finding it difficult.

The raw data I am trying to merge looks like this:

chat_segment = """
13:54:09: Hello, thank you for visiting. Can I help you in any way?
Visitor: 13:47:16: I want to book a ticket from Thailand to Hawai
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Sam: 13:48:18: Which class would you like the ticket in?
Visitor: 13:48:40: Business
Sam: 13:48:43: Give me a minute to check availability
Visitor: 13:48:55: ok
Sam: 13:49:41: Only one ticket available on 18th in business class.
Sam: 13:50:02: The ticket costs 500$.
Sam: 13:50:31: And the flight departs at 8 am"""

I am trying to convert it into the following format, with consecutive sentences from the same user merged together:

['Visitor: I want to book a ticket from Thailand to Hawai On 18th this month','Sam: Hi Which class would you like the ticket in?','Visitor: Business','Sam: Give me a minute to check availability','Visitor: ok','Sam: Only one ticket available on 18th in business class. The ticket costs 500$. And the flight departs at 8 am']

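Here is the code I tried. It splits the sentences on the timestamps, but I cannot figure out a way to merge the sentences from the same user. Note that the name "Visitor" is always the same, but the name "Sam" changes.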

import re

chat_tokenised = re.split(r"\d+:\d+:\d+:\s+", chat_segment)
print(chat_tokenised)

You can use itertools.groupby to group consecutive lines from the same user together:

chat_segment = """
13:54:09: Hello, thank you for visiting. Can I help you in any way?
Visitor: 13:47:16: I want to book a ticket from Thailand to Hawai
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Sam: 13:48:18: Which class would you like the ticket in?
Visitor: 13:48:40: Business
Sam: 13:48:43: Give me a minute to check availability
Visitor: 13:48:55: ok
Sam: 13:49:41: Only one ticket available on 18th in business class.
Sam: 13:50:02: The ticket costs 500$.
Sam: 13:50:31: And the flight departs at 8 am"""

import re
from itertools import groupby

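out = []
# Group consecutive lines by their leading "Name: " prefix.
# Lines without a name (the greeting, blank lines) get an empty key and are skipped.
for v, g in groupby(chat_segment.splitlines(), lambda k: re.findall(r'^\w+:\s', k)):
    if not v:
        continue
    # Strip the name and timestamp from each line, then join the remaining text.
    out.append(v[0] + ' '.join(re.findall(r'^\w+:\s[\d:]+:\s*(.*)', val)[0] for val in g))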

from pprint import pprint
pprint(out, width=120)

Prints:

['Visitor: I want to book a ticket from Thailand to Hawai On 18th this month',
 'Sam: Hi Which class would you like the ticket in?',
 'Visitor: Business',
 'Sam: Give me a minute to check availability',
 'Visitor: ok',
 'Sam: Only one ticket available on 18th in business class. The ticket costs 500$. And the flight departs at 8 am']
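
For reuse, the same idea can be wrapped in a function that parses each line only once. This is a minimal sketch, not the only way to do it: the names `LINE_RE` and `merge_by_user` are made up for illustration, and the pattern assumes every message line looks like "Name: HH:MM:SS: text" with a single-word name. Unlike the loop above, it filters out non-matching lines before grouping, so a stray line between two messages from the same user would not split them into two entries.

```python
import re
from itertools import groupby

# Hypothetical pattern: assumes "Name: HH:MM:SS: text" with a single-word name.
LINE_RE = re.compile(r'^(?P<user>\w+):\s+\d+:\d+:\d+:\s*(?P<text>.*)$')

def merge_by_user(chat):
    # Keep only lines that carry a speaker name; the greeting and blank lines do not match.
    matches = [m for m in map(LINE_RE.match, chat.splitlines()) if m]
    merged = []
    # groupby only groups consecutive items with the same key,
    # which is what separates one speaker's turn from the next.
    for user, group in groupby(matches, key=lambda m: m.group('user')):
        merged.append(f"{user}: " + ' '.join(m.group('text') for m in group))
    return merged

sample = """
13:54:09: Hello, thank you for visiting.
Visitor: 13:47:16: I want a ticket
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Visitor: 13:48:40: Business"""

print(merge_by_user(sample))
```

Because the speaker name is captured by the pattern rather than hard-coded, this should work no matter which agent name appears in place of "Sam".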