How do I merge strings/sentences from the same user using regex in Python?
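I am trying to merge sentences in Python using a regex pattern so that I can get my data into an organized format, but I am finding it difficult.
The raw data I am trying to merge looks like this:
chat_segment = """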
13:54:09: Hello, thank you for visiting. Can I help you in any way?
Visitor: 13:47:16: I want to book a ticket from Thailand to Hawai
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Sam: 13:48:18: Which class would you like the ticket in?
Visitor: 13:48:40: Business
Sam: 13:48:43: Give me a minute to check availability
Visitor: 13:48:55: ok
Sam: 13:49:41: Only one ticket available on 18th in business class.
Sam: 13:50:02: The ticket costs 500$.
Sam: 13:50:31: And the flight departs at 8 am"""
I am trying to convert it to the following format, with consecutive sentences from the same user merged together.
['Visitor: I want to book a ticket from Thailand to Hawai On 18th this month','Sam: Hi Which class would you like the ticket in?','Visitor: Business','Sam: Give me a minute to check availability','Visitor: ok','Sam: Only one ticket available on 18th in business class. The ticket costs 500$. And the flight departs at 8 am']
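This is the code I tried. It splits the sentences on the timestamp, but I can't figure out a way to merge the sentences from the same user. Note that the name "Visitor" is always the same, but the name "Sam" changes.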
import re
chat_tokenised = re.split(r"\d+:\d+:\d+:\s+", chat_segment)
print(chat_tokenised)
You can use itertools.groupby to group consecutive lines from the same speaker together:
chat_segment = """
13:54:09: Hello, thank you for visiting. Can I help you in any way?
Visitor: 13:47:16: I want to book a ticket from Thailand to Hawai
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Sam: 13:48:18: Which class would you like the ticket in?
Visitor: 13:48:40: Business
Sam: 13:48:43: Give me a minute to check availability
Visitor: 13:48:55: ok
Sam: 13:49:41: Only one ticket available on 18th in business class.
Sam: 13:50:02: The ticket costs 500$.
Sam: 13:50:31: And the flight departs at 8 am"""
import re
from itertools import groupby
out = []
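# Group consecutive lines by their leading "Name: " prefix (an empty list when there is none).
for v, g in groupby(chat_segment.splitlines(), lambda k: re.findall(r'^\w+:\s', k)):
    if not v:
        # Skip lines without a speaker prefix (the blank line and the greeting).
        continue
    # Strip the name and timestamp from each line, then join the messages with spaces.
    out.append(v[0] + ' '.join(re.findall(r'^\w+:\s[\d:]+:\s*(.*)', val)[0] for val in g))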
from pprint import pprint
pprint(out, width=120)
Prints:
['Visitor: I want to book a ticket from Thailand to Hawai On 18th this month',
'Sam: Hi Which class would you like the ticket in?',
'Visitor: Business',
'Sam: Give me a minute to check availability',
'Visitor: ok',
'Sam: Only one ticket available on 18th in business class. The ticket costs 500$. And the flight departs at 8 am']
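As a variation on the same idea, here is a minimal sketch that parses each line once into a (speaker, message) pair and then groups on the speaker, so the regex is not applied twice per line. The names LINE_RE and merge_by_speaker are made up for this illustration, the sample is shortened, and it assumes every speaker name is a single word (so that \w+ matches it) followed by an HH:MM:SS timestamp.

```python
import re
from itertools import groupby

# Shortened sample in the same format as the question.
chat_segment = """
13:54:09: Hello, thank you for visiting. Can I help you in any way?
Visitor: 13:47:16: I want to book a ticket from Thailand to Hawai
Visitor: 13:47:49: On 18th this month
Sam: 13:48:03: Hi
Sam: 13:48:18: Which class would you like the ticket in?
Visitor: 13:48:40: Business"""

# Hypothetical pattern: "<name>: <HH:MM:SS>: <message>"; assumes one-word names.
LINE_RE = re.compile(r'^(\w+):\s+\d+:\d+:\d+:\s*(.*)$')

def merge_by_speaker(text):
    """Merge consecutive messages that come from the same speaker."""
    matches = (LINE_RE.match(line) for line in text.splitlines())
    # Lines without a speaker prefix (blank line, greeting) do not match and are dropped.
    parsed = [m.groups() for m in matches if m]
    return [
        f"{speaker}: " + " ".join(message for _, message in group)
        for speaker, group in groupby(parsed, key=lambda pair: pair[0])
    ]

merged = merge_by_speaker(chat_segment)
print(merged)
```

Because groupby only merges adjacent items with equal keys, the conversation order is preserved and a speaker who talks again later gets a new entry. If names can contain spaces, the \w+ part of the pattern would need to be loosened.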