I hit a regex roadblock while exploring a dataset of exported WhatsApp chats
I am building a dataset from exported WhatsApp chats. To work with the data, I need to split each line of the chat log into date, time, sender and message columns.
import pandas as pd
import re
column_names = ["date", "time", "sender", "message"]
data = pd.read_table("datasets/WhatsApp Chat with Makay.txt", sep="re.split(', |- |:', data2)", names = column_names)
data.head()
Output: The entire line/string goes into the date column, while time, sender and message all return NaN values.
Here is a sample string:
string: '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.'
Expected output: ['04/10/2020', '12:34', 'Sender', 'Alright. This is the "message", with multiple de-limiters.']
I tried the following patterns:
re.split(', |-|:', string)
re.split('\d\d[- /.]\d\d[- /.]\d\d\d\d[- /.] *?|, *?|- *?|: ', string)
re.split('\d\d[- ./]\d\d[- ./]\d\d\d\d[- ./]|, |- |: ', string)
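But they all failed. Other similarly tagged questions don't seem to answer this specific problem. I also tried the regex101 web app but couldn't work out a solution. Any help?
Instead of splitting, you can match all the different parts with 4 capture groups.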
^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)
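^ : start of the string (with re.MULTILINE, the start of each line)
(\d{2}/\d{2}/\d{4}) : capture group 1, matches a date-like format (you can make this more specific)
", " : matches a comma followed by a space literally
(\d{2}:\d{2}) : capture group 2, matches a time-like format (this can also be made more specific)
" - " : matches space, hyphen, space literally
([^:]+) : capture group 3, matches any character except : one or more times
:\s* : matches : and optional whitespace characters
(.+) : capture group 4, matches any character except a newline one or more times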
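To see the four groups in action before touching the file, here is a minimal sketch that applies the pattern to the sample string from the question. The names `LINE_RE`, `sample` and `parts` are made up for this illustration.

```python
import re

# The pattern from the answer, compiled once. `LINE_RE` is a name chosen
# for this sketch only.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)")

# Sample line taken from the question.
sample = '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.'

# match() anchors at the start of the string; groups() returns the four
# captured parts in order: date, time, sender, message.
match = LINE_RE.match(sample)
parts = list(match.groups()) if match else []
print(parts)
# ['04/10/2020', '12:34', 'Sender', 'Alright. This is the "message", with multiple de-limiters.']

# A string that does not follow the expected layout gives no match at all.
no_match = LINE_RE.match("not a chat line")
```

Because only the first `: ` after the sender ends group 3, any further commas, colons or hyphens inside the message stay in group 4, which is what the split-based attempts could not guarantee.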
import pandas as pd
import re
column_names = ["date", "time", "sender", "message"]
with open('datasets/WhatsApp Chat with Makay.txt', 'r', encoding="utf8") as file:
    items = re.findall(
        r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)",
        file.read(),
        re.MULTILINE
    )
df = pd.DataFrame(items, columns=column_names)
pd.set_option('display.max_colwidth', None)
print(df)
Output
date time sender message
0 04/10/2020 12:34 Sender Alright. This is the "message", with multiple de-limiters.
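One thing to watch with `re.findall` over the whole file: `[^:]+` also matches newlines, so a line that has the date prefix but no `sender:` part can bleed into the following line. A hedged alternative is to go line by line. The sketch below assumes (this is not confirmed by the question) that the export may contain such prefix-only lines and messages that continue on lines without a date prefix; `PREFIX_RE`, `parse_chat` and the sample lines are hypothetical.

```python
import re

# Full pattern from the answer, plus a prefix-only pattern (an assumption of
# this sketch) to recognise dated lines that carry no "sender:" part.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)")
PREFIX_RE = re.compile(r"^\d{2}/\d{2}/\d{4}, \d{2}:\d{2} - ")


def parse_chat(lines):
    """Return a list of [date, time, sender, message] records."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            # A regular chat line: start a new record.
            records.append(list(match.groups()))
        elif PREFIX_RE.match(line):
            # Dated line without a sender (assumed to be a notice): skip it.
            continue
        elif records and line:
            # No date prefix: assumed to continue the previous message.
            records[-1][3] += "\n" + line
    return records


# Made-up sample lines for illustration.
sample_lines = [
    "04/10/2020, 12:30 - A notice line without a sender\n",
    '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.\n',
    "second line of the same message\n",
    "04/10/2020, 12:35 - Other: Ok\n",
]
rows = parse_chat(sample_lines)
print(rows)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=column_names)` in the same way as `items` above; iterating over the open file object instead of `sample_lines` would feed it the real export.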