I hit a regex roadblock while exploring a dataset of exported WhatsApp chats

Question

I am building a dataset from an exported WhatsApp chat. To work with the data, I need to split each line of the chat log into date, time, sender and message columns.

import pandas as pd
import re

column_names = ["date", "time", "sender", "message"]
data = pd.read_table("datasets/WhatsApp Chat with Makay.txt", sep="re.split(', |- |:', data2)", names = column_names)
data.head()

Output: The entire line/string goes into the date column, while time, sender and message all return NaN values.

Here is an example string:

string: '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.'

Expected output: ['04/10/2020', '12:34', 'Sender', 'Alright. This is the "message", with multiple de-limiters.']

I tried the following patterns:

re.split(', |-|:', string)

re.split('\d\d[- /.]\d\d[- /.]\d\d\d\d[- /.] *?|, *?|- *?|: ', string)

re.split('\d\d[- ./]\d\d[- ./]\d\d\d\d[- ./]|, |- |: ', string)

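But they all failed. Other similarly tagged questions don't seem to answer this particular problem. I also tried the regex101 web app but couldn't find a solution. Any help?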

Answer 1

Instead of using split, you can match all the different parts of the line with 4 capture groups.

^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)

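^ Start of the string (or of each line when re.MULTILINE is used)
(\d{2}/\d{2}/\d{4}) Capture group 1, match a date-like format (you could make this more specific)
, Match a comma and a space literally
(\d{2}:\d{2}) Capture group 2, match a time-like format (you could make this more specific too)
 -  Match space, hyphen, space literally
([^:]+) Capture group 3, match any character except : one or more times
:\s* Match : and optional whitespace characters
(.+) Capture group 4, match any character one or more times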
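
To make the difference between splitting and matching concrete, here is a minimal sketch that runs both on the sample string from the question. The split pattern used here is only an illustrative one (it is not one of the patterns from the question), and the variable names are made up for the example.

```python
import re

line = '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.'

# Splitting on every delimiter also cuts inside the message text,
# because the message itself contains ", ".
pieces = re.split(r", | - |: ", line)
print(len(pieces))  # more pieces than the 4 fields we want

# Matching the whole line with four capture groups keeps the message intact.
pattern = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)")
match = pattern.match(line)
fields = list(match.groups()) if match else []
print(fields)
```

The capture-group version fixes the position of each delimiter in the line, so anything after the first "sender:" is treated as message text no matter which characters it contains.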

Regex demo

import pandas as pd
import re

column_names = ["date", "time", "sender", "message"]
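
# Read the whole export and collect one 4-tuple per matching line.
with open('datasets/WhatsApp Chat with Makay.txt', 'r', encoding="utf8") as file:
    items = re.findall(
        r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)",
        file.read(),
        re.MULTILINE  # let ^ match at the start of every line
    )

# Build the DataFrame once the file has been closed.
df = pd.DataFrame(items, columns=column_names)
pd.set_option('display.max_colwidth', None)  # show the full message text

print(df)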

Output

         date   time  sender                                                     message
0  04/10/2020  12:34  Sender  Alright. This is the "message", with multiple de-limiters.
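
One possible caveat when running the pattern over a whole file: [^:] also matches a newline, so a line that has no "sender:" part may run on into the next line. The sketch below is an assumption-laden variant, not part of the original answer: the two-line sample text is hypothetical, and the stricter pattern (which excludes \n from the sender and uses named groups) is just one way to tighten it.

```python
import re

# Hypothetical two-line export: the first line has no "sender:" part.
text = (
    "04/10/2020, 12:30 - A notice line without a sender\n"
    '04/10/2020, 12:34 - Sender: Alright. This is the "message", with multiple de-limiters.\n'
)

# Pattern from the answer: [^:] can cross a line break.
loose = re.compile(
    r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+):\s*(.+)",
    re.MULTILINE,
)

# Assumed variant: exclude "\n" from the sender and name the groups.
strict = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}), (?P<time>\d{2}:\d{2})"
    r" - (?P<sender>[^:\n]+):\s*(?P<message>.+)",
    re.MULTILINE,
)

loose_rows = loose.findall(text)
strict_rows = [m.groupdict() for m in strict.finditer(text)]

print(repr(loose_rows[0][2]))  # sender field spills over onto the second line
print(strict_rows)             # only the well-formed line is kept
```

A list of dicts like strict_rows can be passed to pd.DataFrame(strict_rows, columns=column_names) in the same way as the tuples returned by re.findall.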
