Text to CSV in nested lists for multiple delimiters (skipping first occurrence)
My data looks like this:
[['15/09/16, 12:21 pm - User1: Hey'],
['15/09/16, 12:22 pm - User2: <Media omitted>'],
["15/09/16, 12:22 pm - User2: It's yesterday's work"],
['15/09/16, 12:22 pm - User1: Gotta work on it.']]
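I am trying to split this nested list into one column each for date, time, username, and message.
Right now my delimiters are:
- ',' to separate the date,
- '-' to separate the time,
- ':' to separate the username from the message.
The problem is that if I split on ':', it also splits the time, because the time has the format XX:XX.
For now, my first step is to get the split right; after that I can move on to converting to CSV.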
Attempt 1:
I tried to split the data directly while reading it, but nothing changed.
delim = ",", "-", ":"
regexPattern = '|'.join(map(re.escape, delim))
data = []
for line in open('/content/drive/My Drive/sample.txt'):
    items = line.rstrip('\r\n').split(regexPattern)  # strip new-line characters and split on column delimiter
    items = [item.strip() for item in items]  # strip extra whitespace off data items
    data.append(items)
Attempt 2:
I tried splitting while writing to the CSV.
delim = ",", "-", ":"
regexPattern = '|'.join(map(re.escape, delim))
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    re.split(regexPattern, data)
    writer.writerows(data)
This raises an error, because split expects a string and I have a list. I am not sure how to reach my main goal.
Any help is appreciated.
This is a perfect case for using regex groups.
import re

s = '15/09/16, 12:21 pm - User1: Hey'
ms = re.match(r'(\d+/\d+/\d+).+?(\d+:\d+).+-\s(.*):\s(.*)', s)
print(ms.groups())  # ('15/09/16', '12:21', 'User1', 'Hey')
You can then join the groups back together into a CSV row.
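As a minimal sketch of that last step, the matched groups can be handed straight to csv.writer. The sample rows come from the question; the in-memory io.StringIO buffer (used here in place of a real file) and the header names are assumptions made for illustration.

```python
import csv
import io
import re

data = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User2: <Media omitted>']]

pattern = re.compile(r'(\d+/\d+/\d+).+?(\d+:\d+).+-\s(.*):\s(.*)')

buffer = io.StringIO()  # stand-in for open('output.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['date', 'time', 'user', 'message'])  # assumed header names
for (line,) in data:  # each inner list holds exactly one string
    match = pattern.match(line)
    if match:  # skip lines that do not fit the expected layout
        writer.writerow(match.groups())

print(buffer.getvalue())
```

Swapping the buffer for a file opened with newline='' should give the same rows on disk.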
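Use the pattern re.compile(r",|\-|\:\s+"). Because the colon must be followed by whitespace, the colon inside the time is left alone.
For example: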
import re

data = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User2: <Media omitted>'],
        ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
        ['15/09/16, 12:22 pm - User1: Gotta work on it.']]
regexPattern = re.compile(r",|\-|\:\s+")
for i in data:
    for j in i:
        print(regexPattern.split(j))
Output:
['15/09/16', ' 12:21 pm ', ' User1', 'Hey']
['15/09/16', ' 12:22 pm ', ' User2', '<Media omitted>']
['15/09/16', ' 12:22 pm ', ' User2', "It's yesterday's work"]
['15/09/16', ' 12:22 pm ', ' User1', 'Gotta work on it.']
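The split above leaves stray spaces around the time and the user name. The following sketch trims them with str.strip, using the same pattern. The helper name split_line and the second sample line (a message containing a comma) are made up here, to show that this pattern also splits inside the message text.

```python
import re

regexPattern = re.compile(r",|\-|\:\s+")

def split_line(line):
    """Split on the delimiters and trim whitespace from each field."""
    return [field.strip() for field in regexPattern.split(line)]

print(split_line('15/09/16, 12:21 pm - User1: Hey'))
# ['15/09/16', '12:21 pm', 'User1', 'Hey']

# Hypothetical message with a comma: it yields five fields instead of four.
print(split_line('15/09/16, 12:22 pm - User1: Gotta work on it, later'))
# ['15/09/16', '12:22 pm', 'User1', 'Gotta work on it', 'later']
```

If messages may contain commas or hyphens, a pattern that captures the whole message as one group is probably the safer choice.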
Use regex grouping.
Demo:
import re

data = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User2: <Media omitted>'],
        ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
        ['15/09/16, 12:22 pm - User1: Gotta work on it, what,hello.']]
regexPattern = re.compile(r"(?P<date>\d{2,}\/\d{2,}\/\d{2,}),\s*(?P<time>\d{2,}:\d{2,}\s*[a-z]{2,})\s*\-\s*(?P<user>\w+)\:\s*(?P<msg>.*)$")
for i in data:
    for j in i:
        print(regexPattern.match(j).groups())
Output:
('15/09/16', '12:21 pm', 'User1', 'Hey')
('15/09/16', '12:22 pm', 'User2', '<Media omitted>')
('15/09/16', '12:22 pm', 'User2', "It's yesterday's work")
('15/09/16', '12:22 pm', 'User1', 'Gotta work on it, what,hello.')
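Since the groups are named, one possible way to finish the job is to feed match.groupdict() to csv.DictWriter. This is only a sketch: the io.StringIO buffer stands in for a real output file, and reusing the group names as CSV column headers is an assumption.

```python
import csv
import io
import re

regexPattern = re.compile(
    r"(?P<date>\d{2,}/\d{2,}/\d{2,}),\s*"
    r"(?P<time>\d{2,}:\d{2,}\s*[a-z]{2,})\s*-\s*"
    r"(?P<user>\w+):\s*(?P<msg>.*)$"
)

data = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User1: Gotta work on it, what,hello.']]

buffer = io.StringIO()  # stand-in for a real output file
writer = csv.DictWriter(buffer, fieldnames=['date', 'time', 'user', 'msg'])
writer.writeheader()
for (line,) in data:  # each inner list holds exactly one string
    match = regexPattern.match(line)
    if match:  # ignore lines that do not match the layout
        writer.writerow(match.groupdict())

print(buffer.getvalue())
```

The csv module quotes the message that contains commas, so it still reads back as a single column.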
Without regex:
def parse(item):
    date_time, user_message = item.split(' - ', 1)
    return [*date_time.split(', '), *user_message.split(': ', 1)]

eggs = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User2: <Media omitted>'],
        ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
        ['15/09/16, 12:22 pm - User1: Gotta work on it.']]
spam = [parse(egg[0]) for egg in eggs]
print(spam)
Output:
[['15/09/16', '12:21 pm', 'User1', 'Hey'],
['15/09/16', '12:22 pm', 'User2', '<Media omitted>'],
['15/09/16', '12:22 pm', 'User2', "It's yesterday's work"],
['15/09/16', '12:22 pm', 'User1', 'Gotta work on it.']]
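- The output above was formatted by me for clarity.
- You need to pass maxsplit=1 explicitly, so that only the first ' - ' and the first ': ' act as separators.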
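To illustrate why maxsplit matters, here is a small sketch with a made-up message that itself contains ' - ' and ': '. The sample line is hypothetical and not part of the original data.

```python
def parse(item):
    date_time, user_message = item.split(' - ', 1)
    return [*date_time.split(', '), *user_message.split(': ', 1)]

# Hypothetical message that contains both ' - ' and ': '.
line = '15/09/16, 12:23 pm - User2: Note: call me - asap'

print(parse(line))
# ['15/09/16', '12:23 pm', 'User2', 'Note: call me - asap']

# Without maxsplit the message is broken up as well.
print(line.split(' - ')[1].split(': '))
# ['User2', 'Note', 'call me']
```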