Splitting a List Based on a Substring
I have the following list:
['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
I want to split this list into multiple lists so that each sublist contains the substring "(Reg)" exactly once:
[['1(Reg)', '100', '103', '102', '100'],
['2(Reg)', '98', '101', '100'],
['3(Reg)', '96', '99', '98'],
['4(Reg)', '100', '100', '100', '100'],
['5(Reg)', '98', '99', '99', '100'],
['6(Reg)', '99.47', '99.86', '99.67', '100']]
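I tried joining the list with a delimiter and then splitting on (Reg), but that didn't work. How can I split the list into the nested list shown above?
We can use a for loop with two lists: one to build the current row, and another to store all the rows collected so far. For example: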
rows = []
row = []
for word in data:
    if '(Reg)' in word:
        rows.append(row)
        row = []
    row.append(word)
rows.append(row)
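where data is the initial list of strings.
This has one problem, though: it starts by adding an empty row (given that the first element contains (Reg)). We can prevent that by only appending non-empty rows, like: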
rows = []
row = []
for word in data:
    if '(Reg)' in word:
        if row:
            rows.append(row)
        row = []
    row.append(word)
if row:
    rows.append(row)
We can generalize the above into a dedicated function:
def split_at(data, predicate, with_empty=False):
    rows = []
    row = []
    for word in data:
        if predicate(word):
            # start a new row whenever the predicate matches
            if with_empty or row:
                rows.append(row)
            row = []
        row.append(word)
    if with_empty or row:
        rows.append(row)
    return rows
We can then call it like this:
split_at(our_list, lambda x: '(Reg)' in x)
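To make the effect of with_empty concrete, here is a small self-contained sketch. The short sample list and the name is_marker are made up for illustration; the function body is the one from this answer.

```python
def split_at(data, predicate, with_empty=False):
    rows = []
    row = []
    for word in data:
        if predicate(word):
            if with_empty or row:
                rows.append(row)
            row = []
        row.append(word)
    if with_empty or row:
        rows.append(row)
    return rows

# Hypothetical shortened input, not the full list from the question.
sample = ['1(Reg)', '100', '103', '2(Reg)', '98']
is_marker = lambda x: '(Reg)' in x

default_rows = split_at(sample, is_marker)
# [['1(Reg)', '100', '103'], ['2(Reg)', '98']]

with_empty_rows = split_at(sample, is_marker, with_empty=True)
# [[], ['1(Reg)', '100', '103'], ['2(Reg)', '98']]
```

With with_empty=True the leading empty row is kept, which is only useful if you want to know that nothing came before the first marker.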
A slightly different (optimized) version of WVO's answer:
splitted = []
for item in l:
    if '(Reg)' in item:
        splitted.append([])
    splitted[-1].append(item)
#[['1(Reg)', '100', '103', '102', '100'], ['2(Reg)', '98', '101', '100'],
# ['3(Reg)', '96', '99', '98'], ['4(Reg)', '100', '100', '100', '100'],
# ['5(Reg)', '98', '99', '99', '100'],
# ['6(Reg)', '99.47', '99.86', '99.67', '100']]
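One caveat with this shorter loop: if the list does not start with a (Reg) element, splitted[-1] raises an IndexError because no sublist exists yet. A possible guard, as a rough sketch (the function name split_on_marker and its marker parameter are my own, not from the answer above):

```python
def split_on_marker(items, marker='(Reg)'):
    groups = []
    for item in items:
        # Open a new group on a marker, or when no group exists yet
        # (covers input that does not begin with a marker).
        if marker in item or not groups:
            groups.append([])
        groups[-1].append(item)
    return groups

leading = split_on_marker(['x', '1(Reg)', '100', '2(Reg)', '98'])
# [['x'], ['1(Reg)', '100'], ['2(Reg)', '98']]
```

Whether leading items should form their own group or be dropped depends on your data; this version keeps them.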
You can use itertools.groupby with a regular expression:
import itertools
import re
s = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
new_data = [list(b) for _, b in itertools.groupby(s, key=lambda x: bool(re.findall(r'\d+\(', x)))]
final_data = [new_data[i]+new_data[i+1] for i in range(0, len(new_data), 2)]
Output:
[['1(Reg)', '100', '103', '102', '100'],
['2(Reg)', '98', '101', '100'],
['3(Reg)', '96', '99', '98'],
['4(Reg)', '100', '100', '100', '100'],
['5(Reg)', '98', '99', '99', '100'],
['6(Reg)', '99.47', '99.86', '99.67', '100']]
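To see why adding each pair of neighbouring groups works, it may help to look at what groupby actually yields. A small sketch on a shortened, made-up list, using a plain substring test instead of the regex:

```python
from itertools import groupby

# Hypothetical shortened input for illustration.
sample = ['1(Reg)', '100', '103', '2(Reg)', '98']

# groupby collapses consecutive items with the same key into one run.
runs = [(is_marker, list(group))
        for is_marker, group in groupby(sample, key=lambda x: '(Reg)' in x)]
# [(True, ['1(Reg)']), (False, ['100', '103']),
#  (True, ['2(Reg)']), (False, ['98'])]

# Marker runs and value runs alternate, so each even-indexed run is
# glued to the run right after it.
merged = [runs[i][1] + runs[i + 1][1] for i in range(0, len(runs), 2)]
# [['1(Reg)', '100', '103'], ['2(Reg)', '98']]
```

This relies on the runs strictly alternating and starting with a marker. Two (Reg) elements in a row, or a trailing (Reg) with no values after it, would break the pairing.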
Here is one way, though not necessarily the best way:
from itertools import zip_longest
lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100',
'3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100',
'5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
indices = [i for i, j in enumerate(lst) if '(Reg)' in j]
lst_new = [lst[i:j] for i, j in zip_longest(indices, indices[1:])]
# [['1(Reg)', '100', '103', '102', '100'],
# ['2(Reg)', '98', '101', '100'],
# ['3(Reg)', '96', '99', '98'],
# ['4(Reg)', '100', '100', '100', '100'],
# ['5(Reg)', '98', '99', '99', '100'],
# ['6(Reg)', '99.47', '99.86', '99.67', '100']]
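The trick here is how zip_longest pairs each start index with the next one and pads the last pair with None, so the final slice runs to the end of the list. A sketch on a shortened, made-up list:

```python
from itertools import zip_longest

# Hypothetical shortened input for illustration.
lst = ['1(Reg)', '100', '103', '2(Reg)', '98', '3(Reg)', '96', '99']

indices = [i for i, item in enumerate(lst) if '(Reg)' in item]
# [0, 3, 5]

pairs = list(zip_longest(indices, indices[1:]))
# [(0, 3), (3, 5), (5, None)]  -> lst[5:None] is the same as lst[5:]

chunks = [lst[i:j] for i, j in pairs]
# [['1(Reg)', '100', '103'], ['2(Reg)', '98'], ['3(Reg)', '96', '99']]
```

Note that anything before the first (Reg) element is silently dropped by this approach, since slicing starts at the first marker index.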
OK, here is my take on a super-simple standard list comprehension (very similar to @jp_data_analysis's answer):
>>> from pprint import pprint
>>> d = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
>>> idx = list(filter(lambda i: d[i].endswith("(Reg)"), range(len(d)))) + [len(d)]
>>> idx
[0, 5, 9, 13, 18, 23, 28]
>>> res = [d[idx[i-1]:idx[i]] for i in range(1,len(idx))]
>>> pprint(res)
[['1(Reg)', '100', '103', '102', '100'],
['2(Reg)', '98', '101', '100'],
['3(Reg)', '96', '99', '98'],
['4(Reg)', '100', '100', '100', '100'],
['5(Reg)', '98', '99', '99', '100'],
['6(Reg)', '99.47', '99.86', '99.67', '100']]
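Explanation: idx holds the index of every element that ends with (Reg), plus the length of the list as its last element. The list res is then defined by the intervals between those indices.
Philosophical note: every time you run into a problem like this, ask yourself: how did I get here? Why do I have to deal with some super-fragile implicit string-formatting rules instead of a real data structure? One that accounts for the intervals and the hierarchy of the data? One that enforces constraints by design and allows simple queries? Find someone to blame on Twitter and vent loudly :)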
Using itertools.groupby:
lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
from itertools import groupby
[a+b for a,b in zip(*([iter(list(g) for k, g in groupby(lst, lambda x:'Reg' in x))]*2))]
Output:
[['1(Reg)', '100', '103', '102', '100'],
['2(Reg)', '98', '101', '100'],
['3(Reg)', '96', '99', '98'],
['4(Reg)', '100', '100', '100', '100'],
['5(Reg)', '98', '99', '99', '100'],
['6(Reg)', '99.47', '99.86', '99.67', '100']]
You can also try this:
from itertools import groupby
lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100',
'3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100',
'5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
grouped = [list(g) for k, g in groupby(lst, key = lambda x: x.endswith('(Reg)'))]
result = [x + y for x, y in zip(grouped[0::2], grouped[1::2])]
print(result)
Which outputs:
[['1(Reg)', '100', '103', '102', '100'], ['2(Reg)', '98', '101', '100'], ['3(Reg)', '96', '99', '98'], ['4(Reg)', '100', '100', '100', '100'], ['5(Reg)', '98', '99', '99', '100'], ['6(Reg)', '99.47', '99.86', '99.67', '100']]
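One edge case worth knowing about: if the list ends with a (Reg) element that has no values after it, plain zip drops that last group. A possible workaround, as a sketch only, is zip_longest with an empty-list fill value (the shortened sample list is made up):

```python
from itertools import groupby, zip_longest

# Hypothetical input whose last marker has no values after it.
lst = ['1(Reg)', '100', '103', '2(Reg)']

grouped = [list(g) for _, g in groupby(lst, key=lambda x: x.endswith('(Reg)'))]
# [['1(Reg)'], ['100', '103'], ['2(Reg)']]

strict = [x + y for x, y in zip(grouped[0::2], grouped[1::2])]
# [['1(Reg)', '100', '103']]  -> the trailing '2(Reg)' is lost

padded = [x + y for x, y in
          zip_longest(grouped[0::2], grouped[1::2], fillvalue=[])]
# [['1(Reg)', '100', '103'], ['2(Reg)']]
```

This still assumes no two (Reg) elements are adjacent, because groupby would merge them into a single run.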
Here is another approach without any libraries. It is a list comprehension based on DYZ's answer:
w = []
[w.append([e]) if '(Reg)' in e else w[-1].append(e) for e in data]