How to optimize nested for loop or alternative approach for the below scenario
I have a list of more than 2000 tuples, each of length 3, e.g. [(4, 7, 10), (12, 40, 47), (2, 12, 23), (13, 16, 17), (37, 38, 47), ..., (41, 47, 48)]. My objective is to find all sets of 3 tuples that follow these rules:
- The first element of the first tuple must be less than the second element of the second tuple, which in turn must be less than the third element of the third tuple:
t1[0] < t2[1] < t3[2]
Example: tuples 1, 3 and 4. (**4**, 7, 10), (2, **12**, 23), (13, 16, **17**)
**4 < 12 < 17**
- There must not be any duplicate value within a set. For example: tuples 1, 2 and 4 of the list, (4, 7, 10), (12, 40, 47) and (13, 16, 17), can be considered. We do not consider the 3rd tuple because it contains the value 12, which already appears in the second tuple (12, 40, 47).
If we pick the first tuple at index 1, we should look for the second tuple at indices greater than 1, and the third tuple's index should be greater than the second tuple's index.
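A minimal sketch of the two rules, using the example tuples above:

```python
# Illustrating the two rules with the example tuples from the question.
t1, t2, t3 = (4, 7, 10), (2, 12, 23), (13, 16, 17)

# Rule 1: first element of t1 < second element of t2 < third element of t3.
rule1 = t1[0] < t2[1] < t3[2]        # 4 < 12 < 17

# Rule 2: no value appears in more than one tuple of the set
# (3 tuples x 3 elements must give 9 distinct values).
rule2 = len({*t1, *t2, *t3}) == 9

print(rule1, rule2)
```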
Below is my program. The problem is the time it takes to find all sets: for ranges of 500+ it is very high, over 30 minutes to unknown. Please help me optimize it.
from datetime import datetime
import itertools
import numpy as np

ts = datetime.now()
outcomes = loadSample("sample_data.txt")
ne = outcomes[0:300]
combinationsList = []
for i in range(0, len(ne)):
    for j in range(i+1, len(ne)):
        tt1 = [ne[i], ne[j]]
        # prune: skip this j unless the pair is ordered and value-disjoint
        if not ((ne[i][0] < ne[j][1]) and (len(set(itertools.chain(*tt1))) == 6)):
            continue
        for k in range(j+1, len(ne)):
            t = (i, j, k)
            t1 = [ne[i], ne[j], ne[k]]
            if (ne[j][0] < ne[k][1]) and (len(set(itertools.chain(*t1))) == 9):
                A = np.array([ne[i], ne[j], ne[k]])
                b = np.diag(A)
                is_sorted = lambda a: np.all(a[:-1] <= a[1:])
                if is_sorted(b):
                    combinationsList.append(t)
print("Time to create : ", datetime.now() - ts)
You can find the tuple file I used for testing here.
Thanks a lot in advance.
Option 1 - dictionary based
Here is an optimized solution. The general idea is to build two dictionaries of potential matches:
- from the second tuple to the first tuple (match_21).
- from the second tuple to the third tuple (match_23).
In addition, we build a set of potential matches between tuple-1 and tuple-3 (match_13).
We then iterate over all potential second tuples, and look for matches between items in the potential tuple-1 list and the potential tuple-3 list by checking whether each pair exists in match_13.
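The no-duplicates rule is checked pairwise with `set.isdisjoint`; a quick illustration with tuples from the question:

```python
# set.isdisjoint is True exactly when the two iterables share no value.
a = set((4, 7, 10)).isdisjoint((12, 40, 47))   # no common value
b = set((12, 40, 47)).isdisjoint((2, 12, 23))  # both contain 12
print(a, b)
```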
import itertools
import datetime

ts = datetime.datetime.now()

print("Reading file")
tuples = []
with open("Sample_data.txt") as f:
    for line in f:
        line = line.strip()
        line = line.replace("(", "").replace(")", "")
        t = tuple(int(x) for x in line.split(","))
        tuples.append(t)
print(f"loaded {len(tuples)} tuples")

print("Building match 1-2")
match_21 = {}
for t2 in range(1, len(tuples)-1):
    matches_for_t2 = []
    for t1 in range(t2):
        if tuples[t1][0] < tuples[t2][1] and set(tuples[t1]).isdisjoint(tuples[t2]):
            matches_for_t2.append(t1)
    match_21[t2] = matches_for_t2
print(len(match_21))

print("Building match 2-3")
match_23 = {}
for i in range(1, len(tuples)-1):
    matches_for_i = []
    for j in range(i+1, len(tuples)):
        if tuples[i][1] < tuples[j][2] and set(tuples[i]).isdisjoint(tuples[j]):
            matches_for_i.append(j)
    match_23[i] = matches_for_i
print(len(match_23))

print("Building match 1-3")
match_13 = set()
for t1 in range(0, len(tuples)-2):
    for t3 in range(t1 + 1, len(tuples)):
        if tuples[t1][0] < tuples[t3][2] and set(tuples[t1]).isdisjoint(tuples[t3]):
            match_13.add((t1, t3))
print(len(match_13))

print("Looking for matches")
max_t2 = len(tuples)-1
# max_t2 = 20 # for debugging
matches = []
for t2 in range(1, max_t2):
    potential_t1 = match_21[t2]
    potential_t3 = match_23[t2]
    for t1, t3 in itertools.product(potential_t1, potential_t3):
        if (t1, t3) in match_13:
            matches.append((t1, t2, t3))
    print(".", end="")
print(len(matches))
print(matches[:10])
print("Time to create : ", datetime.datetime.now() - ts)
This takes about 2:15 minutes on my laptop.
Option 2 - using pandas
Another option is to leverage the power of pandas to merge multiple datasets. It is faster, but not by much.
import pandas as pd

print("Reading file")
tuples = []
with open("Sample_data.txt") as f:
    for line in f:
        line = line.strip()
        line = line.replace("(", "").replace(")", "")
        t = tuple(int(x) for x in line.split(","))
        tuples.append(t)

tuple_df = pd.DataFrame.from_records(tuples)
tuple_df.reset_index(inplace=True)

# create 1-2 match
t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
             on="dummy", suffixes=["_t1", "_t2"])
m_12 = t[(t.index_t1 < t.index_t2) & (t["0_t1"] < t["1_t2"])]
cond = (m_12["0_t1"] != m_12["0_t2"]) & (m_12["1_t1"] != m_12["0_t2"]) & (m_12["2_t1"] != m_12["0_t2"]) & \
       (m_12["0_t1"] != m_12["1_t2"]) & (m_12["1_t1"] != m_12["1_t2"]) & (m_12["2_t1"] != m_12["1_t2"]) & \
       (m_12["0_t1"] != m_12["2_t2"]) & (m_12["1_t1"] != m_12["2_t2"]) & (m_12["2_t1"] != m_12["2_t2"])
m_12 = m_12[cond]

# create 2-3 match
t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
             on="dummy", suffixes=["_t2", "_t3"])
m_23 = t[(t.index_t2 < t.index_t3) & (t["1_t2"] < t["2_t3"])]
cond = (m_23["0_t2"] != m_23["0_t3"]) & (m_23["1_t2"] != m_23["0_t3"]) & (m_23["2_t2"] != m_23["0_t3"]) & \
       (m_23["0_t2"] != m_23["1_t3"]) & (m_23["1_t2"] != m_23["1_t3"]) & (m_23["2_t2"] != m_23["1_t3"]) & \
       (m_23["0_t2"] != m_23["2_t3"]) & (m_23["1_t2"] != m_23["2_t3"]) & (m_23["2_t2"] != m_23["2_t3"])
m_23 = m_23[cond]

# create 1-3 match
t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
             on="dummy", suffixes=["_t1", "_t3"])
m_13 = t[(t.index_t1 < t.index_t3) & (t["0_t1"] < t["2_t3"])]
cond = (m_13["0_t1"] != m_13["0_t3"]) & (m_13["1_t1"] != m_13["0_t3"]) & (m_13["2_t1"] != m_13["0_t3"]) & \
       (m_13["0_t1"] != m_13["1_t3"]) & (m_13["1_t1"] != m_13["1_t3"]) & (m_13["2_t1"] != m_13["1_t3"]) & \
       (m_13["0_t1"] != m_13["2_t3"]) & (m_13["1_t1"] != m_13["2_t3"]) & (m_13["2_t1"] != m_13["2_t3"])
m_13 = m_13[cond]

# Merge 1-2 and 2-3
m_123 = pd.merge(m_12[["index_t1", "index_t2"]], m_23[["index_t2", "index_t3"]], on="index_t2")
# filter by valid 1-3 matches
res = pd.merge(m_123, m_13[["index_t1", "index_t3"]], on=["index_t1", "index_t3"], how="inner")
All in all, this takes about 1:15 minutes on my laptop (but requires a lot of memory).
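As a side note, on pandas >= 1.2 the dummy-column trick can be replaced with `how="cross"`. A small sketch of the same 1-2 merge, using just the three example tuples rather than the full file:

```python
import pandas as pd

# Cross join via how="cross" (pandas >= 1.2) instead of a dummy key column.
tuple_df = pd.DataFrame.from_records([(4, 7, 10), (2, 12, 23), (13, 16, 17)])
tuple_df.reset_index(inplace=True)

t = pd.merge(tuple_df, tuple_df, how="cross", suffixes=["_t1", "_t2"])
# Same 1-2 filter as above: earlier index and t1[0] < t2[1].
m_12 = t[(t.index_t1 < t.index_t2) & (t["0_t1"] < t["1_t2"])]
print(len(m_12))
```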