How to optimize nested for loop or alternative approach for the below scenario

I have a list of more than 2,000 tuples of length 3, e.g. [(4, 7, 10), (12, 40, 47), (2, 12, 23), (13, 16, 17), (37, 38, 47), .... (41, 47, 48)]. My objective is to find all sets of 3 tuples that follow these rules:

  1. The first element of the first tuple should be less than the second element of the second tuple, which in turn should be less than the third element of the third tuple: t[0][0] < t[1][1] < t[2][2]. Example: tuples 1, 3 and 4: (**4**, 7, 10), (2, **12**, 23), (13, 16, **17**)

**4 < 12 < 17**

  2. There should not be any duplicate values across the set. For example, the 1st, 2nd and 4th tuples in the list can be considered together: (4, 7, 10), (12, 40, 47) and (13, 16, 17). We do not consider the 3rd one, because it contains the element 12, which already exists in the second tuple (12, 40, 47).

  3. If we pick the first tuple from index 1, we should look for the second tuple at an index greater than 1, and the index of the third tuple should be greater than the index of the second tuple.
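For reference, rules 1 and 2 for a single candidate triple can be written as one predicate (a sketch; `is_valid_triple` is a name introduced here for illustration and does not appear in my code below):

```python
from itertools import chain

def is_valid_triple(t1, t2, t3):
    # Rule 1: chained ordering across the triple
    # Rule 2: all nine values must be distinct
    # Rule 3 (index ordering i < j < k) is handled by the caller
    return t1[0] < t2[1] < t3[2] and len(set(chain(t1, t2, t3))) == 9

# The example from rule 1: tuples 1, 3 and 4
print(is_valid_triple((4, 7, 10), (2, 12, 23), (13, 16, 17)))  # True
# Tuples 1, 2 and 3 fail: 40 is not < 23, and 12 repeats
print(is_valid_triple((4, 7, 10), (12, 40, 47), (2, 12, 23)))  # False
```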

Below is my program. The problem is the time taken to find all the sets: once the range goes beyond 500+ it becomes extremely high (from 30+ minutes up to an unknown duration). Please help me optimize it.

    import itertools
    from datetime import datetime
    import numpy as np

    ts = datetime.now()

    outcomes = loadSample("sample_data.txt")
    ne = outcomes[0:300]

    combinationsList = []
    for i in range(len(ne)):
        for j in range(i + 1, len(ne)):
            # prune early: partial rule 1 and no duplicate values within the pair
            tt1 = [ne[i], ne[j]]
            if not ((ne[i][0] < ne[j][1]) and len(set(itertools.chain(*tt1))) == 6):
                continue
            for k in range(j + 1, len(ne)):
                t = (i, j, k)
                t1 = [ne[i], ne[j], ne[k]]
                if (ne[j][1] < ne[k][2]) and len(set(itertools.chain(*t1))) == 9:
                    # rule 1 over the diagonal: ne[i][0] <= ne[j][1] <= ne[k][2]
                    A = np.array(t1)
                    b = np.diag(A)
                    if np.all(b[:-1] <= b[1:]):
                        combinationsList.append(t)

    print("Time to create : ", datetime.now() - ts)

You can find the tuple file I used for testing here.

sample_data

Thank you very much.

Option 1 - dictionary based

I believe this is an optimized solution. The general idea is to build two dictionaries of potential matches:

  • from the second tuple to the first tuple (match_21).
  • from the second tuple to the third tuple (match_23).

In addition, we build a set of potential matches between tuple-1 and tuple-3 (match_13).

Then we iterate over all potential second tuples, and find matches between the items in the potential tuple-1 list and the potential tuple-3 list by checking whether each pair exists in match_13.

    import itertools
    import datetime

    ts = datetime.datetime.now()

    print("Reading file")
    tuples = []
    with open("Sample_data.txt") as f:
        for line in f:
            line = line.strip()
            line = line.replace("(", "").replace(")", "")
            t = tuple(int(x) for x in line.split(","))
            tuples.append(t)

    print(f"loaded {len(tuples)} tuples")

    print("Building match 1-2")
    match_21 = {}
    for t2 in range(1, len(tuples)-1):
        matches_for_t2 = []
        for t1 in range(t2):
            if tuples[t1][0] < tuples[t2][1] and set(tuples[t1]).isdisjoint(tuples[t2]):
                matches_for_t2.append(t1)
        match_21[t2] = matches_for_t2
    print(len(match_21))

    print("Building match 2-3")
    match_23 = {}
    for i in range(1, len(tuples)-1):
        matches_for_i = []
        for j in range(i+1, len(tuples)):
            if tuples[i][1] < tuples[j][2] and set(tuples[i]).isdisjoint(tuples[j]):
                matches_for_i.append(j)
        match_23[i] = matches_for_i
    print(len(match_23))

    print("Building match 1-3")
    match_13 = set()
    for t1 in range(0, len(tuples)-2):
        for t3 in range(t1 + 1, len(tuples)):
            if tuples[t1][0] < tuples[t3][2] and set(tuples[t1]).isdisjoint(tuples[t3]):
                match_13.add((t1, t3))
    print(len(match_13))

    print("Looking for matches")
    max_t2 = len(tuples)-1
    # max_t2 = 20 # for debugging

    matches = []
    for t2 in range(1, max_t2):
        potential_t1 = match_21[t2]
        potential_t3 = match_23[t2]
        for t1, t3 in itertools.product(potential_t1, potential_t3):
            if (t1, t3) in match_13:
                matches.append((t1, t2, t3))
        print(".", end="")

    print(len(matches))
    print(matches[:10])

    print("Time to create : ", datetime.datetime.now() - ts)

This takes about 2:15 minutes on my laptop.
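A possible further micro-optimization (my own suggestion, not part of the solution above): much of the time in the matching loops goes into building a fresh `set` per pair. If the element values are small non-negative integers, as in the sample data, disjointness can instead be tested with precomputed bitmasks:

```python
def to_mask(t):
    # Encode a tuple of small non-negative ints as a bitmask so that
    # disjointness becomes a single AND instead of building a set.
    m = 0
    for v in t:
        m |= 1 << v
    return m

tuples = [(4, 7, 10), (12, 40, 47), (2, 12, 23)]
masks = [to_mask(t) for t in tuples]

# Two tuples are disjoint iff their masks share no bits.
print((masks[0] & masks[1]) == 0)  # True  (no shared values)
print((masks[1] & masks[2]) == 0)  # False (both contain 12)
```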

Option 2 - using pandas

Another option is to leverage the power of pandas for merging multiple data sets. It is faster, but not by much.

    import pandas as pd

    print("Reading file")
    tuples = []
    with open("Sample_data.txt") as f:
        for line in f:
            line = line.strip()
            line = line.replace("(", "").replace(")", "")
            t = tuple(int(x) for x in line.split(","))
            tuples.append(t)
    tuple_df = pd.DataFrame.from_records(tuples)
    tuple_df.reset_index(inplace=True)

    t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
                 on="dummy", suffixes=["_t1", "_t2"])

    # create 1-2 match
    m_12 = t[(t.index_t1 < t.index_t2) & (t["0_t1"] < t["1_t2"])]

    cond = (m_12["0_t1"] != m_12["0_t2"]) & (m_12["1_t1"] != m_12["0_t2"]) & (m_12["2_t1"] != m_12["0_t2"]) & \
           (m_12["0_t1"] != m_12["1_t2"]) & (m_12["1_t1"] != m_12["1_t2"]) & (m_12["2_t1"] != m_12["1_t2"]) & \
           (m_12["0_t1"] != m_12["2_t2"]) & (m_12["1_t1"] != m_12["2_t2"]) & (m_12["2_t1"] != m_12["2_t2"])

    m_12 = m_12[cond]

    # create 2-3 match
    t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
                 on="dummy", suffixes=["_t2", "_t3"])

    m_23 = t[(t.index_t2 < t.index_t3) & (t["1_t2"] < t["2_t3"])]

    cond = (m_23["0_t2"] != m_23["0_t3"]) & (m_23["1_t2"] != m_23["0_t3"]) & (m_23["2_t2"] != m_23["0_t3"]) & \
           (m_23["0_t2"] != m_23["1_t3"]) & (m_23["1_t2"] != m_23["1_t3"]) & (m_23["2_t2"] != m_23["1_t3"]) & \
           (m_23["0_t2"] != m_23["2_t3"]) & (m_23["1_t2"] != m_23["2_t3"]) & (m_23["2_t2"] != m_23["2_t3"])

    m_23 = m_23[cond]

    # create 1-3 match
    t = pd.merge(tuple_df.assign(dummy=1), tuple_df.assign(dummy=1),
                 on="dummy", suffixes=["_t1", "_t3"])

    m_13 = t[(t.index_t1 < t.index_t3) & (t["0_t1"] < t["2_t3"])]

    cond = (m_13["0_t1"] != m_13["0_t3"]) & (m_13["1_t1"] != m_13["0_t3"]) & (m_13["2_t1"] != m_13["0_t3"]) & \
           (m_13["0_t1"] != m_13["1_t3"]) & (m_13["1_t1"] != m_13["1_t3"]) & (m_13["2_t1"] != m_13["1_t3"]) & \
           (m_13["0_t1"] != m_13["2_t3"]) & (m_13["1_t1"] != m_13["2_t3"]) & (m_13["2_t1"] != m_13["2_t3"])

    m_13 = m_13[cond]

    # Merge 1-2 and 2-3
    m_123 = pd.merge(m_12[["index_t1", "index_t2"]], m_23[["index_t2", "index_t3"]], on="index_t2")

    # filter by valid 1-3 matches
    res = pd.merge(m_123, m_13[["index_t1", "index_t3"]], on=["index_t1", "index_t3"], how="inner")

All in all, this takes about 1:15 minutes on my laptop (but requires a lot of memory).
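To map the index triples in `res` back to the original tuples, something along these lines works (a sketch using a tiny hand-built stand-in for the data and the result frame, since `Sample_data.txt` is external):

```python
import pandas as pd

# Hypothetical stand-in for the loaded data (first four tuples of the example)
tuples = [(4, 7, 10), (12, 40, 47), (2, 12, 23), (13, 16, 17)]

# Hand-built stand-in for the merge pipeline's output: one valid index triple
res = pd.DataFrame({"index_t1": [0], "index_t2": [2], "index_t3": [3]})

# Map each row of index columns back to the original tuples
triples = [
    (tuples[r.index_t1], tuples[r.index_t2], tuples[r.index_t3])
    for r in res.itertuples()
]
print(triples)  # [((4, 7, 10), (2, 12, 23), (13, 16, 17))]
```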