如何解析树中常见的 children 并在 Python 中给它们唯一的名称？

Question

我在 parent/child 关系中表示了以下树：

import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]

一些节点有多个parents。这应该通过根据路径为这些常见的 children 不同的 ID 来删除。这是（右树）之后的样子：

我写了一个函数，应该为每个节点写一个路径并将其添加到名称中，结果收集在字典中"res"。经过几天的尝试，这似乎是一种糟糕的方法，因为它不会拆分路径。下面是节点 H.

的示例

知道如何转换树吗？

res = {}
def find_parent(child, path):

    path.append(str(child))
    print("Path:    ", path)

    parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
    print("Parents: ",parents)

    if not parents:
        print("Path end reached!")
        print("Result: ", res)
        # i+1

    else :
        for i in range(0,len(parents)-1):
            if len(parents)>1: #dann neue paths
                path = [(str(child))]

                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()
                find_parent(parents[i], path)

            else: 
                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()

                find_parent(parents[0],path)

    return res

节点 "H"

的示例结果

find_parent("H", [])

{'path_0': 'FB'}

它应该给出 H_FBA、HFCA 和 H_GCA。

Answer 1

您可以使用各种典型的图形转换来执行此操作。在深入研究代码之前，我们在原始数据帧中使用带有一个源节点（即没有传入边的节点）的 directed acyclic graph。这些属性大大简化了事情，并保证我们可以创建一个唯一的树，如您所需的输出所示。

平面数据框作为图表并不容易操作，所以我采取的第一步是将其变成 adjacency list。

接下来，需要执行几个准备步骤：

找到源节点（虽然我使用了一个处理多个源的通用图形函数，但我们保证只有一个源，所以我们将从返回的集合中取出唯一的项目）。单独的源节点将成为新树的根。
运行一个从源开始提取 DAG 中所有路径的函数。
创建一个将节点名称映射到传入边计数的字典，这有助于指导重命名过程。
创建一个重命名字典，让我们跟踪哪些节点已重命名。

构建这些结构后，迭代图中所有可能的路径以构建重新标记的图，根据您的规范格式化具有多个传入边的节点（反向路径节点列表）并将新名称存储在重命名字典中.

最后，对这个重新标记的图的扁平化版本进行排序，并将其转储到结果数据框中。

代码如下：

import pandas as pd
from collections import defaultdict

def find_source_nodes(graph):
    sources = set(graph)

    for neighbors in graph.values():
        sources = sources - neighbors

    return sources

def count_incoming_edges(graph):
    counts = defaultdict(int)

    for neighbors in graph.values():
        for neighbor in neighbors:
            counts[neighbor] += 1

    return counts

def find_all_paths_in_dag(graph, src, path=[]):
    path.append(src)

    if src in graph and graph[src]:
        for neighbor in graph[src]:
            yield from find_all_paths_in_dag(graph, neighbor, path)
    else:
        yield path[:]

    path.pop()

def flatten_adjacency_list(adj_list):
    flat_graph = []

    for parent, children in adj_list.items():
        flat_graph.extend([(parent, child) for child in children])

    return flat_graph

def relabel_dag(graph, root):
    relabeled_graph = defaultdict(set)
    all_paths = find_all_paths_in_dag(graph, root)
    incoming_edge_counts = count_incoming_edges(graph)
    renamed = {k: k for k in graph}

    for path in all_paths:
        for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
            if incoming_edge_counts[dst] > 1:
                renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"

            relabeled_graph[renamed[src]].add(dst)

    return relabeled_graph

if __name__ == "__main__":
    df = pd.DataFrame(columns=["Parent", "Child"])
    df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
    df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
    graph = defaultdict(set)

    for parent, child in zip(df["Parent"], df["Child"]):
        graph[parent].add(child)

    root = next(iter(find_source_nodes(graph)))
    relabeled_tree = relabel_dag(graph, root)
    flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
    relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
    print(relabeled_df)

输出：

   Parent  Child
0       A      B
1       A      C
2       A    E_A
3       B      D
4       B   E_BA
5       B   F_BA
6       C   F_CA
7       C      G
8    F_BA  H_FBA
9    F_CA  H_FCA
10      G  H_GCA
11      G      I

如何解析树中常见的 children 并在 Python 中给它们唯一的名称？

How to resolve common children in tree and give them unique names in Python?

python

tree

recursion

parent-child