如何解析树中常见的 children 并在 Python 中给它们唯一的名称?
How to resolve common children in tree and give them unique names in Python?
我在 parent/child 关系中表示了以下树:
import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]
一些节点有多个parents。这应该通过根据路径为这些常见的 children 不同的 ID 来删除。这是(右树)之后的样子:
我写了一个函数,应该为每个节点写一个路径并将其添加到名称中,结果收集在字典中"res"。经过几天的尝试,这似乎是一种糟糕的方法,因为它不会拆分路径。下面是节点 H.
的示例
知道如何转换树吗?
res = {}
def find_parent(child, path):
path.append(str(child))
print("Path: ", path)
parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
print("Parents: ",parents)
if not parents:
print("Path end reached!")
print("Result: ", res)
# i+1
else :
for i in range(0,len(parents)-1):
if len(parents)>1: #dann neue paths
path = [(str(child))]
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[i], path)
else:
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[0],path)
return res
节点 "H"
的示例结果
find_parent("H", [])
{'path_0': 'FB'}
它应该给出 H_FBA、HFCA 和 H_GCA。
您可以使用各种典型的图形转换来执行此操作。在深入研究代码之前,我们在原始数据帧中使用带有一个源节点(即没有传入边的节点)的 directed acyclic graph。这些属性大大简化了事情,并保证我们可以创建一个唯一的树,如您所需的输出所示。
平面数据框作为图表并不容易操作,所以我采取的第一步是将其变成 adjacency list。
接下来,需要执行几个准备步骤:
- 找到源节点(虽然我使用了一个处理多个源的通用图形函数,但我们保证只有一个源,所以我们将从返回的集合中取出唯一的项目)。单独的源节点将成为新树的根。
- 运行 一个从源开始提取 DAG 中所有路径的函数。
- 创建一个将节点名称映射到传入边计数的字典,这有助于指导重命名过程。
- 创建一个重命名字典,让我们跟踪哪些节点已重命名。
构建这些结构后,迭代图中所有可能的路径以构建重新标记的图,根据您的规范格式化具有多个传入边的节点(反向路径节点列表)并将新名称存储在重命名字典中.
最后,对这个重新标记的图的扁平化版本进行排序,并将其转储到结果数据框中。
代码如下:
import pandas as pd
from collections import defaultdict
def find_source_nodes(graph):
sources = set(graph)
for neighbors in graph.values():
sources = sources - neighbors
return sources
def count_incoming_edges(graph):
counts = defaultdict(int)
for neighbors in graph.values():
for neighbor in neighbors:
counts[neighbor] += 1
return counts
def find_all_paths_in_dag(graph, src, path=[]):
path.append(src)
if src in graph and graph[src]:
for neighbor in graph[src]:
yield from find_all_paths_in_dag(graph, neighbor, path)
else:
yield path[:]
path.pop()
def flatten_adjacency_list(adj_list):
flat_graph = []
for parent, children in adj_list.items():
flat_graph.extend([(parent, child) for child in children])
return flat_graph
def relabel_dag(graph, root):
relabeled_graph = defaultdict(set)
all_paths = find_all_paths_in_dag(graph, root)
incoming_edge_counts = count_incoming_edges(graph)
renamed = {k: k for k in graph}
for path in all_paths:
for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
if incoming_edge_counts[dst] > 1:
renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"
relabeled_graph[renamed[src]].add(dst)
return relabeled_graph
if __name__ == "__main__":
df = pd.DataFrame(columns=["Parent", "Child"])
df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
graph = defaultdict(set)
for parent, child in zip(df["Parent"], df["Child"]):
graph[parent].add(child)
root = next(iter(find_source_nodes(graph)))
relabeled_tree = relabel_dag(graph, root)
flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
print(relabeled_df)
输出:
Parent Child
0 A B
1 A C
2 A E_A
3 B D
4 B E_BA
5 B F_BA
6 C F_CA
7 C G
8 F_BA H_FBA
9 F_CA H_FCA
10 G H_GCA
11 G I
我在 parent/child 关系中表示了以下树:
import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]
一些节点有多个parents。这应该通过根据路径为这些常见的 children 不同的 ID 来删除。这是(右树)之后的样子:
我写了一个函数,应该为每个节点写一个路径并将其添加到名称中,结果收集在字典中"res"。经过几天的尝试,这似乎是一种糟糕的方法,因为它不会拆分路径。下面是节点 H.
的示例知道如何转换树吗?
res = {}
def find_parent(child, path):
path.append(str(child))
print("Path: ", path)
parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
print("Parents: ",parents)
if not parents:
print("Path end reached!")
print("Result: ", res)
# i+1
else :
for i in range(0,len(parents)-1):
if len(parents)>1: #dann neue paths
path = [(str(child))]
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[i], path)
else:
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[0],path)
return res
节点 "H"
的示例结果find_parent("H", [])
{'path_0': 'FB'}
它应该给出 H_FBA、HFCA 和 H_GCA。
您可以使用各种典型的图形转换来执行此操作。在深入研究代码之前,我们在原始数据帧中使用带有一个源节点(即没有传入边的节点)的 directed acyclic graph。这些属性大大简化了事情,并保证我们可以创建一个唯一的树,如您所需的输出所示。
平面数据框作为图表并不容易操作,所以我采取的第一步是将其变成 adjacency list。
接下来,需要执行几个准备步骤:
- 找到源节点(虽然我使用了一个处理多个源的通用图形函数,但我们保证只有一个源,所以我们将从返回的集合中取出唯一的项目)。单独的源节点将成为新树的根。
- 运行 一个从源开始提取 DAG 中所有路径的函数。
- 创建一个将节点名称映射到传入边计数的字典,这有助于指导重命名过程。
- 创建一个重命名字典,让我们跟踪哪些节点已重命名。
构建这些结构后,迭代图中所有可能的路径以构建重新标记的图,根据您的规范格式化具有多个传入边的节点(反向路径节点列表)并将新名称存储在重命名字典中.
最后,对这个重新标记的图的扁平化版本进行排序,并将其转储到结果数据框中。
代码如下:
import pandas as pd
from collections import defaultdict
def find_source_nodes(graph):
sources = set(graph)
for neighbors in graph.values():
sources = sources - neighbors
return sources
def count_incoming_edges(graph):
counts = defaultdict(int)
for neighbors in graph.values():
for neighbor in neighbors:
counts[neighbor] += 1
return counts
def find_all_paths_in_dag(graph, src, path=[]):
path.append(src)
if src in graph and graph[src]:
for neighbor in graph[src]:
yield from find_all_paths_in_dag(graph, neighbor, path)
else:
yield path[:]
path.pop()
def flatten_adjacency_list(adj_list):
flat_graph = []
for parent, children in adj_list.items():
flat_graph.extend([(parent, child) for child in children])
return flat_graph
def relabel_dag(graph, root):
relabeled_graph = defaultdict(set)
all_paths = find_all_paths_in_dag(graph, root)
incoming_edge_counts = count_incoming_edges(graph)
renamed = {k: k for k in graph}
for path in all_paths:
for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
if incoming_edge_counts[dst] > 1:
renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"
relabeled_graph[renamed[src]].add(dst)
return relabeled_graph
if __name__ == "__main__":
df = pd.DataFrame(columns=["Parent", "Child"])
df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
graph = defaultdict(set)
for parent, child in zip(df["Parent"], df["Child"]):
graph[parent].add(child)
root = next(iter(find_source_nodes(graph)))
relabeled_tree = relabel_dag(graph, root)
flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
print(relabeled_df)
输出:
Parent Child
0 A B
1 A C
2 A E_A
3 B D
4 B E_BA
5 B F_BA
6 C F_CA
7 C G
8 F_BA H_FBA
9 F_CA H_FCA
10 G H_GCA
11 G I