Create new columns based on a tree-like pattern

Question

I have the following dataframe:

col1      col2               
basic      c                   
c          c++                 
c++        java                
ruby                                     
php                                      
java       python              
python                                   
r                                        
c#

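I want to create new columns based on the pattern that the dataframe follows.
For example, in the dataframe above, the sequence basic -> c -> c++ -> java -> python can be observed from col1 and col2.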

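Logic:

The col1 value basic has c in col2; similarly, c in col1 corresponds to c++ in col2, c++ leads to java in col2, and finally java leads to python in col2.
The remaining values in col1 have a blank in col2, so the newly created columns stay blank for them as well. (That is, we only consider the col1 values whose col2 entry is not blank.)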

So my output dataframe would be:

     col1     col2   new_col1  new_col2   new_col3   new_col4   
0   basic       c          c        c++       java     python
1       c     c++        c++       java     python         
2     c++    java       java     python                  
3    ruby                                              
4     php                                              
5    java  python     python                           
6  python                                             
7       r                                               
8      c#

Thanks!

Answer 1

This can be solved with graph theory analysis. It looks like you want to obtain all successors starting from each of the nodes in col2. For that, we first need to build a directed graph from the columns col1 and col2. We can use NetworkX for this and build an nx.DiGraph from the dataframe with nx.from_pandas_edgelist:

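import networkx as nx
import pandas as pd

# Keep only the rows where both col1 and col2 are non-empty
m = df.ne('').all(axis=1)
# Each remaining (col1, col2) pair becomes a directed edge col1 -> col2
G = nx.from_pandas_edgelist(df[m], 
                            source='col1', 
                            target='col2', 
                            create_using=nx.DiGraph())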
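To make the graph-building step reproducible, here is a minimal sketch that rebuilds the question's dataframe and prints the resulting edges. It assumes the blank cells are empty strings (not NaN), which is what the `df.ne('')` mask relies on; the variable name `edges` is only for illustration.

```python
import networkx as nx
import pandas as pd

# Assumed reconstruction of the question's data; blanks are empty strings.
df = pd.DataFrame({
    "col1": ["basic", "c", "c++", "ruby", "php", "java", "python", "r", "c#"],
    "col2": ["c", "c++", "java", "", "", "python", "", "", ""],
})

# Rows where both columns are filled in.
m = df.ne("").all(axis=1)

# One directed edge col1 -> col2 per remaining row.
G = nx.from_pandas_edgelist(df[m], source="col1", target="col2",
                            create_using=nx.DiGraph())

edges = sorted(G.edges())
print(edges)
```

If the blanks in your data are NaN instead, a mask such as `df.notna().all(axis=1)` would probably be the equivalent filter.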

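Then we can iterate over the nodes in col2 and search for all successors starting from each node. For that we can use dfs_tree, which traverses the graph with a depth-first search from the source node and so collects its successors: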

all_successors = [list(nx.dfs_tree(G, node)) for node in df.loc[m,'col2']]
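As a quick check of what this list comprehension produces, the sketch below runs it on the edges from the question. The graph is built directly from an edge list here, and `targets` is a hypothetical stand-in for `df.loc[m, 'col2']`.

```python
import networkx as nx

# Edges taken from the rows of the question where both columns are filled.
G = nx.DiGraph([("basic", "c"), ("c", "c++"), ("c++", "java"), ("java", "python")])

# Stand-in for df.loc[m, 'col2']: the col2 value of each kept row.
targets = ["c", "c++", "java", "python"]

# Each entry starts with the node itself, followed by its successors in DFS order.
all_successors = [list(nx.dfs_tree(G, node)) for node in targets]

for node, successors in zip(targets, all_successors):
    print(node, "->", successors)
```

Note that each list begins with the starting node itself, which is why the first new column repeats col2.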

Now we can assign these lists of successors back to the dataframe as new columns:

out = (df.assign(
         **pd.DataFrame(all_successors, index=df[m].index)
         .reindex(df.index)
         .fillna('')
         .add_prefix('new_col')))

print(out)

     col1    col2   new_col0   new_col1   new_col2   new_col3
0   basic       c          c        c++       java     python
1       c     c++        c++       java     python         
2     c++    java       java     python                  
3    ruby                                              
4     php                                              
5    java  python     python                           
6  python                                             
7       r                                               
8      c#
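For comparison, when every col1 value has at most one successor (as in this question), the same chains can be followed with a plain dictionary instead of a graph library. This is a different technique from the dfs_tree approach above, shown only as a sketch; `follow_chain` and `successor` are hypothetical names, and the dataframe is an assumed reconstruction of the question's data.

```python
import pandas as pd

# Assumed reconstruction of the question's data; blanks are empty strings.
df = pd.DataFrame({
    "col1": ["basic", "c", "c++", "ruby", "php", "java", "python", "r", "c#"],
    "col2": ["c", "c++", "java", "", "", "python", "", "", ""],
})

mask = df.ne("").all(axis=1)

# Map each col1 value to its single col2 successor (later duplicates would overwrite).
successor = dict(zip(df.loc[mask, "col1"], df.loc[mask, "col2"]))


def follow_chain(start, successor):
    """Return start plus everything reachable by repeated lookup, stopping on repeats."""
    chain, seen = [], set()
    node = start
    while node is not None and node not in seen:
        chain.append(node)
        seen.add(node)
        node = successor.get(node)
    return chain


chains = [follow_chain(value, successor) for value in df.loc[mask, "col2"]]

# Ragged lists become columns 0..n-1; rows outside the mask are left blank.
new_cols = (pd.DataFrame(chains, index=df.index[mask])
            .reindex(df.index)
            .fillna("")
            .add_prefix("new_col"))

out = df.join(new_cols)
print(out)
```

The `seen` set is only a guard against cyclic data; it does not try to pick between branches, so this sketch is not a replacement for the graph approach when a value can have several successors.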

To better explain this approach, consider this slightly different network, which contains an additional component:

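As mentioned, what we want is a list of successors for each node in col2. For this kind of problem there are several graph search algorithms that can explore the branches of a graph starting from a given node. Here we can use the depth-first-search based functions available in nx.algorithms.traversal. In this case we want nx.dfs_tree, which returns an oriented tree constructed by a depth-first search starting from the specified node.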

Here are some examples:

list(nx.dfs_tree(G, 'c++'))
# ['c++', 'java', 'python', 'julia']

list(nx.dfs_tree(G, 'python'))
# ['python', 'julia']

list(nx.dfs_tree(G, 'basic'))
# ['basic', 'php', 'sql']
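To try these calls yourself, the sketch below uses a hypothetical edge list that is merely consistent with the three outputs shown above. It is an assumption, not the actual network from the answer, which may contain other nodes and edges.

```python
import networkx as nx

# Hypothetical edges chosen only to match the outputs above (two separate components).
G = nx.DiGraph([
    ("c", "c++"), ("c++", "java"), ("java", "python"), ("python", "julia"),
    ("basic", "php"), ("php", "sql"),
])

from_cpp = list(nx.dfs_tree(G, "c++"))
from_python = list(nx.dfs_tree(G, "python"))
from_basic = list(nx.dfs_tree(G, "basic"))

print(from_cpp)
print(from_python)
print(from_basic)

# The search never crosses from one component into the other.
n_components = nx.number_weakly_connected_components(G)
print(n_components)
```

Because the search only follows outgoing edges, starting at basic never reaches the c/c++ chain in this assumed graph, and vice versa.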

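Note, though, that this could get quite tricky if there were cycles in the graph. Say, for instance, there is an edge between c++ and scala. In that case it becomes unclear which path should be chosen. One approach would be to traverse all the corresponding paths with nx.dfs_tree and keep the one of interest according to some predefined logic, such as keeping the longest. That does not seem to be the case in this problem, though.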
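As one possible way to apply a "keep the longest" rule, the sketch below adds the c++ -> scala edge to the question's graph and picks the longest path from a node to any leaf. It uses nx.all_simple_paths rather than dfs_tree, assumes the graph is acyclic, and `longest_path_from` is a hypothetical helper, not part of NetworkX.

```python
import networkx as nx

# The question's edges plus the extra c++ -> scala branch discussed above.
G = nx.DiGraph([
    ("basic", "c"), ("c", "c++"), ("c++", "java"), ("java", "python"),
    ("c++", "scala"),
])


def longest_path_from(G, source):
    """Longest simple path from source to a leaf; assumes G has no cycles."""
    # Leaves are reachable nodes with no outgoing edges.
    leaves = [n for n in nx.descendants(G, source) if G.out_degree(n) == 0]
    paths = [path for leaf in leaves
             for path in nx.all_simple_paths(G, source, leaf)]
    # A node without successors yields just itself.
    return max(paths, key=len, default=[source])


# dfs_tree merges both branches into one node list...
reachable = set(nx.dfs_tree(G, "c++"))
# ...while the helper keeps a single chain.
longest = longest_path_from(G, "c++")

print(reachable)
print(longest)
```

When two branches have the same length, `max` keeps whichever it meets first, so a tie-breaking rule would still need to be decided for real data.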
