How to flatten a recursive hierarchy using Hive/Pig/MapReduce
I have unbalanced tree data stored in tabular format, for example:
parent,child
a,b
b,c
c,d
c,f
f,g
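The depth of the tree is unknown.
How can I flatten this hierarchy so that each row contains the full path from a leaf node to the root node, like this: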
leaf node, root node, intermediate nodes
d,a,d:c:b
g,a,g:f:c:b
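To pin down the expected output, here is a minimal single-machine Python sketch of the transformation. It is only a reference for the result, not a Hive/Pig/MapReduce solution. It assumes that every node has at most one parent, that there are no cycles, and that the third column is the colon-joined path from the leaf up to, but excluding, the root (as in the `d` row). The name `flatten_paths` is made up for illustration.

```python
def flatten_paths(edges):
    """Return (leaf, root, path) rows for a list of (parent, child) edges.

    Assumes each node has at most one parent and the data has no cycles.
    """
    parent_of = {child: parent for parent, child in edges}
    parents = {parent for parent, _ in edges}
    rows = []
    # A leaf is a node that appears as a child but never as a parent.
    for leaf in sorted(set(parent_of) - parents):
        path = [leaf]
        node = parent_of[leaf]
        # Walk upwards; the root is the first node without a parent.
        while node in parent_of:
            path.append(node)
            node = parent_of[node]
        rows.append((leaf, node, ":".join(path)))
    return rows


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "f"), ("f", "g")]
for row in flatten_paths(edges):
    print(",".join(row))
```

For the sample data the only leaves are `d` and `g`, so the sketch prints one row for each of them.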
Any suggestions for solving this with Hive, Pig, or MapReduce? Thanks in advance.
I tried to solve it with Pig; here is the sample code:
Join function:
-- Join parent and child: each call lifts every unfinished row one level up
-- ('output' is a Pig reserved keyword, so the returned alias is named 'remaining')
define join_hierarchy (leftA, source, result) returns remaining {
joined= join $leftA by parent left, $source by child;
-- no match on the right side means the current parent is the root
tmp_filtered= filter joined by $source::parent is null;
part= foreach tmp_filtered generate $leftA::child as child, $leftA::path as path;
$result= union part, $result;
part_remaining= filter joined by $source::parent is not null;
$remaining= foreach part_remaining generate $leftA::child as child, $source::parent as parent, CONCAT(CONCAT($source::parent,':'),$leftA::path) as path;
};
Loading the dataset:
--My dataset field delimiter is ','.
source= load '*****' using PigStorage(',') as (parent:chararray, child:chararray);
--create an additional column for the path
leftA= foreach source generate child, parent, CONCAT(parent,':') as path;
--initially the result table will be blank.
result= limit leftA 1;
result= foreach result generate '' as child, '' as path;
--Flatten the hierarchy to 4 levels. Repeat the line below once per level of hierarchy depth.
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
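The four repeated calls above amount to a loop that should stop once no unfinished rows remain, which is what an unknown depth requires. The Python sketch below mimics that loop on a single machine to show the intended logic. It is not Pig code, and it assumes at most one parent per node and no cycles. The names `flatten_by_rounds`, `pending`, and `remaining` are hypothetical. As far as I know, Pig Latin itself has no loop construct, so a driver like this would have to live outside the script.

```python
def flatten_by_rounds(edges):
    """Mimic the repeated self-join: each round lifts unfinished rows one level up.

    Returns the finished (child, path) rows and the number of rounds needed.
    """
    # Plays the role of `source` in the join: child -> parent lookup.
    parent_of = {child: parent for parent, child in edges}
    # (child, current ancestor, path so far), like the initial `leftA`.
    pending = [(child, parent, parent + ":") for parent, child in edges]
    result = []
    rounds = 0
    while pending:  # stop when every row has reached the root
        rounds += 1
        remaining = []
        for child, ancestor, path in pending:
            grandparent = parent_of.get(ancestor)  # left join on ancestor = child
            if grandparent is None:
                # No match: the current ancestor is the root, so the row is done.
                result.append((child, path))
            else:
                remaining.append((child, grandparent, grandparent + ":" + path))
        pending = remaining
    return sorted(result), rounds


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "f"), ("f", "g")]
rows, rounds = flatten_by_rounds(edges)
for child, path in rows:
    print(child, path)
print("rounds:", rounds)
```

On the sample data the loop finishes after 4 rounds, matching the four macro calls, and it produces a root-to-parent path for every node, not only for the leaves.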