Concatenate different records in PIG script
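I am trying to write a script in PIG. I have a dataset containing user ID, date, country code, and other attributes.
I want to group the records by user ID and date and, for each such group, concatenate the country codes into a single field.
For example: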
user_id | date | country_code
1 2017-01-01 US
1 2017-01-01 UK
1 2017-01-02 FR
2 2017-01-02 RU
2 2017-01-03 DE
2 2017-01-03 AU
The output I want:
(1, 2017-01-01, "US,UK")
(1, 2017-01-02, FR)
(2, 2017-01-02, RU)
(2, 2017-01-03, "DE,AU")
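The question behind the answer that @Hari Shankar linked is actually worded very differently, and since this question does not appear to be a duplicate, I will post the answer directly here: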
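-- Collect all rows of each user into one group
grouped = GROUP table BY userid;
-- Project each column of the group into its own bag
X = FOREACH grouped GENERATE group as userid,
                             table.clickcount as clicksbag,
                             table.pagenumber as pagenumberbag;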
Now X will be:
{(155,{(2),(3),(1)},{(12),(133),(144)}),
 (156,{(6),(7)},{(1),(5)})}
Now you need to use the built-in UDF BagToTuple [1]:
output = FOREACH X GENERATE userid,
                            BagToTuple(clicksbag) as clickcounts,
                            BagToTuple(pagenumberbag) as pagenumbers;
output should now contain what you want. You can also fold the output step into the FOREACH that follows the GROUP:
output = FOREACH grouped GENERATE group as userid,
                                  BagToTuple(table.clickcount) as clickcounts,
                                  BagToTuple(table.pagenumber) as pagenumbers;
[1]: http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/builtin/BagToTuple.html
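Applied to the data in the question, the same idea might look like the sketch below. It groups on both user ID and date, and it uses the built-in BagToString instead of BagToTuple so that the country codes end up in a single comma-separated chararray. The input path users.tsv, the tab delimiter, and the alias and field names (data, dt, result, country_codes) are assumptions made for illustration. Here dt stands in for the date column, and result is used as the alias because output is on Pig's reserved keyword list.

```pig
-- Hypothetical input: a tab-separated file with the three columns from the question
data = LOAD 'users.tsv' USING PigStorage('\t')
       AS (user_id:int, dt:chararray, country_code:chararray);

-- Group on the composite key (user_id, dt)
grouped = GROUP data BY (user_id, dt);

-- FLATTEN turns the group tuple back into two top-level fields;
-- BagToString joins the bag of country codes with ',' as the delimiter
result = FOREACH grouped GENERATE
             FLATTEN(group) AS (user_id, dt),
             BagToString(data.country_code, ',') AS country_codes;

DUMP result;
```

Pig does not guarantee the order of tuples inside a bag, so a group may come out as US,UK or UK,US. If the order matters, sorting the bag in a nested FOREACH before calling BagToString is one option.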