Pig: Count frequency of multiple columns
I want to count the frequency of a combination of 2 fields in Pig:
------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);
In the above, I want to count the frequency of each (user_id, tweet_created_at) combination.
The line a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
gives the error: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"
I also tried it without the parentheses: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;
That gives a different error:
Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................
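Is this a syntax error, or is something wrong with my data? If it is a syntax error, what is the correct way to write it?
In short: I want to count the number of replies a user had given as of the time each tweet was posted. (If he posted 2 tweets on the same day, his reply count might be 10 at the first tweet and 15 at the second.) I think that if I don't group by tweet_created_at, the reply count will always be a constant, which is wrong.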
Use FLATTEN on group to unnest the tuple into separate fields:
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
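To spell out why this works: after grouping by two fields, group is a single field whose type is a tuple, so AS (user_id, tweet_created_at) supplies two names for one field and the schemas do not match. FLATTEN turns the tuple into two top-level fields, which the two names can then describe. Below is a minimal sketch of the whole pipeline under that fix. The LOAD line, the file name 'replies.tsv', and the chararray types are assumptions made only so the script is self-contained; the original post does not show how y1 is loaded.

```pig
-- Hypothetical input: the path and the declared types are assumptions, not from the original post.
y1 = LOAD 'replies.tsv' USING PigStorage('\t')
     AS (user_id:chararray, tweet_created_at:chararray);

-- Grouping by two fields produces a tuple-valued field named "group".
a1 = GROUP y1 BY (user_id, tweet_created_at);

-- FLATTEN(group) unnests the tuple into two top-level fields, so two names fit.
a2 = FOREACH a1 GENERATE
         FLATTEN(group) AS (user_id, tweet_created_at),
         COUNT(y1) AS number_of_replies_by_user;

-- Equivalent alternative: project each tuple member by name.
a2_alt = FOREACH a1 GENERATE
         group.user_id AS user_id,
         group.tweet_created_at AS tweet_created_at,
         COUNT(y1) AS number_of_replies_by_user;

-- Inspect the resulting schema before joining.
DESCRIBE a2;

-- Attach the count back onto every original row.
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER,
          a2 BY (user_id, tweet_created_at);
```

With the flattened a2, the intermediate a3 projection from the question is no longer needed, since a2 already has the three fields under the desired names.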