Pig: Count frequency of multiple columns

I want to count the frequency of a combination of 2 fields in Pig:

------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);

In the code above, I want to count how often each (user_id, tweet_created_at) combination occurs.

The line a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user; fails with the error: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"

I also tried it without the parentheses: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;

That gives me a different error:

Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................

Is this a syntax error, or is something wrong with my data? If it is a syntax error, what is the correct way to write it?

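In short: I want to count the number of replies a user had given as of each tweet. (If he posted 2 tweets on the same day, his reply count might be 10 at the time of the first tweet and 15 at the time of the second.) I think that if I don't group by tweet_created_at, the reply count will always be a constant, which is wrong.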

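Use FLATTEN on group to unnest the tuple into fields. When you group by more than one field, group is a single tuple-valued field, so it cannot be renamed directly to two separate fields: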

a2 = FOREACH a1 GENERATE FLATTEN(group) AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
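
For reference, here is a sketch of how the whole pipeline from the question might look with that fix applied. It assumes y1 is already loaded and has user_id and tweet_created_at fields, as in the question. The a3 step is dropped because a2 already has flat fields once group is flattened, and the a2_alt relation is only an illustrative name for an equivalent projection without FLATTEN.

```pig
-- Assumes y1 is already loaded with user_id and tweet_created_at fields.
a1 = GROUP y1 BY (user_id, tweet_created_at);

-- group is a tuple (user_id, tweet_created_at); FLATTEN turns it into two fields.
a2 = FOREACH a1 GENERATE
         FLATTEN(group) AS (user_id, tweet_created_at),
         COUNT(y1) AS number_of_replies_by_user;

-- Equivalent alternative (a2_alt is a hypothetical name): project each tuple member by name.
a2_alt = FOREACH a1 GENERATE
             group.user_id AS user_id,
             group.tweet_created_at AS tweet_created_at,
             COUNT(y1) AS number_of_replies_by_user;

-- Attach the count back onto every original row.
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER,
          a2 BY (user_id, tweet_created_at);
```

After the join, fields that exist in both relations are disambiguated with the relation name, for example y1::user_id and a2::number_of_replies_by_user.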