How to obtain the most frequent routes from historical data?

Question

I have the following data in Hive 1.2.1 (my real dataset is much bigger, but the data structure is the same):

id    radar_id     car_id     datetime
1     A21          123        2017-03-08 17:31:19.0
2     A21          555        2017-03-08 17:32:00.0
3     A21          777        2017-03-08 17:33:00.0
4     B15          123        2017-03-08 17:35:22.0
5     B15          555        2017-03-08 17:34:05.0
6     B15          777        2017-03-08 20:50:12.0
7     C09          777        2017-03-08 20:55:00.0
8     A21          123        2017-03-09 11:00:00.0
9     C11          664        2017-03-09 11:10:00.0
10    A21          123        2017-03-09 11:12:00.0
11    A21          555        2017-03-09 11:12:10.0
12    B15          123        2017-03-09 11:14:00.0
13    B15          555        2017-03-09 11:20:00.0
14    A21          444        2017-03-09 10:00:00.0
15    C09          444        2017-03-09 10:20:00.0
16    B15          444        2017-03-09 10:05:00.0

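I want to get the top 2 most frequently used routes. A route is the sequence of radar_id values ordered by datetime. I would like to get the following result: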

route          frequency
A21->B15       2
A21->B15->C09  1

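The frequency is the average number of times per day that cars (non-unique, there is no need to consider car_id) passed a route. For the route A21->B15 the frequency is 2, because there were 3 rides on 2017-03-08 and 1 ride on 2017-03-09. It is important that car 123 took the route A21->A21->B15 on 2017-03-09. This is not the same as A21->B15. So I want to consider the route from the very first radar to the very last radar that captured the car during a day.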

If a ride starts at 23:55 and ends at 00:22, it should be treated as two different routes.

How can I do this using Hive 1.2.1?

Update:

As suggested in the answer, I tested this query, but route does not contain ->. The values of route look like 000021 or 0450001, and so on.

df = sqlContext.sql("select      regexp_replace(route,'(?<=^|->)\d{5}','')  as route " +
                                      ",count(*) / min(days)                        as frequency " +

                           "from       (select      concat_ws('->',sort_array(collect_list(radarids))) as route " +
                                                  ",count(distinct dt) over()                           as days " +
                                       "from       (select  car_id " +
                                                  ",to_date(datetime)   as dt " +
                                                  ",concat(printf('%05d',row_number() over " +
                                                  "(partition by car_id,to_date(datetime) " +
                                                  "order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
                                                  "from    mytable " +
                                                  ") t " +
                                       "group by    car_id " +
                                      ",dt " +
                                      ") t " +
                           "group by    route " +      
                           "order by    frequency desc " +
                           "limit       5")

Answer 1

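According to the documentation, Hive does not support recursive CTEs. Fortunately, it does support subqueries, the GROUP BY clause, the row_number analytic function, the trunc(string date, string format) function, the concat function and the LIMIT x clause.
I don't have access to Hive, but I can show how to build such a query on PostgreSQL. There are only minor differences between the two, so I believe you will manage to rewrite it. I think the only thing that has to be replaced is the date_trunc('day', datetim) function in PostgreSQL, which would become trunc(datetim, 'DD') in Hive.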

SELECT route, avg( cnt ) as average
FROM (
        SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
               count(*) as cnt
        FROM (
                SELECT date_trunc('day', datetim ) As datetim, car_id,
                    max( case when rn = 1 then radar_id end ) as route1,
                    max( case when rn = 2 then radar_id end ) as route2,
                    max( case when rn = 3 then radar_id end ) as route3,
                    max( case when rn = 4 then radar_id end ) as route4
                /*  max( case when rn = 5 then radar_id end ) as route5
                    ......
                    max( case when rn = N then radar_id end ) as routeN */
                FROM (
                    select t.*,
                           row_number() over (
                               partition by date_trunc('day', datetim ),car_id 
                               order by datetim 
                           ) as rn
                    from table111 t
                ) x
                GROUP BY date_trunc('day', datetim ), car_id
        ) x
        group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;

Demo: http://sqlfiddle.com/#!15/53c7e/27

|        route | average |
|--------------|---------|
|    A21>B15>> |       3 |
| A21>B15>C09> |       2 |
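
The trailing > characters in the result appear because PostgreSQL's concat skips the NULL route3/route4 values while the literal '>' separators stay in place. A minimal Python sketch of the fixed-slot pivot shows the effect, and one way around it is to join only the filled slots. The helper name pivot_route is hypothetical, and the slot count of 4 simply mirrors the query above.

```python
def pivot_route(radars, slots=4):
    """Place the ordered radar ids of one car/day into fixed slots (route1..routeN)."""
    # Mirrors max(case when rn = k then radar_id end); stops beyond `slots` are dropped.
    cols = [radars[k] if k < len(radars) else None for k in range(slots)]
    # Same shape as concat(route1, '>', ..., route4): NULLs vanish, separators stay.
    as_in_query = ">".join(c or "" for c in cols)
    # Joining only the filled slots avoids the trailing separators.
    trimmed = ">".join(c for c in cols if c is not None)
    return as_in_query, trimmed


print(pivot_route(["A21", "B15"]))         # ('A21>B15>>', 'A21>B15')
print(pivot_route(["A21", "B15", "C09"]))  # ('A21>B15>C09>', 'A21>B15>C09')
```

Note that any route longer than the number of slots is silently truncated, so N has to be at least the longest daily route in the data.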

Answer 2

select      regexp_replace(route,'(?<=^|->)\d{5}','')  as route
           ,count(*) / min(days)                        as frequency

from       (select      concat_ws('->',sort_array(collect_list(radar_ids))) as route
                       ,count(distinct dt) over()                           as days
            from       (select  car_id
                               ,to_date(datetime)   as dt
                               ,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
                        from    mytable
                        ) t
            group by    car_id
                       ,dt
            ) t
group by    route          
order by    frequency desc
limit       2 
;

+---------------+-----------+
| route         | frequency |
+---------------+-----------+
| A21->B15      | 1.5       |
+---------------+-----------+
| A21->B15->C09 | 1.0       |
+---------------+-----------+
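
For anyone who wants to cross-check the logic without a Hive session, here is a rough Python sketch of the same steps on the sample rows from the question: number the readings per car and day, prefix each radar id with a zero-padded sequence number, sort, join with '->', strip the prefixes, and divide the ride count by the number of distinct days. The names ROWS and route_frequencies are made up for this illustration. The regex is written as (^|->)\d{5} with a back-reference, because Python's re module rejects the variable-width lookbehind used in the query.

```python
import re
from collections import Counter, defaultdict

# Sample rows from the question: (radar_id, car_id, datetime)
ROWS = [
    ("A21", 123, "2017-03-08 17:31:19.0"), ("A21", 555, "2017-03-08 17:32:00.0"),
    ("A21", 777, "2017-03-08 17:33:00.0"), ("B15", 123, "2017-03-08 17:35:22.0"),
    ("B15", 555, "2017-03-08 17:34:05.0"), ("B15", 777, "2017-03-08 20:50:12.0"),
    ("C09", 777, "2017-03-08 20:55:00.0"), ("A21", 123, "2017-03-09 11:00:00.0"),
    ("C11", 664, "2017-03-09 11:10:00.0"), ("A21", 123, "2017-03-09 11:12:00.0"),
    ("A21", 555, "2017-03-09 11:12:10.0"), ("B15", 123, "2017-03-09 11:14:00.0"),
    ("B15", 555, "2017-03-09 11:20:00.0"), ("A21", 444, "2017-03-09 10:00:00.0"),
    ("C09", 444, "2017-03-09 10:20:00.0"), ("B15", 444, "2017-03-09 10:05:00.0"),
]


def route_frequencies(rows):
    """Return {route: rides / number_of_distinct_days}."""
    per_car_day = defaultdict(list)
    for radar_id, car_id, ts in rows:
        day = ts[:10]  # plays the role of to_date(datetime)
        per_car_day[(car_id, day)].append((ts, radar_id))

    routes = Counter()
    for readings in per_car_day.values():
        readings.sort()  # timestamps in this format sort chronologically as strings
        # Like concat(printf('%05d', row_number() over (...)), radar_id)
        tagged = ["%05d%s" % (i, radar) for i, (_, radar) in enumerate(readings, 1)]
        joined = "->".join(sorted(tagged))  # sort_array + concat_ws
        route = re.sub(r"(^|->)\d{5}", r"\1", joined)  # drop the sort prefixes
        routes[route] += 1

    days = len({day for _, day in per_car_day})  # count(distinct dt) over()
    return {route: n / days for route, n in routes.items()}


freq = route_frequencies(ROWS)
top2 = sorted(freq.items(), key=lambda kv: -kv[1])[:2]
print(top2)  # [('A21->B15', 1.5), ('A21->B15->C09', 1.0)]
```

Because the grouping key is the calendar date, a ride that crosses midnight is split into two routes, which matches the requirement in the question.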
