How to obtain the most frequent routes from historical data?
I have the following data in Hive 1.2.1 (my real dataset is much larger, but the data structure is similar):
id radar_id car_id datetime
1 A21 123 2017-03-08 17:31:19.0
2 A21 555 2017-03-08 17:32:00.0
3 A21 777 2017-03-08 17:33:00.0
4 B15 123 2017-03-08 17:35:22.0
5 B15 555 2017-03-08 17:34:05.0
6 B15 777 2017-03-08 20:50:12.0
7 C09 777 2017-03-08 20:55:00.0
8 A21 123 2017-03-09 11:00:00.0
9 C11 664 2017-03-09 11:10:00.0
10 A21 123 2017-03-09 11:12:00.0
11 A21 555 2017-03-09 11:12:10.0
12 B15 123 2017-03-09 11:14:00.0
13 B15 555 2017-03-09 11:20:00.0
14 A21 444 2017-03-09 10:00:00.0
15 C09 444 2017-03-09 10:20:00.0
16 B15 444 2017-03-09 10:05:00.0
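I want to get the top 2 most frequently used routes. A route is a sequence of radar_id values ordered by datetime. I would like to get the following result: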
route frequency
A21->B15 2
A21->B15->C09 1
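The frequency is the average number of times per day that vehicles (not necessarily unique; car_id does not need to be considered) travel a route.
For the route A21->B15 the frequency is 2, because there were 3 rides on 2017-03-08 and 1 ride on 2017-03-09. It is important that car 123 traveled the route A21->A21->B15 on 2017-03-09, which is different from A21->B15. So I want to consider the route from the first radar to the last radar that captured the car during the day.
A ride that starts at 23:55 and ends at 00:22 should be treated as two different routes.
How can I do this with Hive 1.2.1?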
Update:
As suggested in the answer, I tested this query, but route does not contain ->. The values of route look like 000021 or 0450001, and so on.
df = sqlContext.sql("select regexp_replace(route,'(?<=^|->)\d{5}','') as route " +
",count(*) / min(days) as frequency " +
"from (select concat_ws('->',sort_array(collect_list(radarids))) as route " +
",count(distinct dt) over() as days " +
"from (select car_id " +
",to_date(datetime) as dt " +
",concat(printf('%05d',row_number() over " +
"(partition by car_id,to_date(datetime) " +
"order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
"from mytable " +
") t " +
"group by car_id " +
",dt " +
") t " +
"group by route " +
"order by frequency desc " +
"limit 5")
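According to the documentation, Hive does not support recursive CTEs. Fortunately it does support subqueries, the GROUP BY clause, the row_number analytic function, the trunc(string date, string format) function, the concat function and the LIMIT x clause.
I don't have access to Hive, but I can show how to build such a query in PostgreSQL. There are only minor differences between the two, so I believe you will manage to rewrite it. I think the only thing to replace is the date_trunc('day', datetim ) function in Postgres with trunc(datetim , 'DD') in Hive.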
SELECT route, avg( cnt ) as average
FROM (
    SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
           count(*) as cnt
    FROM (
        SELECT date_trunc('day', datetim ) As datetim, car_id,
               max( case when rn = 1 then radar_id end ) as route1,
               max( case when rn = 2 then radar_id end ) as route2,
               max( case when rn = 3 then radar_id end ) as route3,
               max( case when rn = 4 then radar_id end ) as route4
               /* max( case when rn = 5 then radar_id end ) as route5
               ......
               max( case when rn = N then radar_id end ) as routeN */
        FROM (
            select t.*,
                   row_number() over (
                       partition by date_trunc('day', datetim ),car_id
                       order by datetim
                   ) as rn
            from table111 t
        ) x
        GROUP BY date_trunc('day', datetim ), car_id
    ) x
    group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;
Demo: http://sqlfiddle.com/#!15/53c7e/27
| route | average |
|--------------|---------|
| A21>B15>> | 3 |
| A21>B15>C09> | 2 |
select   regexp_replace(route,'(?<=^|->)\d{5}','') as route
        ,count(*) / min(days) as frequency
from    (select   concat_ws('->',sort_array(collect_list(radar_ids))) as route
                 ,count(distinct dt) over() as days
         from    (select   car_id
                          ,to_date(datetime) as dt
                          ,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
                  from     mytable
                 ) t
         group by car_id
                 ,dt
        ) t
group by route
order by frequency desc
limit    2
;
+---------------+-----------+
| route | frequency |
+---------------+-----------+
| A21->B15 | 1.5 |
+---------------+-----------+
| A21->B15->C09 | 1.0 |
+---------------+-----------+
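To see why the query above prefixes each radar_id with a zero-padded row number, here is a minimal Python sketch of the same steps run on the sample data from the question. It is only an illustration, not Hive code: the function name top_routes and the in-memory rows list are made up for this example, and the regex uses a capture group because Python's re module does not accept the variable-width lookbehind used in the Hive query.

```python
import re
from collections import Counter, defaultdict

# Sample rows from the question: (radar_id, car_id, datetime)
rows = [
    ("A21", 123, "2017-03-08 17:31:19.0"), ("A21", 555, "2017-03-08 17:32:00.0"),
    ("A21", 777, "2017-03-08 17:33:00.0"), ("B15", 123, "2017-03-08 17:35:22.0"),
    ("B15", 555, "2017-03-08 17:34:05.0"), ("B15", 777, "2017-03-08 20:50:12.0"),
    ("C09", 777, "2017-03-08 20:55:00.0"), ("A21", 123, "2017-03-09 11:00:00.0"),
    ("C11", 664, "2017-03-09 11:10:00.0"), ("A21", 123, "2017-03-09 11:12:00.0"),
    ("A21", 555, "2017-03-09 11:12:10.0"), ("B15", 123, "2017-03-09 11:14:00.0"),
    ("B15", 555, "2017-03-09 11:20:00.0"), ("A21", 444, "2017-03-09 10:00:00.0"),
    ("C09", 444, "2017-03-09 10:20:00.0"), ("B15", 444, "2017-03-09 10:05:00.0"),
]


def top_routes(rows, n=2):
    """Hypothetical helper mirroring the steps of the Hive query."""
    # group by car_id, to_date(datetime)
    groups = defaultdict(list)
    for radar_id, car_id, ts in rows:
        groups[(car_id, ts[:10])].append((ts, radar_id))

    # count(distinct dt) over (): number of distinct days in the data
    days = len({day for _, day in groups})

    counts = Counter()
    for sightings in groups.values():
        # row_number() ... order by datetime (same-format timestamps sort as strings)
        sightings.sort()
        # concat(printf('%05d', rn), radar_id): the prefix carries the order
        tagged = [f"{rn:05d}{radar}" for rn, (_, radar) in enumerate(sightings, 1)]
        # collect_list gives no ordering guarantee, so sort_array on the
        # prefixed values restores it before concat_ws('->', ...)
        route = "->".join(sorted(tagged))
        # strip the 5-digit prefixes again (the regexp_replace step)
        route = re.sub(r"(^|->)\d{5}", r"\1", route)
        counts[route] += 1

    # count(*) / days, order by frequency desc, limit n
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(route, cnt / days) for route, cnt in ranked[:n]]


for route, freq in top_routes(rows):
    print(route, freq)
```

On the sample data this yields A21->B15 with 1.5 and A21->B15->C09 with 1.0, the same values as the Hive output above. Car 123 on 2017-03-09 ends up as A21->A21->B15, so it is not counted towards A21->B15.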