Find average hours for each id in Hive

My dataset looks like this:

 Id      working_hour
1005    2019-10-23 08:35:00
1006    2019-10-23 00:54:59
1007    2019-10-23 00:24:57
1008    2019-10-23 06:40:00
1009    2019-10-23 03:50:00
1010    2019-10-23 03:25:01
1005    2019-10-24 05:25:00
1006    2019-10-24 01:39:59
1007    2019-10-24 02:30:00
1008    2019-10-24 09:45:01
1010    2019-10-24 07:00:00

This dataset covers two days (23/10/2019 and 24/10/2019). I want to find the average working time for each Id, in hours or in minutes.

For example:

 Id    in_hours  in_mins
1005      7       420     # (08:35+05:25)/2 = 7 hours
1006    1.29    77.48     # (00:54:59+01:39:59)/2 = 1.29 hours

Use window functions. LEAD and LAG are especially helpful for this use case. I have not executed this SQL, but the concept is there.

SELECT id, working_hour, nextWH
FROM (
  SELECT id, working_hour, LEAD(working_hour) OVER (PARTITION BY id ORDER BY working_hour) AS nextWH
  FROM tableA
) t;

This will produce data like the following.
id | working_hour | nextWH

1005|2019-10-23 08:35:00|2019-10-24 05:25:00

1005|2019-10-24 05:25:00|null

Then filter out the records where nextWH is null, and use whichever date-time functions you prefer to compute the difference between working_hour and nextWH.
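The filter and difference steps described above might be sketched as follows. This is only a sketch: it assumes the table is called tableA as in the query above and that working_hour is a TIMESTAMP (or a string in yyyy-MM-dd HH:mm:ss format). The alias diff_hours is a name made up for this example.

```sql
-- Sketch only: tableA and nextWH come from the answer; diff_hours is a hypothetical alias.
SELECT id,
       working_hour,
       nextWH,
       -- unix_timestamp() returns seconds since the epoch, so the difference is in seconds
       (unix_timestamp(nextWH) - unix_timestamp(working_hour)) / 3600 AS diff_hours
FROM (
  SELECT id,
         working_hour,
         LEAD(working_hour) OVER (PARTITION BY id ORDER BY working_hour) AS nextWH
  FROM tableA
) t
WHERE nextWH IS NOT NULL;  -- drop the last row of each id, which has no next value
```

For id 1005 the two timestamps are 20 hours 50 minutes apart, so diff_hours would be about 20.83. That is the gap between the two records, which is a different quantity from the average time of day.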

Here is a link to the window function documentation.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-LEADusingdefault1rowleadandnotspecifyingdefaultvalue

You can try the following approach.

Your data, with working_hour as a timestamp:

+------------------+----------------------------+--+
| working_hour.id  | working_hour.working_hour  |
+------------------+----------------------------+--+
| 1005             | 2019-10-23 08:35:00.0      |
| 1006             | 2019-10-23 00:54:59.0      |
| 1007             | 2019-10-23 00:24:57.0      |
| 1008             | 2019-10-23 06:40:00.0      |
| 1009             | 2019-10-23 03:50:00.0      |
| 1010             | 2019-10-23 03:25:01.0      |
| 1005             | 2019-10-24 05:25:00.0      |
| 1006             | 2019-10-24 01:39:59.0      |
| 1007             | 2019-10-24 02:30:00.0      |
| 1008             | 2019-10-24 09:45:01.0      |
| 1009             | 2019-10-24 02:10:00.0      |
| 1010             | 2019-10-24 07:00:00.0      |
+------------------+----------------------------+--+

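Use the LEAD window function to fetch the next timestamp for each id, convert both timestamps to seconds with unix_timestamp, take the difference, and convert the seconds to minutes and hours. Note that this yields half the elapsed time between the two timestamps, not the mean of the two times of day shown in the question.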

WITH t AS(
SELECT id, working_hour, LEAD(working_hour) OVER(PARTITION BY id ORDER BY working_hour) AS nextDay
FROM working_hour
) SELECT id, working_hour, nextDay, 
         ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 2, 2) AS in_secs, --AVG in seconds
         ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 60 / 2,2) AS in_mins, --AVG in minutes
         ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 60 / 60 / 2,2) AS in_hours --AVG in hours
FROM t
WHERE nextDay IS NOT NULL;

And the output:

+-------+------------------------+------------------------+----------+----------+-----------+--+
|  id   |      working_hour      |        nextday         | in_secs  | in_mins  | in_hours  |
+-------+------------------------+------------------------+----------+----------+-----------+--+
| 1005  | 2019-10-23 08:35:00.0  | 2019-10-24 05:25:00.0  | 37500.0  | 625.0    | 10.42     |
| 1006  | 2019-10-23 00:54:59.0  | 2019-10-24 01:39:59.0  | 44550.0  | 742.5    | 12.38     |
| 1007  | 2019-10-23 00:24:57.0  | 2019-10-24 02:30:00.0  | 46951.5  | 782.53   | 13.04     |
| 1008  | 2019-10-23 06:40:00.0  | 2019-10-24 09:45:01.0  | 48750.5  | 812.51   | 13.54     |
| 1009  | 2019-10-23 03:50:00.0  | 2019-10-24 02:10:00.0  | 40200.0  | 670.0    | 11.17     |
| 1010  | 2019-10-23 03:25:01.0  | 2019-10-24 07:00:00.0  | 49649.5  | 827.49   | 13.79     |
+-------+------------------------+------------------------+----------+----------+-----------+--+

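You can also take this approach, which averages the hour and minute parts of the two timestamps (seconds are ignored):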

WITH t AS(
SELECT id, working_hour, LEAD(working_hour) OVER(PARTITION BY id ORDER BY working_hour) AS nextDay
FROM working_hour
) SELECT id, working_hour, nextDay, 
          ROUND( ((hour(nextDay) * 60 + minute(nextDay) + hour(working_hour) * 60 + minute(working_hour)) / 60 / 2),2) AS in_hours,
          ROUND( ((hour(nextDay) * 60 + minute(nextDay) + hour(working_hour) * 60 + minute(working_hour)) / 2),2) AS in_mins
FROM t
WHERE nextDay IS NOT NULL;

Output:

+-------+------------------------+------------------------+-----------+----------+--+
|  id   |      working_hour      |        nextday         | in_hours  | in_mins  |
+-------+------------------------+------------------------+-----------+----------+--+
| 1005  | 2019-10-23 08:35:00.0  | 2019-10-24 05:25:00.0  | 7.0       | 420.0    |
| 1006  | 2019-10-23 00:54:59.0  | 2019-10-24 01:39:59.0  | 1.28      | 76.5     |
| 1007  | 2019-10-23 00:24:57.0  | 2019-10-24 02:30:00.0  | 1.45      | 87.0     |
| 1008  | 2019-10-23 06:40:00.0  | 2019-10-24 09:45:01.0  | 8.21      | 492.5    |
| 1009  | 2019-10-23 03:50:00.0  | 2019-10-24 02:10:00.0  | 3.0       | 180.0    |
| 1010  | 2019-10-23 03:25:01.0  | 2019-10-24 07:00:00.0  | 5.21      | 312.5    |
+-------+------------------------+------------------------+-----------+----------+--+
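Both LEAD-based queries above rely on each id having exactly two rows. As a possible alternative (a swapped-in technique, not part of the queries above), the average time of day could also be computed with a plain GROUP BY, which keeps the seconds and works for any number of days. The sketch reuses the working_hour table and column shown above and assumes the column is a TIMESTAMP; the CTE name s and the alias secs_of_day are introduced here for illustration.

```sql
-- Sketch only: assumes the working_hour table shown above, with working_hour as a TIMESTAMP.
-- s and secs_of_day are hypothetical names introduced for this example.
WITH s AS (
  SELECT id,
         -- seconds elapsed since midnight for each record
         hour(working_hour) * 3600 + minute(working_hour) * 60 + second(working_hour) AS secs_of_day
  FROM working_hour
)
SELECT id,
       ROUND(AVG(secs_of_day) / 3600, 2) AS in_hours,  -- average time of day in hours
       ROUND(AVG(secs_of_day) / 60, 2)   AS in_mins    -- average time of day in minutes
FROM s
GROUP BY id;
```

For id 1005 this works out to (08:35:00 + 05:25:00) / 2 = 7 hours, or 420 minutes. For id 1006 it is (3299 + 5999) / 2 = 4649 seconds, which is about 77.48 minutes or 1.29 hours, in line with the expected output in the question.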

Hope this helps.

I am keeping this as simple as possible. It works fine for me.

SELECT user_name,
       from_unixtime(CAST(AVG(unix_timestamp(substr(working_hours, 12), "HH:mm:ss")) AS BIGINT), "HH:mm:ss") AS avg_hours
FROM workinglogs1
GROUP BY user_name
ORDER BY avg_hours;

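Here I use substr(working_hours,12) to take only the HH:mm:ss part of the working hours, then get the Unix timestamp of that time with unix_timestamp. After that I take the average and use from_unixtime to convert it back to an HH:mm:ss time.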