Find average hours for each id in Hive
My dataset looks like this:
Id working_hour
1005 2019-10-23 08:35:00
1006 2019-10-23 00:54:59
1007 2019-10-23 00:24:57
1008 2019-10-23 06:40:00
1009 2019-10-23 03:50:00
1010 2019-10-23 03:25:01
1005 2019-10-24 05:25:00
1006 2019-10-24 01:39:59
1007 2019-10-24 02:30:00
1008 2019-10-24 09:45:01
1010 2019-10-24 07:00:00
This is a dataset for two days (23/10/2019 and 24/10/2019). I want to find the average working hours for each Id, in hours or minutes.
Like this:
Id in_hours in_mins
1005 7 420 # (08:35+05:25)/2 = 7 hours
1006 1.29 77.4835 # (00:54:59+01:39:59)/2 = 1.29 hours
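Use window functions. LEAD and LAG are particularly helpful for this use case. I have not executed this SQL, but the concept is there.
SELECT id, working_hour, nextWH
FROM (
  SELECT id, working_hour,
         LEAD(working_hour) OVER (PARTITION BY id ORDER BY working_hour) AS nextWH
  FROM tableA
) t;
This will produce data like the following.
id |working_hour |nextWH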
1005|2019-10-23 08:35:00|2019-10-24 05:25:00
1005|2019-10-24 05:25:00|null
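Then filter out the records where nextWH is null, and use the date/time functions of your choice to compute the difference between working_hour and nextWH.
Here is a link to the window functions documentation.
You can try the following.
Your data, with working_hour as a timestamp: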
+------------------+----------------------------+--+
| working_hour.id | working_hour.working_hour |
+------------------+----------------------------+--+
| 1005 | 2019-10-23 08:35:00.0 |
| 1006 | 2019-10-23 00:54:59.0 |
| 1007 | 2019-10-23 00:24:57.0 |
| 1008 | 2019-10-23 06:40:00.0 |
| 1009 | 2019-10-23 03:50:00.0 |
| 1010 | 2019-10-23 03:25:01.0 |
| 1005 | 2019-10-24 05:25:00.0 |
| 1006 | 2019-10-24 01:39:59.0 |
| 1007 | 2019-10-24 02:30:00.0 |
| 1008 | 2019-10-24 09:45:01.0 |
| 1009 | 2019-10-24 02:10:00.0 |
| 1010 | 2019-10-24 07:00:00.0 |
+------------------+----------------------------+--+
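Use the LEAD window function, convert the timestamps to seconds, compute the difference between the two timestamps in seconds, and convert the seconds to minutes and hours.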
WITH t AS(
SELECT id, working_hour, LEAD(working_hour) OVER(PARTITION BY id ORDER BY working_hour) AS nextDay
FROM working_hour
) SELECT id, working_hour, nextDay,
ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 2, 2) AS in_secs, --AVG in seconds
ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 60 / 2,2) AS in_mins, --AVG in minutes
ROUND((unix_timestamp(nextDay) - unix_timestamp(working_hour)) / 60 / 60 / 2,2) AS in_hours --AVG in hours
FROM t
WHERE nextDay IS NOT NULL;
Output:
+-------+------------------------+------------------------+----------+----------+-----------+--+
| id | working_hour | nextday | in_secs | in_mins | in_hours |
+-------+------------------------+------------------------+----------+----------+-----------+--+
| 1005 | 2019-10-23 08:35:00.0 | 2019-10-24 05:25:00.0 | 37500.0 | 625.0 | 10.42 |
| 1006 | 2019-10-23 00:54:59.0 | 2019-10-24 01:39:59.0 | 44550.0 | 742.5 | 12.38 |
| 1007 | 2019-10-23 00:24:57.0 | 2019-10-24 02:30:00.0 | 46951.5 | 782.53 | 13.04 |
| 1008 | 2019-10-23 06:40:00.0 | 2019-10-24 09:45:01.0 | 48750.5 | 812.51 | 13.54 |
| 1009 | 2019-10-23 03:50:00.0 | 2019-10-24 02:10:00.0 | 40200.0 | 670.0 | 11.17 |
| 1010 | 2019-10-23 03:25:01.0 | 2019-10-24 07:00:00.0 | 49649.5 | 827.49 | 13.79 |
+-------+------------------------+------------------------+----------+----------+-----------+--+
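To see why these figures differ from the 7 hours the question expects for id 1005, it may help to compare the two quantities side by side. This is only a sketch: it assumes a Hive version that allows SELECT without FROM (0.13 or later) and no daylight-saving change between the two dates, and the column aliases are made up for illustration.

```sql
-- Illustrative check for id 1005 only; both aliases are hypothetical.
SELECT
  -- half of the elapsed time between the two rows (what the query above computes)
  (unix_timestamp('2019-10-24 05:25:00') - unix_timestamp('2019-10-23 08:35:00')) / 2 / 3600 AS half_gap_hours,
  -- mean of the two times of day, 08:35 and 05:25, expressed in hours
  ((8 * 60 + 35) + (5 * 60 + 25)) / 2 / 60 AS avg_time_of_day_hours;
```

The first expression is 37500 seconds, about 10.42 hours, which matches the in_hours column above. The second is 7 hours, which is what the question asks for.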
You can also take this approach:
WITH t AS(
SELECT id, working_hour, LEAD(working_hour) OVER(PARTITION BY id ORDER BY working_hour) AS nextDay
FROM working_hour
) SELECT id, working_hour, nextDay,
ROUND( ((hour(nextDay) * 60 + minute(nextDay) + hour(working_hour) * 60 + minute(working_hour)) / 60 / 2),2) AS in_hours,
ROUND( ((hour(nextDay) * 60 + minute(nextDay) + hour(working_hour) * 60 + minute(working_hour)) / 2),2) AS in_mins
FROM t
WHERE nextDay IS NOT NULL;
Output:
+-------+------------------------+------------------------+-----------+----------+--+
| id | working_hour | nextday | in_hours | in_mins |
+-------+------------------------+------------------------+-----------+----------+--+
| 1005 | 2019-10-23 08:35:00.0 | 2019-10-24 05:25:00.0 | 7.0 | 420.0 |
| 1006 | 2019-10-23 00:54:59.0 | 2019-10-24 01:39:59.0 | 1.28 | 76.5 |
| 1007 | 2019-10-23 00:24:57.0 | 2019-10-24 02:30:00.0 | 1.45 | 87.0 |
| 1008 | 2019-10-23 06:40:00.0 | 2019-10-24 09:45:01.0 | 8.21 | 492.5 |
| 1009 | 2019-10-23 03:50:00.0 | 2019-10-24 02:10:00.0 | 3.0 | 180.0 |
| 1010 | 2019-10-23 03:25:01.0 | 2019-10-24 07:00:00.0 | 5.21 | 312.5 |
+-------+------------------------+------------------------+-----------+----------+--+
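Both LEAD-based queries assume exactly two rows per id, and the second one ignores the seconds part, which is why 1006 shows 76.5 minutes rather than roughly 77.48. A possible alternative, sketched here under the assumption that working_hour is a TIMESTAMP column in a table named working_hour as above, averages the time of day with a plain GROUP BY. The names secs_of_day and s are invented for this example.

```sql
-- Average time of day per id, including seconds, for any number of rows per id.
SELECT id,
       ROUND(AVG(secs_of_day) / 3600, 2) AS in_hours,
       ROUND(AVG(secs_of_day) / 60, 2)   AS in_mins
FROM (
  -- seconds since midnight for each row
  SELECT id,
         hour(working_hour) * 3600 + minute(working_hour) * 60 + second(working_hour) AS secs_of_day
  FROM working_hour
) s
GROUP BY id;
```

For id 1005 this averages 30900 and 19500 seconds to 25200 seconds, that is 7 hours or 420 minutes. An id with only one row is returned with that row's time of day instead of being filtered out.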
Hope this helps.
I am keeping it as simple as possible. It works fine for me.
SELECT user_name,
       from_unixtime(CAST(AVG(unix_timestamp(substr(working_hours, 12), "HH:mm:ss")) AS bigint), "HH:mm:ss") AS avg_hours
FROM workinglogs1
GROUP BY user_name
ORDER BY avg_hours;
Here I use substr(working_hours,12) to take only the HH:mm:ss part of the working hours, then get the unix timestamp of that time. After that I take the average and convert it back to a timestamp with from_unixtime.
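Applied to the table from the question, the same idea might look like the sketch below. It assumes working_hour is a TIMESTAMP column in a table called working_hour, uses date_format (available from Hive 1.2) instead of a fixed substr offset, and relies, like the query above, on unix_timestamp and from_unixtime applying the same time zone.

```sql
-- Average time of day per id, returned as an HH:mm:ss string.
SELECT id,
       from_unixtime(
         CAST(AVG(unix_timestamp(date_format(working_hour, 'HH:mm:ss'), 'HH:mm:ss')) AS bigint),
         'HH:mm:ss'
       ) AS avg_hours
FROM working_hour
GROUP BY id
ORDER BY id;
```

Because the result is a formatted string, it suits display better than further arithmetic. If numeric hours or minutes are needed, it is simpler to keep the averaged seconds as a number.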