NULLS LAST function for Hive
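I have the following algorithm for selecting records. Based on the example below, the records shown under "Result" should be selected.
If "issuedate" is an empty column, we take the "publid" that has more "inn" values.
If the "issuedate" values are not all the same, we take the rows where "issuedate" is the latest date.
If the "issuedate" values are all equal, we take the rows where "operdate" is the latest date.
If "issuedate" and "operdate" are both equal, we take the "publid" that has more "inn" values.
I wrote the code in Oracle and want to run it in Hive, but I get an error. I think it is caused by NULLS LAST. Please tell me how to replace NULLS LAST in this code with something that works in Hive.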
Example
| inn | publid | clusterid | issuedate | operdate |
|-----|--------|-----------|-----------|----------|
| 333 | 1 | 12 | 01-01-21 | 05-01-21 |
| 222 | 1 | 12 | 01-01-21 | 05-01-21 |
| 333 | 2 | 12 | 01-01-21 | 05-01-21 |
| 222 | 2 | 12 | 01-01-21 | 05-01-21 |
| 111 | 2 | 12 | 01-01-21 | 05-01-21 |
|-----|--------|-----------|-----------|----------|
| 123 | 1 | 1 | 01-01-21 | |
| 456 | 1 | 1 | 01-01-21 | |
| 123 | 2 | 1 | 03-01-21 | |
| 456 | 2 | 1 | 03-01-21 | |
| 789 | 2 | 1 | 03-01-21 | |
| 123 | 3 | 1 | 02-01-21 | |
| 456 | 3 | 1 | 02-01-21 | |
|-----|--------|-----------|-----------|----------|
| 123 | 1 | 1 | | 01-01-21 |
| 456 | 1 | 1 | | 01-01-21 |
| 123 | 2 | 1 | | 03-01-21 |
| 456 | 2 | 1 | | 03-01-21 |
| 789 | 2 | 1 | | 03-01-21 |
| 123 | 3 | 1 | | 02-01-21 |
| 456 | 3 | 1 | | 02-01-21 |
Result
| inn | publid | clusterid | issuedate | operdate |
|-----|--------|-----------|-----------|----------|
| 333 | 2 | 12 | 01-01-21 | 05-01-21 |
| 222 | 2 | 12 | 01-01-21 | 05-01-21 |
| 111 | 2 | 12 | 01-01-21 | 05-01-21 |
|-----|--------|-----------|-----------|----------|
| 123 | 2 | 1 | 03-01-21 | |
| 456 | 2 | 1 | 03-01-21 | |
| 789 | 2 | 1 | 03-01-21 | |
|-----|--------|-----------|-----------|----------|
| 123 | 2 | 1 | | 03-01-21 |
| 456 | 2 | 1 | | 03-01-21 |
| 789 | 2 | 1 | | 03-01-21 |
    SELECT inn,
           publid,
           clusterid,
           issuedate,
           operdate
    FROM (
        SELECT inn,
               publid,
               clusterid,
               issuedate,
               operdate,
               DENSE_RANK() OVER (
                   PARTITION BY clusterid
                   ORDER BY COALESCE( issuedate, operdate ) DESC NULLS LAST,
                            cnt DESC
               ) AS rnk
        FROM (
            SELECT t.*,
                   COUNT(inn) OVER (PARTITION BY publid) cnt
            FROM table_name t
            WHERE clusterid is not null
        )
    )
    WHERE rnk = 1;
Just add one more expression to the ORDER BY.
Replace this:

    ORDER BY COALESCE( issuedate, operdate ) DESC NULLS LAST

with this:

    ORDER BY CASE WHEN COALESCE(issuedate, operdate) is NOT NULL THEN 1 ELSE 2 END, -- acts as NULLS LAST
             COALESCE( issuedate, operdate ) DESC
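
For reference, here is a sketch of how the whole query from the question might look with that substitution applied. The subquery aliases `counted` and `ranked` are names I made up; they are added on the assumption that your Hive version, like older releases, requires an alias on every subquery in FROM. I have not run this against your data.

```sql
-- Sketch only: the question's query with NULLS LAST emulated by a CASE expression.
-- "counted" and "ranked" are hypothetical aliases, not names from the question.
SELECT inn,
       publid,
       clusterid,
       issuedate,
       operdate
FROM (
    SELECT inn,
           publid,
           clusterid,
           issuedate,
           operdate,
           DENSE_RANK() OVER (
               PARTITION BY clusterid
               ORDER BY
                   -- rows with a usable date sort first, rows with no date last
                   CASE WHEN COALESCE(issuedate, operdate) IS NOT NULL THEN 1 ELSE 2 END,
                   -- latest date first
                   COALESCE(issuedate, operdate) DESC,
                   -- tie-break: the publid with more inn values
                   cnt DESC
           ) AS rnk
    FROM (
        SELECT t.*,
               COUNT(inn) OVER (PARTITION BY publid) AS cnt
        FROM table_name t
        WHERE clusterid IS NOT NULL
    ) counted
) ranked
WHERE rnk = 1;
```

The CASE expression sorts ascending, so the value 1 (date present) always ranks ahead of 2 (no date), whatever the engine's default NULL placement is.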
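Also, according to the Jira issue HIVE-12994, NULLS FIRST is currently the default for ASC order and NULLS LAST is the default for DESC order, so you can probably just remove NULLS LAST and rely on the DESC default. That needs to be double-checked on your version.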
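
One way to double-check the default is a throwaway query like the one below. It is only a sketch: the column name `d`, the alias `sample_dates`, and the literal dates are made up for the test, and it assumes your Hive version allows SELECT without a FROM clause.

```sql
-- Sketch only: check where NULL lands under a plain DESC sort in your Hive version.
-- "d" and "sample_dates" are hypothetical names used just for this check.
SELECT d
FROM (
    SELECT CAST('2021-01-01' AS DATE) AS d
    UNION ALL
    SELECT CAST('2021-01-03' AS DATE) AS d
    UNION ALL
    SELECT CAST(NULL AS DATE) AS d
) sample_dates
ORDER BY d DESC;
```

If the NULL row comes back last, the default matches what the Jira describes and a plain DESC is enough. If it comes back first, keep the CASE expression.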