NULLS LAST function for Hive

I have the following algorithm for selecting records. For the example below, it should select the records shown under "Result".

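  1. If the "issuedate" column is empty, we take the "publid" that has more "inn" values.

  2. If the "issuedate" values are not all equal, we take the rows where "issuedate" is the latest date.

  3. If the "issuedate" values are all equal, we take the rows where "operdate" is the latest date.

  4. If "issuedate" and "operdate" are both equal, we take the "publid" that has more "inn" values.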

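I wrote the code below in Oracle and want to run it in Hive, but I get an error. I think this is because of the NULLS LAST clause. Please tell me how to replace NULLS LAST in the code with something that works in Hive.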

Example

| inn | publid | clusterid | issuedate | operdate |
|-----|--------|-----------|-----------|----------|
| 333 |   1    |    12     |  01-01-21 | 05-01-21 |
| 222 |   1    |    12     |  01-01-21 | 05-01-21 |
| 333 |   2    |    12     |  01-01-21 | 05-01-21 | 
| 222 |   2    |    12     |  01-01-21 | 05-01-21 |
| 111 |   2    |    12     |  01-01-21 | 05-01-21 |
|-----|--------|-----------|-----------|----------|
| 123 |   1    |     1     |  01-01-21 |          |
| 456 |   1    |     1     |  01-01-21 |          |
| 123 |   2    |     1     |  03-01-21 |          |
| 456 |   2    |     1     |  03-01-21 |          | 
| 789 |   2    |     1     |  03-01-21 |          |
| 123 |   3    |     1     |  02-01-21 |          |
| 456 |   3    |     1     |  02-01-21 |          |
|-----|--------|-----------|-----------|----------|
| 123 |   1    |     1     |           | 01-01-21 |
| 456 |   1    |     1     |           | 01-01-21 |
| 123 |   2    |     1     |           | 03-01-21 |
| 456 |   2    |     1     |           | 03-01-21 | 
| 789 |   2    |     1     |           | 03-01-21 |
| 123 |   3    |     1     |           | 02-01-21 |
| 456 |   3    |     1     |           | 02-01-21 |

Result

| inn | publid | clusterid | issuedate | operdate |
|-----|--------|-----------|-----------|----------|
| 333 |   2    |    12     |  01-01-21 | 05-01-21 |
| 222 |   2    |    12     |  01-01-21 | 05-01-21 |
| 111 |   2    |    12     |  01-01-21 | 05-01-21 |
|-----|--------|-----------|-----------|----------|
| 123 |   2    |     1     |  03-01-21 |          |
| 456 |   2    |     1     |  03-01-21 |          |
| 789 |   2    |     1     |  03-01-21 |          |
|-----|--------|-----------|-----------|----------|
| 123 |   2    |     1     |           | 03-01-21 |
| 456 |   2    |     1     |           | 03-01-21 |
| 789 |   2    |     1     |           | 03-01-21 |

    SELECT inn,
           publid,
           clusterid,
           issuedate,
           operdate
    FROM   (
      SELECT inn,
             publid,
             clusterid,
             issuedate,
             operdate,
             DENSE_RANK() OVER (
               PARTITION BY clusterid
               ORDER     BY COALESCE( issuedate, operdate ) DESC NULLS LAST,
                            cnt DESC
             ) AS rnk
      FROM   (
        SELECT t.*,
               COUNT(inn) OVER (PARTITION BY publid) cnt
        FROM   table_name t
        WHERE  clusterid is not null
      )
    )
    WHERE  rnk = 1;

Just add one more expression to the ORDER BY.

Replace this:

    ORDER BY COALESCE( issuedate, operdate ) DESC NULLS LAST

with this:

    ORDER BY CASE WHEN COALESCE(issuedate, operdate) is NOT NULL THEN 1 ELSE 2 END, --acts as NULLS LAST
             COALESCE( issuedate, operdate ) DESC
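
For reference, here is a sketch of how the whole query might look in Hive with that change applied. It keeps the original `cnt DESC` tie-breaker as the last sort key. The subquery aliases `counted` and `ranked` are names I made up, added on the assumption that your Hive version requires an alias for every subquery in FROM; I have not run this against your data.

```sql
SELECT inn,
       publid,
       clusterid,
       issuedate,
       operdate
FROM   (
  SELECT inn,
         publid,
         clusterid,
         issuedate,
         operdate,
         DENSE_RANK() OVER (
           PARTITION BY clusterid
           -- rows with a non-NULL date sort first, which emulates NULLS LAST
           ORDER BY CASE WHEN COALESCE(issuedate, operdate) IS NOT NULL THEN 1 ELSE 2 END,
                    COALESCE(issuedate, operdate) DESC,
                    cnt DESC
         ) AS rnk
  FROM   (
    SELECT t.*,
           COUNT(inn) OVER (PARTITION BY publid) AS cnt
    FROM   table_name t
    WHERE  clusterid IS NOT NULL
  ) counted   -- hypothetical alias
) ranked      -- hypothetical alias
WHERE  rnk = 1;
```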

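Also, according to this Jira: HIVE-12994, NULLS FIRST is currently the default for ASC order and NULLS LAST is the default for DESC order, so you can probably just remove NULLS LAST and get the same behavior by default with DESC. This needs to be double-checked.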
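
One way to double-check is a small throwaway query like the one below. It sorts three literal values, one of them NULL, in DESC order so you can see where the NULL ends up on your Hive version. The column name `v` and the alias `probe` are placeholders of mine, and this assumes your Hive version allows SELECT without a FROM clause.

```sql
-- If the NULL row comes back last, DESC already behaves like NULLS LAST here.
SELECT v
FROM   (
  SELECT CAST(NULL AS STRING) AS v
  UNION ALL
  SELECT '2021-01-01' AS v
  UNION ALL
  SELECT '2021-01-03' AS v
) probe
ORDER BY v DESC;
```

If the NULL does not come last, keep the CASE expression from above.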