Splitting a table by first letter in Impala/Hive
I have the following FNAMES table (it contains about 58k records):
+------+-------------+
| ID   | NICKNAMES   |
+------+-------------+
| 1    | Avile       |
| 2    | Dudi        |
| 3    | Moshiko     |
| 4    | Avi         |
| 5    | DAVE        |
....
I want to split the table into groups of records that share the same first letter, like this:
+------+-------------+
| ID   | NICKNAMES   |
+------+-------------+
| 1    | Avile       |
| 4    | Avi         |
| 2    | Dudi        |
| 5    | DAVE        |
| 3    | Moshiko     |
....
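Within each group, I want to find the records with the minimum Jaro-Winkler distance. In other words, among all the names starting with 'a', I want to find the ones that are most similar to each other, and likewise for every other letter.
What do I have to change in the following code?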
select FNAMES.* , MIN(Jaro–Winkler(FNAMES.NICKNAMES, FNAMES.NICKNAMES))
from FNAMES
LEFT OUTER JOIN FNAMES
ON(true)
WHERE Jaro–Winkler (FNAMES.NICKNAMES, FNAMES.NICKNAMES) <= 4
GROUP BY FNAMES.NICKNAMES
Something like this:
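select  t.nickname
       ,t.closest_nickname
from   (select  f1.nicknames as nickname
               ,f2.nicknames as closest_nickname
               ,rank () over
                (
                    partition by f1.nicknames
                    -- jaro_winkler is a placeholder for whatever UDF is available;
                    -- desc assumes it returns a similarity (use asc for a distance)
                    order by jaro_winkler(f1.nicknames,f2.nicknames) desc
                ) as rnk
        from    fnames f1
        -- the where clause below discards unmatched rows, so a plain join is equivalent
        join    fnames f2
        on      substr(f1.nicknames,1,1) =
                substr(f2.nicknames,1,1)
        -- compare each pair only once and never a name with itself
        where   f1.nicknames < f2.nicknames
       ) t
where   t.rnk = 1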