Word count using Hive

Suppose I have a table with the columns id and content:

id | content
________________________
1  | abc abr abc as abs
2  | abc arc cre arc
3  | agr ann agd agd agd

The output I want looks like this:

{"abc":2,"abr":1,"as":1, "abs":1}  # for id 1
{"abc":1,"arc":2,"cre":1}          # for id 2
{"agr":1,"agd":3,"ann":1}          # for id 3

How can I do this with Hive?

You need the Brickhouse library. It is very simple to build.

Query:

ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';

SELECT id
  , COLLECT(words, c) AS count_map
FROM (
  SELECT id
    , words
    , COUNT(*) AS c
  FROM (
    SELECT id, words
    FROM db.tbl
    LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
  GROUP BY id, words ) y
GROUP BY id

Output:

+----+---------------------------------+
|id  |count_map                        |
+----+---------------------------------+
|1   |{"as":1,"abs":1,"abc":2,"abr":1} |
+----+---------------------------------+
|2   |{"cre":1,"arc":2,"abc":1}        |
+----+---------------------------------+
|3   |{"ann":1,"agr":1,"agd":3}        |
+----+---------------------------------+
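
If adding a JAR is not an option, roughly the same result can probably be obtained with built-in Hive functions only. The following is a sketch, not a drop-in replacement: it assumes a Hive version that provides COLLECT_LIST (0.13 or later), it assumes that no word contains ',' or ':' (both are used as delimiters here), and the resulting map holds the counts as strings rather than integers.

```sql
-- Sketch without Brickhouse: build "word:count" pairs, join them into one
-- string per id, then parse that string back into a map.
-- Assumption: words never contain ',' or ':'.
SELECT id
  , STR_TO_MAP(
      CONCAT_WS(',', COLLECT_LIST(CONCAT(words, ':', CAST(c AS STRING)))),
      ',', ':') AS count_map  -- map<string,string>: counts come back as strings
FROM (
  SELECT id
    , words
    , COUNT(*) AS c
  FROM (
    SELECT id, words
    FROM db.tbl
    LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
  GROUP BY id, words ) y
GROUP BY id;
```

The inner two levels are the same as in the Brickhouse query above: EXPLODE turns each word into its own row, and the first GROUP BY counts each word per id. Only the outermost aggregation differs. As with any Hive map, the order of the keys in the output is not guaranteed.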