How can I count the frequency of occurrence of some specific words in my data in Hive?

Question

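I have a table of Twitter objects (in JSON format) in Hive, with n rows and 1 column. The task is to count how often words such as 'hon' and 'han' occur across the different objects. Each object has an attribute named 'text' that contains some text (string type). A word counts only once per object, even if it appears in that object more than once. I wrote the following query: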

select count(*) from table_name
where regexp(get_json_object(col_name, '$.text'), 'han')
limit 10

and got an error message like this:

FAILED: ParseException line 2:6 cannot recognize input near 'regexp' '(' 'get_json_object' in expression specification

How can I complete this query? Also, I don't know how to ignore case in the regular expression.

Answer 1

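Use the (?i) modifier for a case-insensitive comparison. Each row contributes at most 1 to a sum, so a word that appears several times in one object is still counted once: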

select 
      sum(case when text rlike '(?i)han' then 1 else 0 end) cnt_han,
      sum(case when text rlike '(?i)hon' then 1 else 0 end) cnt_hon
  from
(
select get_json_object(col_name, '$.text') as text 
  from table_name
)s;
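
Two side notes, offered as a sketch rather than a definitive fix. First, the ParseException in the question most likely comes from calling regexp like a function: in Hive, REGEXP and RLIKE are infix operators, written as A RLIKE B. Second, the pattern 'han' also matches inside longer words such as 'thanks'. If only whole words should count, word boundaries (\b) can be added. The variant below assumes the same placeholder names table_name and col_name as the question, and assumes backslashes in the string literal need doubling, which can depend on how the query is submitted.

-- Single-word count using the infix operator form (placeholder names from the question)
select count(*)
  from table_name
 where get_json_object(col_name, '$.text') rlike '(?i)\\bhan\\b';

-- Both words in one pass, matching whole words only
select
      sum(case when text rlike '(?i)\\bhan\\b' then 1 else 0 end) cnt_han,
      sum(case when text rlike '(?i)\\bhon\\b' then 1 else 0 end) cnt_hon
  from
(
select get_json_object(col_name, '$.text') as text
  from table_name
)s;

Rows whose 'text' attribute is missing yield NULL from get_json_object, so they fall into the else branch and add 0 to each count.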


json

hive

mapreduce