Extracting Key:Value from column fields in Hive

Question

I'm currently learning and testing Hive and can't find a suitable solution to this problem. I have log files like this:

IP, Date, Time, URL, Useragent

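I currently have these columns in a table. The columns are separated by '\t', but the URL carries some customer-specific information and looks something like this: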

example.org/log.gif?userID=xxx&sex=m&age=y&subscriber=y&lastlogin=ddd

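I would like to create a new table from these key-value pairs, with the columns userID, sex, age, subscriber, lastlogin. A further problem is that the pairs are not always complete, and some are missing entirely. For example: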

example.org/log.gif?userID=xxx&sex=m&age=y&subscriber=y&lastlogin=ddd

example.org/log.gif?userID=xxx&sex=m&age=y&lastlogin=

As far as I know, this makes Hive's "... format delimited fields terminated by '&';" useless in this case, because values would end up in the wrong columns.

Is there a way to solve this in Hive with SQL and regular expressions?

Answer 1

This can be done, although it takes two Hive tables. First, load the data into a table with the following columns:

IP, Date, Time, URL, Useragent

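Here I suggest using an EXTERNAL Hive table. You are not parsing the data at this stage, and this table does not need to live long, so simply put Hive metadata on top of the raw files: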

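CREATE EXTERNAL TABLE raw_log (
  ip                string,
  `date`            string,  -- quoted: DATE is a reserved word in newer Hive versions
  `time`            string,  -- quoted for the same reason, depending on Hive version
  url               string,
  useragent         string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- the log columns are tab-separated
LOCATION '<hdfs_location_of_the_raw_log_folder>';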

Then use an INSERT INTO query with Hive's regexp_extract(string subject, string pattern, int index) function (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) to load the data into a "final" table with the correct columns.
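As a rough sketch of that second step, the query below pulls each key out of the url column with its own pattern. The table name user_log and its column names and types are assumptions made for illustration, not part of the original setup. Each pattern is anchored on '?' or '&' so that a key cannot match inside a longer key name, and it captures everything up to the next '&'.

```sql
-- Hypothetical target table; the name and the string types are assumptions.
CREATE TABLE IF NOT EXISTS user_log (
  ip          string,
  userid      string,
  sex         string,
  age         string,
  subscriber  string,
  lastlogin   string
);

-- One regexp_extract call per key; group 1 is the value after "key=".
INSERT INTO TABLE user_log
SELECT
  ip,
  regexp_extract(url, '[?&]userID=([^&]*)', 1)     AS userid,
  regexp_extract(url, '[?&]sex=([^&]*)', 1)        AS sex,
  regexp_extract(url, '[?&]age=([^&]*)', 1)        AS age,
  regexp_extract(url, '[?&]subscriber=([^&]*)', 1) AS subscriber,
  regexp_extract(url, '[?&]lastlogin=([^&]*)', 1)  AS lastlogin
FROM raw_log;
```

Because every key is matched independently, a missing pair such as subscriber does not shift the other values into the wrong columns. As far as I know, regexp_extract yields an empty string rather than NULL when a pattern does not match, so it is worth checking how your Hive version behaves before relying on either.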

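You can also write your own UDF, which would let you handle the incomplete or missing values you mention more gracefully. The trade-off is that you have to recompile and redeploy the JAR every time the format of the input data changes (see https://cwiki.apache.org/confluence/display/Hive/HivePlugins).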
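Before reaching for a custom UDF, it may be worth trying Hive's built-in str_to_map function, which is a different technique from the ones named above. The sketch below assumes the raw_log table from this answer and assumes the query string is everything after the first '?' in url. It splits that string into a map and looks keys up by name.

```sql
-- Sketch only: turns the query string into a map<string,string>.
-- split() takes a regex, so '?' is escaped; [1] is the part after the '?'.
SELECT
  ip,
  params['userID']     AS userid,
  params['sex']        AS sex,
  params['age']        AS age,
  params['subscriber'] AS subscriber,
  params['lastlogin']  AS lastlogin
FROM (
  SELECT
    ip,
    str_to_map(split(url, '\\?')[1], '&', '=') AS params
  FROM raw_log
) t;
```

Looking up a key that is absent from the map returns NULL, so missing pairs should not disturb the remaining columns. New keys appearing in the logs only require another lookup in the SELECT list rather than a redeployed JAR.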
