Read non-delimited ASCII file in Apache Pig Latin
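I'm trying to read a text file in Apache Pig Latin in which every line is non-delimited ASCII. That is, each column starts and ends at a fixed character position within the line.
Sample definition: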
+--------+----------------+--------------+
| Column | Start Position | End Position |
+--------+----------------+--------------+
| A      | 1              | 6            |
+--------+----------------+--------------+
| B      | 8              | 11           |
+--------+----------------+--------------+
| C      | 13             | 15           |
+--------+----------------+--------------+
Sample data:
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| s | a | m | p | l | e |   | d | a | t  | a  |    |    | h  | i  |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| d | u | d | e |   |   |   | h | i |    |    |    | b  | r  | o  |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
Expected output:
sample, data, hi
dude, hi, bro
How can I read this in Pig? PigStorage doesn't seem flexible enough: it only splits on a delimiter string (comma, tab, etc.), not on character positions.
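It looks like Apache provides a loader for exactly this use case in Piggybank, FixedWidthLoader. The Piggybank jar has to be registered with REGISTER before the class can be used, and the LOAD result has to be assigned to an alias:
data = LOAD 'data.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6, 8-11, 13-15', 'SKIP_HEADER') AS (a, b, c);
Note that 'SKIP_HEADER' drops the first line of each file. As far as I know that second argument is optional, so leave it out if the file has no header row, as with the sample data above.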
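If registering Piggybank is not an option, the same split can be sketched with built-ins only: load each line whole with TextLoader, then cut it by position with SUBSTRING and strip the padding with TRIM. This is a minimal sketch under my own assumptions, not how the Piggybank loader works internally; the alias names (raw, parsed) and the chararray types are made up for illustration. SUBSTRING takes a zero-based start index and an exclusive stop index, so the 1-based inclusive range 8-11 becomes (7, 11).

```pig
-- Load every line as a single chararray field (no delimiter splitting).
raw = LOAD 'data.txt' USING TextLoader() AS (line:chararray);

-- Cut each line by position. A 1-based inclusive range start-end
-- maps to SUBSTRING(line, start - 1, end); TRIM removes the padding.
parsed = FOREACH raw GENERATE
    TRIM(SUBSTRING(line, 0, 6))   AS a,  -- columns 1-6
    TRIM(SUBSTRING(line, 7, 11))  AS b,  -- columns 8-11
    TRIM(SUBSTRING(line, 12, 15)) AS c;  -- columns 13-15

DUMP parsed;
```

For the two sample rows this should give (sample,data,hi) and (dude,hi,bro). As far as I can tell, a line that is shorter than a field's start position yields null for that field, so ragged input may need extra handling.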