Multiple character delimiter using Apache Sqoop import

Question

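I am using Apache Sqoop to import data from Teradata (RDBMS) into Hive. The delimiters commonly used for import, such as ",", "|", and "~", all occur in the table data. Is there a way to use multiple characters as the delimiter in Apache Sqoop?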

To work around this, I used the --escaped-by "\t" and --fields-terminated-by "," arguments in the sqoop import command. Is there a way to 'unescape' the "\t" that I used in the sqoop import?

Answer 1

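I use the '\b' delimiter whenever I run into challenging tables: ones with many data fields holding free text that may contain TAB and CR/LF characters. '\b', as in BACKSPACE, is very hard to insert into a character field in most databases, so it is unlikely to appear in the data.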

Here is an example of a sqoop command I use:

            sqoop import \
              --connect "jdbc:sqlserver://myserver;DatabaseName=MyDB;user=MyUser;password=MyPassword;port=1433" \
              --warehouse-dir=/user/MyUser/Import/MyDB \
              --fields-terminated-by '\b' --num-mappers 8 \
              --table training_deficiency \
              --hive-table stage.training_deficiency \
              --hive-import --hive-overwrite \
              --hive-delims-replacement '<newline>' \
              --split-by Training_Deficiency_ID \
              --outdir /home/MyUser/sqoop/java \
              --where "batch_update_dt > '2016-12-09 23:06:44.69'"
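As a rough local illustration (not part of the original answer), the sketch below shows why a backspace delimiter still splits cleanly when the field text itself contains commas, pipes, tildes, and tabs. The helper names `sample_row` and `count_fields` and the sample data are hypothetical; nothing here touches Sqoop or Hive.

```shell
#!/bin/sh
# Hold a literal backspace character (the byte that '\b' stands for).
BS=$(printf '\b')

# Hypothetical sample row: three fields separated by backspace.
# The field text deliberately contains ',', '|', '~' and a TAB.
sample_row() {
  printf 'Smith, John|~\bfree text with\ttab\b42\n'
}

# Count the fields on each input line, splitting on backspace only.
count_fields() {
  awk -F "$BS" '{ print NF }'
}

sample_row | count_fields
```

Because the split happens only on the backspace byte, the commas, pipes, tildes, and tab stay inside their fields. The same kind of check could be run on a small extract of real data before settling on a delimiter.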

hive

sqoop