Snowflake "Invalid UTF8 detected in string SNOWFLAKE" using PUT

I have a CSV file with the following content:

"Compañía","Aeropuerto Base","Año","Clase","Grupo Compañía","Mes","Movimiento","País","Servicio","Tipo Avión","Tipo Tráfico","Operaciones Totales"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE SCHENGEN","Total","","","","","","","0"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE NO SCHENGEN","Total","","","","","","","4"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","INTERNACIONAL","Total","","","","","","","2"

I uploaded it to Snowflake using a stage:

PUT 'file://C:\tmp\opc2020.csv' @demo_stage;

I created a file format:

CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',';

If I try to query the contents:

SELECT C.$1 FROM @demo_stage (file_format => 'demo_file_format') C;

I get an error:

SQL Error [100144] [22000]: Invalid UTF8 detected in string '0xFF0xFE"0x00C0x00o0x00m0x00p0x00a0x000xF10x000xED0x00a0x00"0x00'
  File 'opc2020.csv.gz', line 1, character 1
  Row 1, column "TRANSIENT_STAGE_TABLE"["":1]

If I add the VALIDATE_UTF8 = FALSE property, I can query the stage, but the non-ASCII characters are lost and unexpected spaces appear between the characters:

CREATE OR REPLACE FILE FORMAT dbt_demo_file_format TYPE = 'CSV' field_delimiter = ',' VALIDATE_UTF8 = FALSE;


��" C o m p a � � a "                                            |
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     
 " 2   E X C E L   A V I A T I O N   L T D "     

How can I fix this?

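You will run into this problem whenever the original file was generated with a character set other than UTF-8.

If you know which character set was used to generate the file, you can set the ENCODING parameter in the FILE FORMAT statement to the matching value.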
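If you are not sure which character set the file uses, the first few bytes often give it away, because many tools write a byte order mark (BOM) at the start. The following is a minimal Python sketch of that check, not a full charset detector. The function names are hypothetical, and the returned labels are only meant as hints for choosing an ENCODING value; a file without a BOM simply yields None.

```python
import codecs

# Known byte order marks. UTF-32 comes first because its little-endian
# BOM (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_encoding_from_bom(head: bytes):
    """Return an encoding label if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def guess_file_encoding(path):
    """Read the first four bytes of the file at `path` and inspect them."""
    with open(path, "rb") as f:
        return guess_encoding_from_bom(f.read(4))
```

You would call guess_file_encoding on the local file before running PUT, for example on C:\tmp\opc2020.csv from the question.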

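In your case the error message points to UTF-16LE: the string starts with 0xFF 0xFE, which is the UTF-16 little-endian byte order mark, and every ASCII character is followed by a 0x00 byte. If the original file was indeed created as UTF-16LE, your CREATE FILE FORMAT would look like this: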

CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',' ENCODING = 'UTF-16LE' VALIDATE_UTF8 = TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '"';
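As a sanity check on that diagnosis, you can decode the bytes quoted in the error message locally. This is a small Python sketch that only uses the byte values shown in the error; the variable and function names are made up for illustration. Reading the same bytes one byte per character leaves a NUL after each character, which is presumably what shows up as the gaps in your VALIDATE_UTF8 = FALSE output.

```python
# Bytes quoted in the error message: 0xFF 0xFE followed by the first
# header field, two bytes per character.
raw = bytes.fromhex("fffe 2200 4300 6f00 6d00 7000 6100 f100 ed00 6100 2200")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The "utf-16" codec reads the BOM, picks little-endian, and drops the BOM.
header = raw.decode("utf-16")  # header is now '"Compañía"'

# Treating each byte as one character (Latin-1) keeps a NUL after every
# real character instead of merging the byte pairs.
nul_padded = raw[2:].decode("latin-1")
```

Here decodes_as_utf8(raw) is False because 0xFF can never start a UTF-8 sequence, which matches the error Snowflake reports at line 1, character 1.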

For more information on character sets and encodings, see our documentation.