Snowflake "Invalid UTF8 detected in string SNOWFLAKE" using PUT
I have a CSV file with the following content:
"Compañía","Aeropuerto Base","Año","Clase","Grupo Compañía","Mes","Movimiento","País","Servicio","Tipo Avión","Tipo Tráfico","Operaciones Totales"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE SCHENGEN","Total","","","","","","","0"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE NO SCHENGEN","Total","","","","","","","4"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","INTERNACIONAL","Total","","","","","","","2"
I uploaded it to Snowflake using a stage:
PUT 'file://C:\tmp\opc2020.csv' @demo_stage;
I created a file format:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',';
If I try to query the contents:
SELECT C.$1 FROM @demo_stage (file_format => 'demo_file_format') C;
I get an error:
SQL Error [100144] [22000]: Invalid UTF8 detected in string '0xFF0xFE"0x00C0x00o0x00m0x00p0x00a0x000xF10x000xED0x00a0x00"0x00'
File 'opc2020.csv.gz', line 1, character 1
Row 1, column "TRANSIENT_STAGE_TABLE"["":1]
If I add the VALIDATE_UTF8 = FALSE property, I can query the stage, but the UTF-8 characters are lost and unexpected spaces appear between the characters:
CREATE OR REPLACE FILE FORMAT dbt_demo_file_format TYPE = 'CSV' field_delimiter = ',' VALIDATE_UTF8 = FALSE;
��" C o m p a � � a " |
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
How can I fix this?
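You will run into this problem if the original file was generated with a character set other than UTF-8.
If you know which character set was used to generate the file, you can set the ENCODING parameter in the FILE FORMAT statement to the correct value.
In your case, the string in the error message starts with 0xFF 0xFE, which is the UTF-16LE byte order mark, and every ASCII character is followed by 0x00, so the file was most likely created as UTF-16LE. If so, your CREATE FILE FORMAT would look like this: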
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',' ENCODING = 'UTF16LE' VALIDATE_UTF8 = TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '"';
For more information on character sets and encodings, please refer to our documentation here.
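If you are not sure which character set the file was written in, one way to check before running PUT is to look at the first bytes of the local file for a byte order mark (BOM). The snippet below is only a minimal sketch in Python, not part of Snowflake: the function names are made up for illustration, and the returned labels are my assumption of how the names in Snowflake's supported-encodings list are spelled, so verify them against the documentation. A file without a BOM returns None, which tells you nothing either way.

```python
import codecs

# BOM signatures, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
# The labels on the right are assumed ENCODING names; check them
# against the Snowflake documentation before using them.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF32LE"),
    (codecs.BOM_UTF32_BE, "UTF32BE"),
    (codecs.BOM_UTF8, "UTF8"),
    (codecs.BOM_UTF16_LE, "UTF16LE"),
    (codecs.BOM_UTF16_BE, "UTF16BE"),
]


def guess_encoding_from_bom(head):
    """Return an encoding label if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def guess_file_encoding(path):
    """Read the first four bytes of a local file and inspect them for a BOM."""
    with open(path, "rb") as f:
        return guess_encoding_from_bom(f.read(4))
```

For the file in the question you would call guess_file_encoding(r"C:\tmp\opc2020.csv"). The bytes shown in the error message (0xFF 0xFE followed by a quote and 0x00) match the UTF-16 LE signature.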