Athena query works in the console but not with the boto3 client in SageMaker (converting a CSV into a table)
I am trying to turn a CSV file in S3 into a table in Athena. The query works when I run it in the Athena console, but when I run it from a SageMaker Jupyter notebook with the boto3 client it returns:
"**InvalidRequestException**: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:8: no viable alternative at input 'CREATE EXTERNAL'"
Here is my code:
import boto3

def run_query(query):
    # Submit the query to Athena; results are written to the S3 output location.
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query,
        ResultConfiguration={
            'OutputLocation': 's3://path/to/s3output',
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
createTable = \
"""CREATE EXTERNAL TABLE TestTable (
ID string,
CustomerId string,
Ip string,
MessageFilename string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket_name/results/csv/'
TBLPROPERTIES ("skip.header.line.count"="1")"""
response = run_query(createTable)
print(response)
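I have run queries through the boto3 client against JSON data (using ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe') and that worked fine, but for some reason this one does not. I have tried changing the names, the syntax, and the quotes, but nothing seems to help.
Any suggestions would be appreciated. Thanks!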
Thanks for sharing a complete example. The problem is the escaping in SERDEPROPERTIES. It works after changing createTable as follows:
createTable = \
"""CREATE EXTERNAL TABLE testtable (
`id` string,
`customerid` string,
`ip` string,
`messagefilename` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\\"',
'escapeChar' = '\\' )
STORED AS TEXTFILE
LOCATION 's3://bucket_name/results/csv/'
TBLPROPERTIES ("skip.header.line.count"="1");"""
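
To see why the doubled backslashes matter, it can help to print what Python actually passes on as the query text. The following is a minimal sketch that uses only the two SERDEPROPERTIES lines; the variable names and the raw-string variant are illustrative assumptions and are not part of the original answer.

```python
# Lines as written in the question: in a normal triple-quoted string,
# Python treats \" and \' as escape sequences and drops the backslashes.
question_props = """'quoteChar' = '\"',
'escapeChar' = '\'"""

# Lines as written in the answer: each backslash is doubled, so one
# literal backslash survives in the resulting string.
answer_props = """'quoteChar' = '\\"',
'escapeChar' = '\\'"""

# Assumed alternative (not from the answer): a raw string keeps
# backslashes exactly as typed, so no doubling is needed.
raw_props = r"""'quoteChar' = '\"',
'escapeChar' = '\'"""

print(question_props)             # no backslashes left in the text
print(answer_props)               # one backslash per property
print(raw_props == answer_props)  # True
```

Under these assumptions, prefixing the original string with `r` would yield the same query text as the answer's doubled backslashes. This only compares the strings in Python and does not call Athena.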