How to index JSON files stored in S3 by keys?
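Suppose I want to store hundreds of JSON files in S3. All of these JSON files share the same schema. I would like to search these files by keys and values: for example, find all JSON files where key a has value = "abc*" and key x has value = "xyz". I want the search to return the file names and keys that match the query.
What is the best way to index JSON files stored in S3 by keys?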
This is a follow-up to my previous question.
You may want to consider using S3 Select.
With Amazon S3 Select, you can use simple structured query language
(SQL) statements to filter the contents of Amazon S3 objects and
retrieve just the subset of data that you need. By using Amazon S3
Select to filter this data, you can reduce the amount of data that
Amazon S3 transfers, which reduces the cost and latency to retrieve
this data.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache
Parquet format.
Here is a great blog post on how to use S3 Select.
The sample code looks like this:
import boto3

# S3 bucket to query (Change this to your bucket)
S3_BUCKET = 'greg-college-data'

s3 = boto3.client('s3')

r = s3.select_object_content(
    Bucket=S3_BUCKET,
    Key='COLLEGE_DATA_2015.csv',
    ExpressionType='SQL',
    Expression="select \"INSTNM\" from s3object s where s.\"STABBR\" in ['OR', 'IA']",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)

# The response is an event stream; only 'Records' events carry result data
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
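The sample above queries a single CSV object, while the question is about many JSON files. A minimal sketch of how the same call might be adapted follows. It assumes each object holds JSON documents with top-level keys a and x. The function names, the QUERY string, and the bucket and prefix in the usage comment are hypothetical. Note that this scans every object on each search rather than building an index, which is likely acceptable for a few hundred files.

```python
import json

# Hypothetical query for the question's example:
# key "a" starts with "abc" and key "x" equals "xyz".
QUERY = "SELECT s.a, s.x FROM S3Object s WHERE s.a LIKE 'abc%' AND s.x = 'xyz'"


def select_json(s3, bucket, key, expression):
    """Run an S3 Select query against one JSON object; return matching records."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}},
        OutputSerialization={"JSON": {}},
    )
    # A record may be split across several 'Records' events,
    # so collect the raw bytes first and decode once at the end.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    text = b"".join(chunks).decode("utf-8")
    # JSON output is newline-delimited: one record per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_matching_files(s3, bucket, expression, prefix=""):
    """Return {object key: matching records} for objects with at least one match."""
    matches = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            records = select_json(s3, bucket, obj["Key"], expression)
            if records:
                matches[obj["Key"]] = records
    return matches


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   print(find_matching_files(s3, "my-bucket", QUERY, prefix="data/"))
```

The result maps each matching file name to the key/value pairs selected from it, which is the shape the question asks for.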