Create an array of objects separated by newlines (conversion to the newline-delimited JSON format for BigQuery)

Question

I created a JSON object to store in Cloud Storage, but I need to convert it to the newline-delimited JSON format so that BigQuery can read it.

Here is my code:

  items = []
  for item in item_list:
    item = {'key': item}
    items.append(item)

The current output looks like this:

[{'item': 'stuff'}, {'item': 'stuff'}, {'item': 'stuff'}, {'item': 'stuff'}]

I need it to look like this:

{'item': 'stuff'}
{'item': 'stuff'}
{'item': 'stuff'}
{'item': 'stuff'}

As I understand it, I need to add a newline '\n' between each object in the array. How can I do that?

I am using the upload_from_string() method to upload the object to Cloud Storage.

Answer 1

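So first we convert the list to a string. Then we replace "[" with "[\n", so there is a newline after the opening bracket. Then, for the same reason, we replace "]" with "\n]". Finally, we replace every "}, " with "},\n".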

jsonString = str([{'item': 'stuff'}, {'item': 'stuff'}, {'item': 'stuff'}, {'item': 'stuff'}])

jsonString = jsonString.replace("[", "[\n")
jsonString = jsonString.replace("]", "\n]")
jsonString = jsonString.replace("}, ", "},\n")
print(jsonString)

Answer 2

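BigQuery expects the JSON-NL format (also known as NDJSON), which is a single JSON object on each line of the file.

Basically, this means that if you pick any line of the file at random, you must be able to deserialize it as JSON without needing any other part of the file.

So, to generate the file, instead of serializing the array, you need to serialize each object independently.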

import json

item_list = [ ...<all items>... ]

with open('send-to-bigquery.ndjson', 'w') as out:
    for item in item_list:
        out.write(json.dumps(item))
        out.write('\n')
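
The "any line can be deserialized on its own" property can be checked with a quick round trip. This is only a minimal sketch: the sample data is hypothetical and stands in for the real item_list.

```python
import json

# Hypothetical sample data, standing in for the real item_list.
item_list = [{'item': 'stuff'}, {'item': 'more stuff'}]

# Serialize each object independently, one JSON document per line.
ndjson = ''.join(json.dumps(item) + '\n' for item in item_list)

# Every line parses by itself: no enclosing brackets, no separating commas.
parsed = [json.loads(line) for line in ndjson.splitlines() if line]
assert parsed == item_list
```

Note that json.dumps emits double-quoted strings, which is what a JSON parser expects.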

To build it as a string instead:

lines = [json.dumps(item) for item in item_list]
file_content = '\n'.join(lines)
upload_from_string(file_content)  # to GCS
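
Putting the string version together with the upload might look roughly like the sketch below. It assumes blob is a google.cloud.storage Blob object (upload_from_string is a method on the blob, not a free function); the helper names to_ndjson and upload_ndjson, as well as the bucket and object names in the comment, are hypothetical.

```python
import json

def to_ndjson(items):
    """Return one JSON document per line, each terminated by a newline."""
    return ''.join(json.dumps(item) + '\n' for item in items)

def upload_ndjson(blob, items):
    # Assumption: blob is a google.cloud.storage Blob, obtained for example via
    #   storage.Client().bucket('my-bucket').blob('items.ndjson')
    # where 'my-bucket' and 'items.ndjson' are placeholder names.
    blob.upload_from_string(to_ndjson(items))
```

Ending every line with '\n', including the last one, means that files produced this way can be concatenated without two objects ending up on the same line.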

python

json

google-cloud-storage

google-bigquery