How to pass multiple delimiters in Python for BigQuery storage using Cloud Functions
I am trying to load multiple CSV files into a BigQuery table. For some of the CSV files the delimiter is a comma, and for others it is a semicolon. Is there a way to pass multiple delimiters in the job config?
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    write_disposition="WRITE_APPEND",
    skip_leading_rows=1,
)
Thanks,
Riz
I deployed the following code in Cloud Functions for this, using "Cloud Storage" as the trigger and "Finalize/Create" as the event type. The code successfully runs a BigQuery load job on both comma- and semicolon-delimited files.
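For reference, the handler below reads the uploaded object's name out of the event payload. A background Cloud Storage trigger passes a dictionary whose relevant fields look roughly like this (a minimal sketch; the values are placeholders and other metadata fields are omitted):

event = {
    "bucket": "Bucket-Name",     # bucket that fired the trigger
    "name": "uploads/data.csv",  # path of the finalized object
}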
main.py
from google.cloud import bigquery
from google.cloud import storage
import subprocess


def hello_gcs(event, context):
    # Construct BigQuery and Cloud Storage client objects.
    bq_client = bigquery.Client()
    storage_client = storage.Client()

    # Fetch the object that triggered this invocation.
    bucket = storage_client.get_bucket('Bucket-Name')
    blob = bucket.get_blob(event['name'])

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "ProjectID.DatasetName.TableName"

    # /tmp is the only writable location in Cloud Functions.
    with open("/tmp/z", "wb") as file_obj:
        blob.download_to_file(file_obj)

    # Normalize the delimiter: rewrite every semicolon as a comma in place.
    # The trailing "g" makes sed replace all occurrences on each line,
    # not just the first.
    subprocess.call(["sed", "-i", "-e", "s/;/,/g", "/tmp/z"])

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        field_delimiter=",",
        write_disposition="WRITE_APPEND",
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    # Make an API request and wait for the load job to complete.
    with open("/tmp/z", "rb") as source_file:
        job = bq_client.load_table_from_file(source_file, table_id, job_config=job_config)
    job.result()
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud
google-cloud-bigquery
google-cloud-storage
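With main.py and requirements.txt in place, the function can also be deployed from the command line. This is a rough equivalent of the console setup described above (the runtime version is an assumption; Bucket-Name is a placeholder):

gcloud functions deploy hello_gcs \
    --runtime python39 \
    --trigger-resource Bucket-Name \
    --trigger-event google.storage.object.finalize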
Here, I replaced ";" with "," using the sed command. One thing to note is that when writing a file in Cloud Functions, we need to give the path as /tmp/file_name, because /tmp is the only location in Cloud Functions where writing files is allowed. This approach also assumes that the file contains no commas or semicolons other than the delimiter.
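If the files may contain commas or semicolons inside values, an alternative to rewriting the file with sed is to detect each file's delimiter and pass it straight to field_delimiter, since BigQuery accepts a single character there. A minimal sketch using Python's csv.Sniffer (the detect_delimiter helper is mine, not part of the BigQuery API):

import csv

def detect_delimiter(path):
    # Sniff the delimiter from a sample of the file, restricted to
    # the two delimiters we expect to see.
    with open(path, "r", newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(4096), delimiters=",;")
    return dialect.delimiter

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    skip_leading_rows=1,
    field_delimiter=detect_delimiter("/tmp/z"),
    write_disposition="WRITE_APPEND",
    source_format=bigquery.SourceFormat.CSV,
)

Each load job's config then matches that file's actual delimiter, and the data itself never needs to be rewritten.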