"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" 在没有架构的情况下将数据发送到 BigQuery

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm writing a script that sends a dataframe to BigQuery:

load_job = bq_client.load_table_from_dataframe(
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)

# Wait for the load job to complete
return load_job.result() 

This works fine, but only if the schema is already defined in BigQuery or I define the job's schema in the script. If no schema is defined, I get the following error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe
    dataframe.to_parquet(tmppath, compression=parquet_compression)
  File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 151, in main
    df = df(param1, param2)
  File "/user_code/main.py", line 141, in get_df
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'

Why does pyarrow raise this error, and how can I fix it other than predefining a schema?

The default behavior when converting from pandas to Arrow or Parquet is to not allow silent data loss. There are options that can be set when doing the conversion to allow unsafe casts that lose timestamp precision or cause other forms of data loss. The BigQuery Python API would need to set these options, so this may be a bug in the BigQuery library. I suggest reporting it on their issue tracker: https://github.com/googleapis/google-cloud-python
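As a workaround until the library exposes those options, you can write the Parquet file yourself, passing the unsafe-cast flags through to pyarrow, and then load that file instead. This is only a sketch, assuming the same df, bq_client, PROJECT, DATASET and PROGRAMS_TABLE as in the question:

from google.cloud import bigquery

# Write the Parquet file ourselves so we control the pyarrow options.
# coerce_timestamps='us' plus allow_truncated_timestamps=True permit the
# ns -> us cast that the internal to_parquet call rejects.
tmppath = '/tmp/df.parquet'
df.to_parquet(
    tmppath,
    engine='pyarrow',
    coerce_timestamps='us',
    allow_truncated_timestamps=True,
)

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
with open(tmppath, 'rb') as f:
    load_job = bq_client.load_table_from_file(
        f, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]), job_config=job_config
    )
load_job.result()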

I think these errors occur because the pyarrow.parquet module used by the BigQuery library doesn't convert Python's built-in datetime or time types into something BigQuery recognizes by default, whereas the BigQuery library does have its own method for converting pandas types.

I was able to get the timestamps to upload by changing all instances of datetime.datetime or time.time to pandas.Timestamp. For example:

my_df['timestamp'] = datetime.utcnow()

needs to be changed to

my_df['timestamp'] = pd.Timestamp.now()

I ran into this error: ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: 1602633600999999998

When I inspected the dataframe, I saw values like this: 2021-09-30 23:59:59.999999998

Your date field probably doesn't match the BigQuery default. I then used this code:

df['date_column'] = df['date_column'].astype('datetime64[s]')

and my problem was solved.
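If you need to keep sub-second precision rather than truncating to whole seconds, a variant of this fix (my suggestion, not part of the original answer) is to floor the column to microseconds, which is all BigQuery's TIMESTAMP type stores anyway:

# Drop only the nanosecond part that triggers the ArrowInvalid error;
# microseconds are preserved.
df['date_column'] = df['date_column'].dt.floor('us')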

In my testing of https://github.com/googleapis/python-bigquery-pandas/pull/413, this issue was resolved by upgrading to pandas 1.1.0+.

Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

My solution was to add the following kwargs to to_parquet:

parquet_args = {
    'coerce_timestamps': 'us',
    'allow_truncated_timestamps': True,
}

You have to set both of them. If you only set allow_truncated_timestamps, it will still raise an error when coerce_timestamps is None. I think the idea is that you only want errors suppressed if you have explicitly asked for coercion. In any case, the docs are clear about this, but the behavior wasn't obvious to me.
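For reference, passing them through pandas looks like this (the output filename is arbitrary):

# pandas forwards unknown to_parquet kwargs to pyarrow.parquet.write_table,
# so both options reach the Parquet writer.
df.to_parquet('out.parquet', engine='pyarrow', **parquet_args)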

If you're using write_dataset, here's sample code for eliminating this error with file_options:

import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(coerce_timestamps='us', allow_truncated_timestamps=True)

ds.write_dataset(..., file_options=file_options)
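For context, a runnable end-to-end sketch (the table contents and output directory are made up):

import pyarrow as pa
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(
    coerce_timestamps='us', allow_truncated_timestamps=True
)

# A tiny table with a nanosecond timestamp that would otherwise
# fail the cast to microseconds.
table = pa.table({'ts': pa.array([1578661876547574000], type=pa.timestamp('ns'))})

ds.write_dataset(table, 'out_dir', format=parquet_format, file_options=file_options)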

Adding this because anyone searching for the PyArrow error in the title will end up here.