Using in-memory filesystem in `pyarrow` tests

Question

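I have some pyarrow code that writes a Parquet dataset. I want an integration test that ensures the files are written correctly. I'd like to do that by writing a small example chunk of data to an in-memory filesystem. However, I'm struggling to find a pyarrow-compatible in-memory filesystem interface for Python.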

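Below is a snippet of code that contains a filesystem variable. I want to replace the filesystem variable with an in-memory filesystem that I can later inspect programmatically in my integration test.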

import pyarrow.parquet as pq

pq.write_to_dataset(
    score_table,
    root_path=AWS_ZEBRA_OUTPUT_S3_PREFIX,
    filesystem=filesystem,
    partition_cols=[
        EQF_SNAPSHOT_YEAR_PARTITION,
        EQF_SNAPSHOT_MONTH_PARTITION,
        EQF_SNAPSHOT_DAY_PARTITION,
        ZEBRA_COMPUTATION_TIMESTAMP,
    ],
)

Answer 1

You can pass an in-memory file object to write_to_dataset if filesystem is None.

So your call might become:

from io import BytesIO
import pyarrow.parquet as pq

with BytesIO() as f:
    pq.write_to_dataset(
        score_table,
        root_path=f,
        filesystem=None,
        partition_cols=[
            EQF_SNAPSHOT_YEAR_PARTITION,
            EQF_SNAPSHOT_MONTH_PARTITION,
            EQF_SNAPSHOT_DAY_PARTITION,
            ZEBRA_COMPUTATION_TIMESTAMP
        ]
    )

Relevant lines from the pyarrow source:

def resolve_filesystem_and_path(where, filesystem=None):
    """
    Return filesystem from path which could be an HDFS URI, a local URI,
    or a plain filesystem path.
    """
    if not _is_path_like(where):
        if filesystem is not None:
            raise ValueError("filesystem passed but where is file-like, so"
                             " there is nothing to open with filesystem.")
        return filesystem, where

https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/filesystem.py#L402-L411

Answer 2

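In the end I manually implemented an instance of the pyarrow.FileSystem ABC. Using mock for the test seemed to fail because pyarrow checks the type of the filesystem argument passed to write_to_dataset (not in the most Pythonic way): https://github.com/apache/arrow/blob/5e201fed061f2a95e66889fa527ae8ef547e9618/python/pyarrow/filesystem.py#L383. I'd recommend changing the logic in this method so that it does not explicitly check the type (even isinstance would be preferable!), to allow for easier testing.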
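
To show why the kind of check matters for mocks, here is a small standard-library sketch. The FileSystem class and both helper functions are hypothetical stand-ins written for this example, not pyarrow's actual code. A check based on type() sees the mock's real class, whereas isinstance() honours the class that Mock(spec=...) reports.

```python
from unittest.mock import Mock


class FileSystem:
    """Hypothetical stand-in for a filesystem base class."""

    def open(self, path, mode="rb"):
        raise NotImplementedError


def accepts_by_type_lookup(fs):
    # Inspects the object's real class; for a mock that is a Mock subclass.
    return issubclass(type(fs), FileSystem)


def accepts_by_isinstance(fs):
    # isinstance() also consults __class__, which Mock(spec=...) sets to the spec.
    return isinstance(fs, FileSystem)


fake_fs = Mock(spec=FileSystem)

strict_result = accepts_by_type_lookup(fake_fs)   # False: the mock is rejected
lenient_result = accepts_by_isinstance(fake_fs)   # True: the mock is accepted
```

With a type()-based check, the only way through is a real subclass, which is why hand-implementing the ABC worked where a mock did not.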

python

filesystems

parquet

pyarrow