Quantopian / Zipline: weird pattern in Pipeline package

Question

I recently came across a very strange pattern in the "Pipeline" API from Quantopian/Zipline: it has a CustomFactor class whose compute() method you override to implement your own factor model.

The signature of compute() is: def compute(self, today, assets, out, *inputs), and the parameter "out" is documented as follows:

Output array of the same shape as assets. compute should write its desired return values into out.

When I asked why the function couldn't simply return an output array instead of writing into an input parameter, I received the following answer:

"If the API required that the output array be returned by compute(), we'd end up doing a copy of the array into the actual output buffer which means an extra copy would get made unnecessarily."

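I don't understand why they would end up doing that copy... Python obviously has no pass-by-value issue here, so there is no risk of copying data unnecessarily. This is really painful, because this is the kind of implementation they end up recommending people write: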

    def compute(self, today, assets, out, data):
        out[:] = data[-1]

So my question is, why can't it simply be:

    def compute(self, today, assets, data):
        return data[-1]

Answer 1

(I designed and implemented the API in question.)

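You are correct that Python objects aren't copied when they are passed into and out of functions. The reason there is a difference between returning a row from your CustomFactor and writing values into a provided array has to do with the copies that would be made in the code that calls your CustomFactor's compute method.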
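As a small illustration of the first point, here is a NumPy sketch (the function names are made up for this example and are not part of Zipline). Passing an array into a function does not copy it, so slice assignment writes into the caller's buffer, while merely rebinding the parameter name does not.

```python
import numpy as np


def write_in_place(out, row):
    # Slice assignment copies the values of `row` into the caller's buffer.
    out[:] = row


def rebind_only(out, row):
    # Rebinding the local name leaves the caller's buffer untouched.
    out = row
    return out


buffer = np.zeros(3)
row = np.array([1.0, 2.0, 3.0])

rebind_only(buffer, row)
untouched = buffer.copy()      # snapshot: still all zeros

write_in_place(buffer, row)
filled = buffer.copy()         # snapshot: now holds the values of `row`
```

This is also why the recommended body is `out[:] = ...` rather than `out = ...`.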

When the CustomFactor API was originally designed, the code that calls your compute method looked roughly like this:

    def _compute(self, windows, dates, assets):
        # `windows` here is a list of iterators yielding 2D slices of
        # the user's requested inputs.

        # `dates` and `assets` are row/column labels for the final output.

        # Allocate a (dates x assets) output array.
        # Each invocation of the user's `compute` function
        # corresponds to one row of output.
        output = allocate_output()

        for i in range(len(dates)):

            # Grab the next set of input arrays.
            inputs = [next(w) for w in windows]

            # Call the user's compute, which is responsible for writing
            # values into `out`.
            self.compute(
                dates[i],
                assets,
                # This index is a non-copying operation.
                # It creates a view into row `i` of `output`.
                output[i],
                *inputs  # Unpack all the inputs.
            )

        return output

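The basic idea here is that we've pre-fetched a large amount of data, and we now loop over windows into that data, call the user's compute function on each window, and write the result into a pre-allocated output array, which is then passed along to further transformations.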
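The key detail is that indexing a single row of a 2D NumPy array yields a view rather than a copy. A minimal sketch, using a made-up 2 x 4 buffer in place of the real (dates x assets) array:

```python
import numpy as np

output = np.full((2, 4), np.nan)   # (dates x assets) buffer owned by the caller
row_view = output[0]               # basic indexing: a view, not a copy

shares = np.shares_memory(row_view, output)

# Writing through the view, as a user's compute would via `out[:] = ...`,
# changes the underlying buffer directly.
row_view[:] = [1.0, 2.0, 3.0, 4.0]
first_row = output[0].tolist()
```

Here `shares` is `True`, and the first row of `output` holds the written values while the second row is still NaN.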

No matter what we do, we have to pay the cost of at least one copy to get the result of the user's compute function into the output array.

The most obvious API, as you point out, is to have the user simply return the output row, in which case the calling code would look like this:

    # Get the result row from the user.
    result_row = self.compute(dates[i], assets, *inputs)
    # Copy the user's result into our output buffer.
    output[i] = result_row

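If that were the API, then for each invocation of the user's compute we would be locked into paying at least the following costs:

1. Allocating the array that the user will return.
2. A copy from the user's computed data into the user's output array.
3. A copy from the user's output array into our own, larger array.

With the existing API, we avoid costs (1) and (3).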
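The two calling conventions can be compared side by side in a short NumPy sketch. The factor here (a mean over the window) and the function names are assumptions chosen for illustration; this is not Zipline's engine code.

```python
import numpy as np


def compute_returning(data):
    # Return-style: NumPy allocates a fresh array for the result.
    return data.mean(axis=0)


def compute_into(out, data):
    # Out-style: the reduction writes straight into the caller's row.
    np.mean(data, axis=0, out=out)


data = np.arange(12, dtype=float).reshape(3, 4)   # (window_length x assets)

output_a = np.empty((1, 4))
result_row = compute_returning(data)   # cost (1): a new row is allocated
output_a[0] = result_row               # cost (3): copy into the larger buffer

output_b = np.empty((1, 4))
compute_into(output_b[0], data)        # no separate result row, no second copy
```

Both buffers end up with the same values; the difference is only in how many intermediate arrays and copies are needed to get there.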

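With all that said, we've since made changes to how CustomFactors work that make some of the above optimizations less useful. In particular, we now only pass compute the data for assets that weren't masked out on that day, which requires a partial copy of the output array before and after the call to compute.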
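Why masking forces extra copies can be seen in a rough sketch. This is a guess at the general shape of such a call, not Zipline's actual implementation; the `mask` array and the `None` passed for `today` are placeholders.

```python
import numpy as np


def compute(today, assets, out, data):
    # User-style compute: latest value for each asset it was given.
    out[:] = data[-1]


assets = np.array([10, 20, 30, 40])
mask = np.array([True, False, True, True])        # asset 20 is masked out today
data = np.arange(8, dtype=float).reshape(2, 4)    # (window_length x assets)

output_row = np.full(4, np.nan)

# Boolean indexing makes copies, so the caller pays for them around the call.
masked_data = data[:, mask]
masked_out = output_row[mask]
compute(None, assets[mask], masked_out, masked_data)
output_row[mask] = masked_out                     # copy the results back
```

Because boolean (mask) indexing cannot produce a view, the user's writes land in a temporary array and have to be copied back into the real output row afterwards.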

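There are still some design reasons to prefer the existing API, though. In particular, having the engine control the output allocation makes it easier for us to do things like pass recarrays for multi-output factors.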
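To show what caller-controlled allocation makes possible, here is a plain NumPy sketch of a compute function writing into a record array. The field names `low` and `high` are hypothetical, and the allocation shown is only an assumption about how a caller might set this up.

```python
import numpy as np


def compute(today, assets, out, data):
    # Multi-output style: each field of `out` is written separately.
    out.low[:] = data.min(axis=0)
    out.high[:] = data.max(axis=0)


data = np.array([[1.0, 5.0], [3.0, 2.0]])         # (window_length x assets)

# The caller decides the layout of the output, here one record per asset.
out = np.recarray((2,), dtype=[("low", "f8"), ("high", "f8")])
compute(None, np.array([10, 20]), out, data)
```

With a return-based API the user would have to build such a structure themselves; with `out` supplied by the caller, the user only fills in the fields.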

