Google Datastore: ndb.put_multi not returning

Question

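I am currently using the NDB library to re-insert some entities from an XML file into Google Datastore. The problem I am seeing is that ndb.put_multi() sometimes does not seem to return, and the script hangs waiting for it.

The code basically does the following: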

@ndb.toplevel
def insertAll(entities):
    ndb.put_multi(entities)

entities = []
n_entities = 0  # running count of entities handed to insertAll()
for event, case in tree:
    removeNamespace(case)
    if (case.tag == "MARKGR" and event == "end"):
        # get ndb.Model entities
        tm, app, rep = decodeTrademark(case)

        entities.append(tm)
        for app_et in app:
            entities.append(app_et)
        for rep_et in rep:
            entities.append(rep_et)
        if (len(entities) > 200):
            n_entities += len(entities)
            insertAll(entities)
            entities = []

if(len(entities) > 0):
    insertAll(entities)

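I have noticed this behavior before, but it seems to be non-deterministic. Is there a way to debug this properly and/or set a timeout on ndb.put_multi, so that I can at least retry if it has not returned after a given time?

Thanks in advance,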

Answer 1

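Original answer (before the OP's edit)

There is a problem with your logic. insertAll() may never be called. Where are app and rep defined? If they are defined outside this function, why are they inside nested loops? Any entity in rep gets written len(app) * len(tree) times!

Also, what about the case where len(entities) < 200? That check sits inside 3 nested loops, so there are bound to be iterations where len(entities) < 200. If the total is 750 after all the loops, think about the orphaned entities: you will orphan 150 of them.

At the very least, append this after the loops have run, to write the orphaned entities (< 200):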

if len(entities) > 0:
    insertAll(entities)

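Also try reducing 200 to a smaller value, such as 100. Depending on the size of the entities, 200 may be too many to finish before the timeout.

Have you checked whether any entities were written at all?

Also, are you sure you understand what an entity is in Datastore terms? If you are simply pulling strings out of the XML file, those strings are not entities. rep and app must be lists of Datastore entities, and tm must be an actual Datastore entity.

Update:

OK, that makes more sense, but you are still orphaning some entities and have no control over the size of each put_multi() call. Instead of if (len(entities) > 200):, you should batch them: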

# primitive way to batch in groups of 100
batch_size = 100
num_full_batches = len(entities) // batch_size
remaining_count = len(entities) % batch_size

for i in range(num_full_batches):
    ndb.put_multi(entities[i * batch_size : (i+1) * batch_size])

# write the leftover entities; indexing by num_full_batches (not i)
# also works when there are no full batches and the loop never ran
if remaining_count > 0:
    ndb.put_multi(entities[num_full_batches * batch_size:])
If there are too many entities, you should hand the work off to a task queue.
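
The batching idea above can also be written as a small helper that does not depend on NDB. This is only a sketch, not the answerer's code: chunked and put_in_batches are hypothetical names, and put_fn is an assumption standing in for ndb.put_multi so that the slicing logic can be run on its own.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def put_in_batches(entities, put_fn, batch_size=100):
    """Call `put_fn` once per batch; return how many entities were passed on.

    `put_fn` is a stand-in (assumption) for ndb.put_multi.
    """
    written = 0
    for batch in chunked(entities, batch_size):
        put_fn(batch)
        written += len(batch)
    return written


if __name__ == "__main__":
    calls = []  # records each batch instead of writing to Datastore
    total = put_in_batches(list(range(250)), calls.append, batch_size=100)
    print(total, [len(c) for c in calls])
```

With 250 items and a batch size of 100, the stand-in is called three times, with batches of 100, 100 and 50 items, so nothing is left behind. In the real script, put_fn would presumably be ndb.put_multi.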

Answer 2

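Judging from the comments you left earlier, this application appears to be hitting the entity read/write limit, which is 1 op/s. You can read more about the Datastore limits here.

As an alternative, you can try Cloud Firestore in Datastore mode, since it doesn't have some of these limits.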

Answer 3

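This is based on Ikai Lan's "App Engine datastore tip: monotonically increasing values are bad".

Monotonically increasing values are values that are stored/read/written in strictly sequential order, such as timestamps in a log. In the current Datastore implementation they are stored/read/written one after another in the same location, so Datastore cannot split the workload properly. When the ops rate is high enough and Datastore cannot scale horizontally, you will notice a slowdown. This is called hotspotting.

On top of that, Datastore creates an index for each indexable property (except, for example, Text properties), which means you can end up with hotspots in several places at some point.

Workaround

One of the workarounds mentioned in the official documentation is to prepend a hash to the indexed value: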

If you do have a key or indexed property that will be monotonically increasing then you can prepend a random hash to ensure that the keys are sharded onto multiple tablets.

Read more under "High read/write rates to a narrow key range".
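
As a rough illustration of the hash-prefix workaround, here is a minimal sketch. sharded_key_name and prefix_len are hypothetical names, not part of NDB, and using a deterministic SHA-256 digest of the value (rather than a truly random prefix) is an assumption made here so that the same value always maps to the same key name.

```python
import hashlib


def sharded_key_name(value, prefix_len=4):
    """Prefix `value` with the first hex digits of its SHA-256 digest.

    Hypothetical helper: the prefix scatters otherwise sequential values
    across the key space, and it can be recomputed from the value itself.
    """
    text = str(value)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return "{}-{}".format(digest[:prefix_len], text)


if __name__ == "__main__":
    # Sequential values (e.g. timestamps) no longer sort next to each other.
    for ts in (1500000000, 1500000001, 1500000002):
        print(sharded_key_name(ts))
```

The trade-off is that range queries over the original value no longer work directly on the prefixed key or property, so this fits values that are looked up by exact match.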
