Gremlin

Question

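I am trying to dump all edge IDs from my graph into a text file without using too much memory or time.

My first idea was to use lazy iteration. To do that, I created a traversal object t = g.E().id() and called t.next(x) in a while loop.

But for a large number of edges it fails with the following error: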

Error in /apps/external/4/.../get_edges.groovy at [24: }] - GC overhead limit exceeded

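Note that it fails inside the while loop, since it does manage to write out a subset of the IDs.

This is the script I submit to the Gremlin Console. It works for small graphs, but on my system it fails for larger graphs (millions of edges).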

:remote connect tinkerpop.server conf/remote.yaml session
:remote console

chunkSize = 500
indexModToFile = 1000
idx = 0
edgesFileName = 'edges.txt'
statusFileName = 'status.txt'
new File(statusFileName).withWriter('utf-8') { def statusWriter ->
    new File(edgesFileName).withWriter('utf-8') { def edgeWriter ->
        t = g.E().id()
        def i
        // next(chunkSize) returns an empty list once the traversal is exhausted, which ends the loop
        while (i = t.next(chunkSize)) {
            i.each { def e ->
                edgeWriter << e.toString() + '\n'
                idx += 1
            }
            // report progress after each chunk, inside the loop
            if (idx % indexModToFile == 0) {
                statusWriter << idx.toString() + '\n'
            }
        }
    }
}

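Questions:

Why is this failing?
Is there a better and faster way to extract all of the edge IDs?

Edit 1

I also tried export JAVA_OPTS="-Xms4G -Xmx6G" (it still fails), but I don't think that should be necessary for a lazy iterator.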

Answer 1

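This doesn't really answer the question of why lazy iteration fails, but it is a fast alternative for extracting the IDs that may help someone.

I already had an Elasticsearch index set up on the edge properties, so I simply queried all available documents and kept only their IDs.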

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()
es_response = scan(es,
                   index='janusgraph_bydate',
                   query={"query": {"match_all": {}}, "stored_fields": []})

id_lst = [item['_id'] for item in es_response]
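
Since the goal in the question is a text file rather than an in-memory list, the same scan can be streamed straight to disk. The following is a minimal sketch, not part of the original answer: write_ids is a hypothetical helper name, and it assumes each hit is a dict with an '_id' key, as in the list comprehension above.

```python
def write_ids(hits, path):
    """Write the '_id' of each hit to `path`, one per line; return the count.

    `hits` can be any iterable of dicts with an '_id' key, for example the
    generator returned by elasticsearch.helpers.scan. Only one hit is held
    in memory at a time.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for hit in hits:
            out.write(f"{hit['_id']}\n")
            count += 1
    return count


# Usage with the scan() call from this answer (needs a reachable cluster):
# write_ids(scan(es, index='janusgraph_bydate',
#                query={"query": {"match_all": {}}, "stored_fields": []}),
#           'edges.txt')
```

Because scan returns a generator, this avoids materializing id_lst at all, which may matter with millions of edges.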

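Update

Instead of t = g.E(), use t = g.E(); [] (the trailing [] keeps the Gremlin Console from iterating the traversal when it is assigned).
Increase the -Xmx option in JAVA_OPTIONS in the gremlin-server.sh file. That setting was overriding my global settings, and once I raised it to -Xmx4096m it worked.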

JAVA_OPTIONS="-Xms32m -Xmx4096m -javaagent:$JANUSGRAPH_LIB/jamm-0.3.0.jar -

Answer 2

Why is this failing?

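I wonder whether you run into memory problems even after increasing Xmx because the script you are executing does all of its work in a single transaction on the server. Perhaps you should try calling g.tx().rollback() after each batch completes to see if that resolves the problem.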
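
The batch-then-release idea can be sketched independently of Gremlin. The Python below is only an illustration under stated assumptions: write_in_batches and its after_batch hook are hypothetical names, and the hook stands in for whatever per-batch cleanup applies (in the asker's Groovy script, that would be the g.tx().rollback() call suggested above).

```python
from itertools import islice


def write_in_batches(ids, out, batch_size, after_batch=None):
    """Consume `ids` lazily in chunks of `batch_size`, writing one ID per line.

    `after_batch` is a hypothetical hook called with the running total after
    each chunk; it is the place for per-batch cleanup or progress reporting.
    Returns the total number of IDs written.
    """
    iterator = iter(ids)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # the source is exhausted
            break
        out.writelines(f"{item}\n" for item in batch)
        total += len(batch)
        if after_batch is not None:
            after_batch(total)
    return total
```

Only one batch is held in memory at a time, so memory use is bounded by batch_size as long as the source itself is lazy.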

Is there a better and faster way to extract all of the edge IDs?

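If you have millions of edges, the most efficient way is to use spark-gremlin. Documentation for doing so can be found here. Short of that, I wouldn't bother with Gremlin Server; just create a JanusGraph instance in the Gremlin Console and execute the script locally.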

Gremlin - Memory/time efficient way to get all edge IDs from a graph

groovy