Neo4j Cypher:当基数很高时如何优化 NOT EXISTS 查询
Neo4j Cypher: How to optimize a NOT EXISTS Query when cardinality is high
当发帖的基数 b/w 用户约为 8000(一个用户查看约 8000 个帖子)时,以下查询占用 1 秒以上的时间和消费者约 7 MB。由于高且线性增长的延迟和内存消耗,很难对其进行扩展。是否有可能对此进行不同的建模and/or 优化查询?
查询
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
计划
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +ProduceResults@neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Projection@neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Filter@neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Expand(All)@neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +AntiSemiApply@neo4j | | 2158 | 10 | 0 | 0/0 | | |
| |\ +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +Expand(Into)@neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +NodeUniqueIndexSeek@neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +NodeByLabelScan@neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
是的,这可以改进。
首先,让我们了解一下这是做什么的。
首先,它从 NodeByLabelScan 开始。有道理,没法回避。
但是,对于标签的每个节点(以下执行 PER ROW!),它与用户 2 匹配,并展开所有 :VIEWED 来自用户 2 的关系,以查看它们是否是 post 对于那个特定的行。
你能看出为什么这样效率低下吗?按照PROFILE计划有8630个post节点,所以用户2被索引查找了8630次,他们的:VIEWED关系被展开了8630次。为什么是8630次?因为这是每个 :Post 节点发生的。
相反,试试这个:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
这会稍微改变一下。
首先它与用户 2 的查看 post 匹配(查找和扩展只执行一次),然后收集那些查看的 post。
然后它将进行标签扫描,并进行过滤,使 post 不在已查看 post 的集合中。
当发帖的基数 b/w 用户约为 8000(一个用户查看约 8000 个帖子)时,以下查询占用 1 秒以上的时间和消费者约 7 MB。由于高且线性增长的延迟和内存消耗,很难对其进行扩展。是否有可能对此进行不同的建模and/or 优化查询?
查询
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
计划
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +ProduceResults@neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Projection@neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Filter@neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Expand(All)@neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +AntiSemiApply@neo4j | | 2158 | 10 | 0 | 0/0 | | |
| |\ +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +Expand(Into)@neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +NodeUniqueIndexSeek@neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +NodeByLabelScan@neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
是的,这可以改进。
首先,让我们了解一下这是做什么的。
首先,它从 NodeByLabelScan 开始。有道理,没法回避。
但是,对于标签的每个节点(以下执行 PER ROW!),它与用户 2 匹配,并展开所有 :VIEWED 来自用户 2 的关系,以查看它们是否是 post 对于那个特定的行。
你能看出为什么这样效率低下吗?按照PROFILE计划有8630个post节点,所以用户2被索引查找了8630次,他们的:VIEWED关系被展开了8630次。为什么是8630次?因为这是每个 :Post 节点发生的。
相反,试试这个:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
这会稍微改变一下。
首先它与用户 2 的查看 post 匹配(查找和扩展只执行一次),然后收集那些查看的 post。
然后它将进行标签扫描,并进行过滤,使 post 不在已查看 post 的集合中。