Neo4j Fast way to match fuzzy text property

Question

I have a reasonable number of nodes (roughly 60,000):

(:Document {title:"A title"})

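Given a title, I want to find the matching node, if it exists. The problem is that the titles I'm given are not consistent. Sometimes the first letter of each word is capitalized, sometimes everything is lower case. Sometimes Key-Words are joined in kebab case, sometimes they are written normally as key words.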

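To compensate for that, I use APOC to compute the Levenshtein distance between the given title and the title of every node, and only accept a node as a match if the distance is below some threshold: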

MATCH (a:Document)
WHERE apoc.text.distance(a.title, "A title") < 10
RETURN a

This doesn't scale well. Currently a single lookup takes about 700 ms, which is too slow, given that the database will probably grow to about 150,000 nodes.

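I was thinking about storing/caching the alternative spellings of a title in an alias:[...] property of the node and building an index over all aliases, but I don't know whether that is possible in Neo4j.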
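A lighter variant of the alias idea is to store a single normalized key per title, so that case and hyphenation differences are removed before any matching happens. The Python sketch below only illustrates that normalization step. The function name `normalize_title` is hypothetical, as is the idea of writing its result to a separate indexed property on each node; neither comes from the question.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Collapse case, hyphenation and spacing differences into one lookup key."""
    # Normalize Unicode so visually identical titles compare equal.
    text = unicodedata.normalize("NFKC", title)
    # casefold() is a more aggressive lower() intended for caseless matching.
    text = text.casefold()
    # Treat hyphens, underscores and any run of whitespace as a single space.
    text = re.sub(r"[-_\s]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    variants = ["Key-Words in Titles", "key words in titles", "  KEY-WORDS  In Titles "]
    # All three spellings collapse to the same key.
    print({normalize_title(v) for v in variants})
```

An exact lookup on such a key only covers the casing and kebab-case variations described above. Real typos would still need a fuzzy comparison.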

What is the fastest way to "fuzzy find" a title, given a large database of nodes?

Answer 1

In Neo4j 3.5 (currently at beta03), there are FTS (Full-Text Search) capabilities.

Edit: I wrote a detailed blog post about FTS in Neo4j: https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html

You can query your documents using the Lucene Classic Query Parser Syntax.

Create the index:

CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])

Import some documents:

LOAD CSV WITH HEADERS FROM "file:///docs.csv" AS row
CREATE (n:Document) SET n = row

Query documents whose title contains "heavy toll":

CALL db.index.fulltext.queryNodes('documents', 'title: "heavy toll"')
YIELD node, score
RETURN node.title, score

╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node.title"                                                          │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│3.7325966358184814│
└──────────────────────────────────────────────────────────────────────┴──────────────────┘

Query the same title with a typo:

CALL db.index.fulltext.queryNodes('documents', 'title: \"heavy~ tall~\"')
YIELD node, score
RETURN node.title, score

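Note the escaping of the quotes (\"): the string passed to the underlying parser should contain the quotes, so that a phrase query is performed instead of a boolean query.

Also, the tilde (~) next to a term indicates that a fuzzy search should be performed, using the Damerau-Levenshtein algorithm.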
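When the title comes from user input, the query string can be assembled on the client before it is handed to db.index.fulltext.queryNodes. The following Python sketch shows one way to do that. The helper name `build_fuzzy_query` is hypothetical. Unlike the phrase query above, it deliberately produces a boolean query with one fuzzy clause per word.

```python
import re


def build_fuzzy_query(title: str, field: str = "title") -> str:
    """Build a Lucene classic-syntax query with a fuzzy (~) clause per word."""
    # Lowercasing keeps words such as "AND"/"OR"/"NOT" from being read as operators;
    # \w+ drops hyphens and other characters that are special in Lucene syntax.
    terms = re.findall(r"\w+", title.lower())
    if not terms:
        raise ValueError("title contains no searchable words")
    # field:(a~ b~) applies the field to every term in the group.
    return f"{field}:(" + " ".join(f"{t}~" for t in terms) + ")"


if __name__ == "__main__":
    print(build_fuzzy_query("Heavy-Tall"))  # title:(heavy~ tall~)
```

The resulting string would be passed as the second argument of db.index.fulltext.queryNodes, ideally as a query parameter rather than concatenated into the Cypher text.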

╒══════════════════════════════════════════════════════════════════════╤═════════════════════╕
│"node.title"                                                          │"score"              │
╞══════════════════════════════════════════════════════════════════════╪═════════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│0.868073046207428    │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Prisons Run by C.E.O.s? Privatization Under Trump Could Carry a Heavy│0.4014900326728821   │
│ Price - The New York Times"                                          │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"‘All Talk,’ ‘No Action,’ Says Trump in Twitter Attack on Civil Rights│0.28181418776512146  │
│ Icon - The New York Times"                                           │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Immigrants Head to Washington to Rally While Obama Is Still There - T│0.24634429812431335  │
│he New York Times"                                                    │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
