Neo4j data model of documents, keywords, and word stems for searching
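My goal is to run two different kinds of search over documents using Neo4j. I'll use recipes (documents) as my example.
Say I have a list of ingredients (keywords) on hand (milk, butter, flour, salt, sugar, eggs, ...) and a database of recipes, each with ingredients attached. I'd like to enter my list and get two different results. The first is the recipes that come closest to containing all of the ingredients I entered. The second is a combination of recipes that together contain all of my ingredients.
Given: milk, butter, flour, salt, sugar, eggs
Search results for the first case might be:
1.) Sugar cookies
2.) Butter cookies
Results for the second might be:
1.) Flat bread and Gogel-Mogel
I'm reading in recipes to insert into Neo4j, pulling the ingredients from the ingredient list at the top of each recipe and also from the recipe instructions. I want to weight these two sources differently, maybe 60/40 in favor of the ingredient list.
I would also like to stem each ingredient in case people enter similar words.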
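To illustrate the stemming step, here is a rough Python sketch of turning user input into stems before querying. `naive_stem` and `to_stems` are hypothetical helpers that only strip simple plural endings; they stand in for a real stemmer (for example a Porter stemmer from an NLP library), which would handle far more cases.

```python
def naive_stem(term: str) -> str:
    """Rough stand-in for a real stemmer: lowercase, trim, strip a simple plural."""
    word = term.strip().lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # "berries" -> "berry"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # "eggs" -> "egg"
    return word


def to_stems(user_input: str) -> list[str]:
    """Split a comma-separated ingredient list and stem every non-empty entry."""
    return [naive_stem(part) for part in user_input.split(",") if part.strip()]


print(to_stems("Milk, butter, flour, salt, sugar, eggs"))
# ['milk', 'butter', 'flour', 'salt', 'sugar', 'egg']
```

The resulting list is what would be passed to a query as the search terms, so that "eggs" typed by a user lines up with a stored stem such as "egg".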
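I'm struggling to come up with a good data model in Neo4j. I plan for the user to enter ingredients in English; I'll stem them in the background and use the stems to search.
My first thought is:
This is intuitive to me, but it takes a lot of hops to find all the recipes.
Next could be this:
This gets to the recipes directly from the stems, but I'd need to pass the recipe ID in the relationship (right?) to get back to the actual ingredients.
Third, maybe combine them like this?
But that has a lot of duplication.
Here are some Cypher statements to create the first idea: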
//Create 4 recipes
create (r1:Recipe {rid:'1', title:'Sugar cookies'}), (r2:Recipe {rid:'2', title:'Butter cookies'}),
(r3:Recipe {rid:'3', title:'Flat bread'}), (r4:Recipe {rid:'4', title:'Gogel-Mogel'})
//Adding some ingredients
merge (i1:Ingredient {ingredient:"salted butter"})
merge (i2:Ingredient {ingredient:"white sugar"})
merge (i3:Ingredient {ingredient:"brown sugar"})
merge (i4:Ingredient {ingredient:"all purpose flour"})
merge (i5:Ingredient {ingredient:"iodized salt"})
merge (i6:Ingredient {ingredient:"eggs"})
merge (i7:Ingredient {ingredient:"milk"})
merge (i8:Ingredient {ingredient:"powdered sugar"})
merge (i9:Ingredient {ingredient:"wheat flour"})
merge (i10:Ingredient {ingredient:"bananas"})
merge (i11:Ingredient {ingredient:"chocolate chips"})
merge (i12:Ingredient {ingredient:"raisins"})
merge (i13:Ingredient {ingredient:"unsalted butter"})
merge (i14:Ingredient {ingredient:"wheat flour"})
merge (i15:Ingredient {ingredient:"himalayan salt"})
merge (i16:Ingredient {ingredient:"chocolate bars"})
merge (i17:Ingredient {ingredient:"vanilla flavoring"})
merge (i18:Ingredient {ingredient:"vanilla"})
//Stems added to each ingredient
merge (i1)<-[:STEM_OF]-(s1:Stem {stem:"butter"})
merge (i2)<-[:STEM_OF]-(s2:Stem {stem:"sugar"})
merge (i3)<-[:STEM_OF]-(s2)
merge (i4)<-[:STEM_OF]-(s4:Stem {stem:"flour"})
merge (i5)<-[:STEM_OF]-(s5:Stem {stem:"salt"})
merge (i6)<-[:STEM_OF]-(s6:Stem {stem:"egg"})
merge (i7)<-[:STEM_OF]-(s7:Stem {stem:"milk"})
merge (i8)<-[:STEM_OF]-(s2)
merge (i9)<-[:STEM_OF]-(s4)
merge (i10)<-[:STEM_OF]-(s10:Stem {stem:"banana"})
merge (i11)<-[:STEM_OF]-(s11:Stem {stem:"chocolate"})
merge (i12)<-[:STEM_OF]-(s12:Stem {stem:"raisin"})
merge (i13)<-[:STEM_OF]-(s1)
merge (i14)<-[:STEM_OF]-(s4)
merge (i15)<-[:STEM_OF]-(s5)
merge (i16)<-[:STEM_OF]-(s11)
merge (i17)<-[:STEM_OF]-(s13:Stem {stem:"vanilla"})
merge (i18)<-[:STEM_OF]-(s13)
create (r1)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r1)<-[:INGREDIENTS_LIST{weight:.6}]-(i2)
create (r1)<-[:INGREDIENTS_LIST{weight:.5}]-(i4)
create (r1)<-[:INGREDIENTS_LIST{weight:.4}]-(i5)
create (r1)<-[:INGREDIENTS_LIST{weight:.4}]-(i6)
create (r1)<-[:INGREDIENTS_LIST{weight:.2}]-(i7)
create (r1)<-[:INGREDIENTS_LIST{weight:.1}]-(i18)
create (r2)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r2)<-[:INGREDIENTS_LIST{weight:.6}]-(i3)
create (r2)<-[:INGREDIENTS_LIST{weight:.5}]-(i4)
create (r2)<-[:INGREDIENTS_LIST{weight:.4}]-(i5)
create (r2)<-[:INGREDIENTS_LIST{weight:.3}]-(i6)
create (r2)<-[:INGREDIENTS_LIST{weight:.2}]-(i7)
create (r2)<-[:INGREDIENTS_LIST{weight:.1}]-(i18)
create (r3)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r3)<-[:INGREDIENTS_LIST{weight:.6}]-(i5)
create (r3)<-[:INGREDIENTS_LIST{weight:.5}]-(i7)
create (r3)<-[:INGREDIENTS_LIST{weight:.4}]-(i9)
create (r4)<-[:INGREDIENTS_LIST{weight:.6}]-(i2)
create (r4)<-[:INGREDIENTS_LIST{weight:.5}]-(i6)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i1)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i2)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i4)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i5)
create (r1)<-[:INGREDIENTS_INSTR{weight:.1}]-(i6)
create (r1)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)
create (r2)<-[:INGREDIENTS_INSTR{weight:.3}]-(i1)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i3)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i4)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i5)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i6)
create (r2)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)
create (r3)<-[:INGREDIENTS_INSTR{weight:.3}]-(i1)
create (r3)<-[:INGREDIENTS_INSTR{weight:.3}]-(i5)
create (r3)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)
create (r3)<-[:INGREDIENTS_INSTR{weight:.1}]-(i9)
create (r4)<-[:INGREDIENTS_INSTR{weight:.3}]-(i2)
create (r4)<-[:INGREDIENTS_INSTR{weight:.3}]-(i6)
And a link to the Neo4j console with the statements above:
http://console.neo4j.org/?id=3o8y44
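How much does Neo4j care about multiple relationships between the same nodes? Also, I can handle one ingredient, but how do I put together a query to get recipes given more than one ingredient?
Edit:
Thanks Michael! That got me further. I was able to expand your answer to this: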
WITH split("egg, sugar, chocolate, milk, flour, salt",", ") as terms
UNWIND terms as term
MATCH (stem:Stem {stem:term})-[:STEM_OF]->(ingredient:Ingredient)-[lst:INGREDIENTS_LIST]->(r:Recipe)
WITH r, size(terms) - count(distinct stem) as notCovered, sum(lst.weight) as weight, collect(distinct stem.stem) as matched
RETURN r, notCovered, matched, weight
ORDER BY notCovered ASC, weight DESC
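and got the list of matched ingredients and the weight. How would I change the query to also show the weights of the :INGREDIENTS_INSTR relationship, so that I can use both weights in a calculation at the same time? [lst:INGREDIENTS_LIST|INGREDIENTS_INSTR] is not what I want.
Edit:
This seems to work, is it correct?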
WITH split("egg, sugar, chocolate, milk, flour, salt",", ") as terms
UNWIND terms as term
MATCH (stem:Stem {stem:term})-[:STEM_OF]->(ingredient:Ingredient)-[lstl:INGREDIENTS_LIST]->(r:Recipe)<-[lsti:INGREDIENTS_INSTR]-(ingredient:Ingredient)
WITH r, size(terms) - count(distinct stem) as notCovered, sum(lsti.weight) as wi, sum(lstl.weight) as wl, collect(distinct stem.stem) as matched
RETURN r, notCovered, matched, wl+wi
ORDER BY notCovered ASC, wl+wi DESC
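Also, can you help with the second question? Given a list of ingredients, it would return combinations of recipes that together contain the given ingredients. Thanks again!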
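One way to reason about combining the two weights, outside of Cypher, is to treat a missing relationship as weight 0 and blend the two sums 60/40. The Python sketch below is only an illustration: `combined_score` and the 0.6/0.4 constants are assumptions, and the weights are copied from the Sugar cookies rows in the CREATE statements above. Note that a single MATCH pattern requiring both relationships only matches ingredients that have both, so an ingredient such as vanilla, which has only an INGREDIENTS_LIST relationship, would not contribute to either sum.

```python
LIST_SHARE = 0.6   # assumed share for the ingredient list (the "60" in 60/40)
INSTR_SHARE = 0.4  # assumed share for the instructions (the "40")


def combined_score(list_weights, instr_weights, matched):
    """Blend both weights per matched ingredient; a missing relationship counts as 0."""
    total = 0.0
    for ingredient in matched:
        total += LIST_SHARE * list_weights.get(ingredient, 0.0)
        total += INSTR_SHARE * instr_weights.get(ingredient, 0.0)
    return total


# Weights for "Sugar cookies", copied from the sample data.
sugar_cookies_list = {
    "salted butter": 0.7, "white sugar": 0.6, "all purpose flour": 0.5,
    "iodized salt": 0.4, "eggs": 0.4, "milk": 0.2, "vanilla": 0.1,
}
sugar_cookies_instr = {
    "salted butter": 0.2, "white sugar": 0.2, "all purpose flour": 0.2,
    "iodized salt": 0.2, "eggs": 0.1, "milk": 0.1,
}

# Ingredients matched by the terms egg, sugar, milk, flour, salt.
matched = ["eggs", "white sugar", "milk", "all purpose flour", "iodized salt"]
score = combined_score(sugar_cookies_list, sugar_cookies_instr, matched)
print(round(score, 2))  # 1.58
```

Keeping the two sums separate until the final blend makes it easy to change the ratio later without touching the stored weights.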
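I would go with your version 1).
Don't worry about the additional hops.
You'd put the information about amount/weight on the relationship between the recipe and the actual ingredient.
You can have multiple relationships.
Here is an example query; it doesn't work for your dataset, because you don't have recipes that contain all the ingredients: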
WITH split("milk, butter, flour, salt, sugar, eggs",", ") as terms
UNWIND terms as term
MATCH (stem:Stem {stem:term})-[:STEM_OF]->(ingredient:Ingredient)-->(r:Recipe)
WITH r, size(terms) - count(distinct stem) as notCovered
RETURN r ORDER BY notCovered ASC LIMIT 2
+-----------------------------------------+
| r                                       |
+-----------------------------------------+
| Node[0]{rid:"1",title:"Sugar cookies"}  |
| Node[1]{rid:"2",title:"Butter cookies"} |
+-----------------------------------------+
2 rows
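The ranking logic of that query can be mimicked in plain Python, which may help when checking results by hand. This is a sketch, not part of the original answer: `RECIPE_STEMS` is a hand-built copy of the sample data (recipe title mapped to the stems of its ingredients) and `rank_by_coverage` is a hypothetical helper. Because the term "eggs" has no matching stem (the stored stem is "egg"), the best recipes still show one term as not covered.

```python
COOKIE_STEMS = {"butter", "sugar", "flour", "salt", "egg", "milk", "vanilla"}

# Stems reachable from each recipe in the sample data.
RECIPE_STEMS = {
    "Sugar cookies": set(COOKIE_STEMS),
    "Butter cookies": set(COOKIE_STEMS),
    "Flat bread": {"butter", "salt", "milk", "flour"},
    "Gogel-Mogel": {"sugar", "egg"},
}


def rank_by_coverage(terms, recipe_stems, limit=2):
    """Return (notCovered, title) pairs, fewest uncovered terms first."""
    wanted = set(terms)
    scored = []
    for title, stems in recipe_stems.items():
        covered = wanted & stems
        if covered:  # like MATCH: recipes with no matching stem produce no row
            scored.append((len(terms) - len(covered), title))
    scored.sort(key=lambda pair: pair[0])  # stable sort keeps ties in input order
    return scored[:limit]


terms = "milk, butter, flour, salt, sugar, eggs".split(", ")
print(rank_by_coverage(terms, RECIPE_STEMS))
# [(1, 'Sugar cookies'), (1, 'Butter cookies')]
```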
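Here is an optimization for large datasets:
For the query, you would first find all the ingredients, then the recipes attached to the most selective one (the one with the lowest degree), and then check the remaining ingredients against each recipe.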
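WITH split("milk, butter, flour, salt, sugar, eggs",", ") as terms
MATCH (stem:Stem) WHERE stem.stem IN terms
// most selective stem (lowest degree) first
// on Neo4j 5+, size() over a pattern is replaced by COUNT { (stem)-[:STEM_OF]->() }
WITH stem, terms ORDER BY size((stem)-[:STEM_OF]->()) ASC
WITH terms, collect(stem) as stems
WITH head(stems) as first, tail(stems) as rest, terms
MATCH (first)-[:STEM_OF]->(ingredient:Ingredient)-->(r:Recipe)
// count how many of the remaining stems also reach this recipe;
// DISTINCT collapses recipes reached through several ingredients or relationships
WITH DISTINCT r, terms, size([other IN rest WHERE (other)-[:STEM_OF]->(:Ingredient)-->(r)]) as covered
WITH r, size(terms) - 1 - covered as notCovered
RETURN r ORDER BY notCovered ASC LIMIT 2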
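The second question (a combination of recipes that together cover all ingredients) is a set-cover problem, and a common approximation is a greedy pass: repeatedly pick the recipe that covers the most still-missing stems. The Python sketch below assumes the candidate recipes and their stems have already been fetched from Neo4j; `greedy_cover` and the sample dictionary are illustrative, and the greedy result is not guaranteed to be the smallest possible combination.

```python
def greedy_cover(terms, recipe_stems):
    """Pick recipes one at a time, each covering the most remaining terms.

    Returns (chosen recipe titles, terms that no recipe could cover).
    """
    remaining = set(terms)
    chosen = []
    while remaining and recipe_stems:
        # Recipe that covers the largest number of still-missing terms.
        best = max(recipe_stems, key=lambda title: len(recipe_stems[title] & remaining))
        gained = recipe_stems[best] & remaining
        if not gained:
            break  # nothing left that any recipe can cover
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Two candidate recipes from the sample data, keyed by title.
candidates = {
    "Flat bread": {"butter", "salt", "milk", "flour"},
    "Gogel-Mogel": {"sugar", "egg"},
}

wanted = ["milk", "butter", "flour", "salt", "sugar", "egg"]
print(greedy_cover(wanted, candidates))
# (['Flat bread', 'Gogel-Mogel'], set())
```

With the full sample data the greedy pass would stop after a single cookie recipe, since it already covers every term; restricting the candidates (for example to recipes the user has not seen yet) is one way to get genuine combinations.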