Counting string occurrences with ArangoDB AQL
To count the number of objects that contain a specific attribute value, I can do this:
FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount
But how do I count the other three occurrences in the same query, without causing subqueries that run over the same data set multiple times?
I tried something like:
FOR t IN thing
  COLLECT
    other = t.name == "Other",
    some = t.name == "Some",
    thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN {
    other, some, thing,
    count
  }
But I can't make sense of the results: I must be approaching this the wrong way?
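If all you need are the three totals (not per-document counts), one way to get them in a single pass is conditional aggregation with COLLECT AGGREGATE. This is a sketch, assuming the collection is named thing as in the question:

```aql
FOR t IN thing
  COLLECT AGGREGATE
    otherCount = SUM(t.name == "Other" ? 1 : 0),
    someCount  = SUM(t.name == "Some"  ? 1 : 0),
    thingCount = SUM(t.name == "Thing" ? 1 : 0)
  RETURN { otherCount, someCount, thingCount }
```

Each document is inspected once, and every SUM() accumulates 1 only when its comparison matches, so no subqueries over the same data set are needed.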
Split and count
You can split the string by the search phrase and subtract 1 from the number of resulting fragments. This works with any substring, which on the other hand means that it does not respect word boundaries.
LET things = [
{name: "Here are SomeSome and Some Other Things, brOther!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
FOR t IN things
LET Some = LENGTH(SPLIT(t.name, "Some"))-1
LET Other = LENGTH(SPLIT(t.name, "Other"))-1
LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
RETURN {
Some, Other, Thing
}
Result:
[
{
"Some": 3,
"Other": 2,
"Thing": 1
},
{
"Some": 0,
"Other": 0,
"Thing": 0
},
{
"Some": 0,
"Other": 1,
"Thing": 0
}
]
You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
Collect words
You can utilize the TOKENS() function to split the input into an array of words and then group and count them. Note that I changed the input slightly: the input "SomeSome" will not be counted, because "somesome" != "some" (this variant is word-based rather than substring-based).
LET things = [
{name: "Here are SOME some and Some Other Things. More Other!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")
FOR t IN things
LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
LET counts = MERGE(FOR w IN whitelisted
COLLECT word = w WITH COUNT INTO count
RETURN { [word]: count }
)
RETURN {
name: t.name,
some: counts.some || 0,
other: counts.other || 0,
things: counts.things || 0
}
Result:
[
{
"name": "Here are SOME some and Some Other Things. More Other!",
"some": 3,
"other": 2,
"things": 0
},
{
"name": "There are no such substrings in here.",
"some": 0,
"other": 0,
"things": 0
},
{
"name": "some-Other-here-though!",
"some": 1,
"other": 1,
"things": 0
}
]
This does use a subquery for the COLLECT, otherwise it would count the total occurrences across the entire input.
The whitelist step is not strictly necessary; you could just as well let it count all words. For larger input strings, skipping the words you are not interested in could save some memory.
You may want to create a separate Analyzer with stemming disabled for the language if you want to match the words precisely. You can also turn off normalization ("accent": true, "case": "none"). Another approach is to use REGEX_SPLIT() with the typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.
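For example, such a custom text Analyzer could be created from arangosh. This is a sketch; the Analyzer name myTextNoStem is made up, and the exact property set may vary by ArangoDB version, so check the Analyzer documentation:

```js
// arangosh: create a text Analyzer that keeps case and accents and does not stem
var analyzers = require("@arangodb/analyzers");
analyzers.save("myTextNoStem", "text", {
  locale: "en",     // language used for tokenization
  stemming: false,  // match words precisely, no stemming
  accent: true,     // keep accented characters as-is
  case: "none"      // do not lowercase tokens
}, []);             // no ArangoSearch features needed if only used with TOKENS()
```

It could then be used in place of the built-in one, e.g. TOKENS(t.name, "myTextNoStem").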
Other solutions
I don't believe it is possible to count each input object independently with COLLECT and without subqueries, unless you want a total count.
The splitting is a bit of a hack, but you can replace SPLIT() with REGEX_SPLIT() and wrap the search phrases in \b so that they only match if there is a word boundary on both sides. Then it should only match words (more or less):
LET things = [
{name: "Here are SomeSome and Some Other Things, brOther!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
FOR t IN things
LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
RETURN {
Some, Other, Thing
}
Result:
[
{
"Some": 1,
"Other": 1,
"Thing": 1
},
{
"Some": 0,
"Other": 0,
"Thing": 0
},
{
"Some": 0,
"Other": 1,
"Thing": 0
}
]
A more elegant solution would be to utilize ArangoSearch for the word counting, but it does not have a feature that lets you retrieve how often a term occurs. It may track this internally (Analyzer feature "frequency"), but it is certainly not exposed at this point.