Word boundary RegEx search using PyMongo

Question

I want to do a word boundary search. For example, suppose you have the following entries:

"the cooks."
"cooks"
" cook."
"the cook is"
"cook."

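and you search for entries that contain "cook" as a whole word. That is, only the 3rd, 4th, and 5th entries should be returned.

In this case, when I use the \b word boundary marker, it somehow gets distorted because of automatic escaping.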

import re, pymongo
# prepare pymongo
collection.find({"entry": re.compile('\bcook\b').pattern})

When I print the query dictionary, \b becomes \\b.

My question is: how do I do a word boundary search using PyMongo? I can do this in the MongoDB shell, but it fails in PyMongo.
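The escaping behaviour can be reproduced without a database. The sketch below is only an illustration (the variable names are made up for it): in a normal string literal Python reads '\b' as the backspace character, while a raw string keeps the backslash, and printing the raw string's repr shows the doubled backslash.

```python
import re

non_raw = '\bcook\b'   # '\b' here is the backspace character (0x08)
raw = r'\bcook\b'      # backslash followed by 'b': the regex word boundary

print(repr(non_raw))   # '\x08cook\x08'
print(repr(raw))       # '\\bcook\\b'

# .pattern gives back a plain str, not a compiled regular expression.
pattern_text = re.compile(raw).pattern
print(type(pattern_text).__name__)   # str
```

The doubled backslash in the printed form is just how repr displays a single backslash; the string itself still holds one.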

Answer 1

Don't use the pattern attribute, which gives you a str object. Pass the compiled regular expression pattern object itself.

cursor = db.your_collection.find({"field": re.compile(r'\bcook\b')})

for doc in cursor:
    print(doc)  # your code
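An alternative that should behave the same way is MongoDB's `$regex` query operator with a raw string. The helper below is a hypothetical sketch, not part of PyMongo: `word_boundary_query` and the field name "entry" are assumptions for illustration, and `re.escape` is used on the assumption that its escaping is acceptable to MongoDB's regex engine for ordinary words.

```python
import re

def word_boundary_query(field, word):
    """Build a filter that matches `word` as a whole word in `field`."""
    # re.escape guards against regex metacharacters in the search word.
    return {field: {"$regex": r"\b" + re.escape(word) + r"\b"}}

query = word_boundary_query("entry", "cook")
print(query)   # {'entry': {'$regex': '\\bcook\\b'}}

# With a live connection (assumed), the filter would be used as:
#     cursor = collection.find(query)
```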

Answer 2

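This needs a "full-text search" index to match all of your cases. No simple RegEx will be enough.

For example, you need English stemming to find both "cook" and "cooks". Your regular expression matches the whole string "cook" between spaces or word boundaries, but not "cooks" or "cooking".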

There are many "full text search" index engines. Research them to decide which one to use.

- Elasticsearch
- Lucene
- Sphinx

I assume PyMongo is connecting to MongoDB. Recent versions have full-text indexing built in. See below.

MongoDB 3.0 has these indexes: https://docs.mongodb.org/manual/core/index-text/
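As a rough sketch of what using such a text index could look like from PyMongo: the two helper functions and the field name "entry" are assumptions for illustration, and `collection` is expected to be a PyMongo collection object. Note that English text search applies stemming, so a search for "cook" would be expected to match "cooks" as well, which may or may not be what you want.

```python
def ensure_text_index(collection, field="entry"):
    """Create a text index on `field` (hypothetical helper)."""
    # "text" is the index type string that pymongo.TEXT stands for.
    return collection.create_index([(field, "text")])

def text_search(collection, term):
    """Run a $text query for `term` (hypothetical helper)."""
    return collection.find({"$text": {"$search": term}})

# With a live connection (assumed):
#     ensure_text_index(db.your_collection)
#     for doc in text_search(db.your_collection, "cook"):
#         print(doc)
```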

Answer 3

All of these test cases are handled by a simple re expression in Python. Example:

>>> import re
>>> a = "the cooks."
>>> b = "cooks"
>>> c = " cook."
>>> d = "the cook is"
>>> e = "cook."
>>> tests = [a,b,c,d,e]
>>> for test in tests:
        rc = re.match("[^c]*(cook)[^s]", test)
        if rc:
                print('   Found: "%s" in "%s"' % (rc.group(1), test))
        else:
                print('   Search word NOT found in "%s"' % test)


   Search word NOT found in "the cooks."
   Search word NOT found in "cooks"
   Found: "cook" in " cook."
   Found: "cook" in "the cook is"
   Found: "cook" in "cook."
>>>
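For comparison, the same five strings can be checked locally with the \b word boundary from the question. This is a minimal sketch using Python's re module only, with no database involved; it shows that \b as a raw string selects the 3rd, 4th, and 5th entries.

```python
import re

tests = ["the cooks.", "cooks", " cook.", "the cook is", "cook."]

# Raw string, so \b reaches the regex engine as a word boundary.
pattern = re.compile(r"\bcook\b")

matches = [t for t in tests if pattern.search(t)]
print(matches)   # [' cook.', 'the cook is', 'cook.']
```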
