Efficient indexing of an emails collection for ordering & filtering by email domain

Question

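I'm using Mongoose to maintain a central collection of email addresses, alongside collections for users and organisations. In my application, I associate users with organisations via the domain of each user's (verified) email address. For example, Acme Ltd owns the domains acme.com and acme.co.uk, so by selecting all emails that use those domains I can collate a unique list of associated users.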

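A user can have multiple email addresses (one primary address plus several secondary ones). Users cannot share an email address, so the "verifiedBy" field enforces a one-to-one relationship between a user and an email.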

My schema (currently) looks like this:

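const { Schema } = require('mongoose');

const emailSchema = new Schema({
    _id: {
        type: String,
        // Getter: turn the stored "domain@local" form back into "local@domain".
        get: function idReverse(_id) { if (_id) return _id.split("@").reverse().join("@"); },
        // Setter: normalise, then store as "domain@local". It must return the new value.
        set: (str) => str.trim().toLowerCase().split("@").reverse().join("@")
    },
    verifiedBy: { type: String, ref: 'User' }
}, options);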

My question is whether it is worth swapping the local and domain parts of the email address in the setter, and swapping them back in the getter, as I've shown, so that the underlying MongoDB index on _id can improve performance and make the kinds of lookups I've described easier.
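To make the idea concrete, here is a minimal sketch of the reversal and of the lookup it is meant to enable. The helper names (toStoredId, fromStoredId, escapeRegex, domainFilter) are hypothetical and not part of Mongoose; the filter assumes the stored form is "domain@local", so every address of one domain shares the same prefix. Whether MongoDB actually turns this regex into tight index bounds is worth confirming with explain().

```javascript
// Hypothetical helper mirroring the setter: "Bob@Acme.com" -> "acme.com@bob".
function toStoredId(email) {
  return email.trim().toLowerCase().split("@").reverse().join("@");
}

// Hypothetical helper mirroring the getter: "acme.com@bob" -> "bob@acme.com".
function fromStoredId(id) {
  return id.split("@").reverse().join("@");
}

// Escape regex metacharacters so the "." in a domain matches literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a query filter matching every stored _id that belongs to `domain`.
// The regex is anchored with ^ and the trailing "@" keeps "acme.com"
// from also matching "acme.community".
function domainFilter(domain) {
  return { _id: new RegExp("^" + escapeRegex(domain.toLowerCase()) + "@") };
}

const stored = toStoredId("  Bob@Acme.com ");
console.log(stored);               // acme.com@bob
console.log(fromStoredId(stored)); // bob@acme.com

const filter = domainFilter("acme.com");
console.log(filter._id.test(stored));                           // true
console.log(filter._id.test(toStoredId("eve@acme.community"))); // false
```

With a real model, the filter object would be passed to a find call as-is; that part is left out here so the sketch runs without a database.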

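The alternatives I have considered are:

Store the email as-is and use a regex to select users by the domain part (which feels expensive processing-wise)
Store the domain part in a separate field and index it (which feels expensive storage-wise, because of the two indexes and the duplicated data)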
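For comparison, the two alternatives can be sketched roughly as follows. The function names and the "domain" field are assumptions made for illustration only. Note that the first filter is anchored at the end of the string rather than the start, so it is not a prefix match and an index on the address is unlikely to narrow the scan.

```javascript
// Escape regex metacharacters so the "." in a domain matches literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Alternative 1: keep the address as-is and match the domain with a
// suffix regex (anchored with $, not with ^).
function suffixFilter(domain) {
  return { _id: new RegExp("@" + escapeRegex(domain.toLowerCase()) + "$") };
}

// Alternative 2: duplicate the domain into its own field (hypothetical
// name "domain") so it can be indexed and queried by plain equality.
function toEmailDoc(email, userId) {
  const address = email.trim().toLowerCase();
  const domain = address.slice(address.lastIndexOf("@") + 1);
  return { _id: address, domain: domain, verifiedBy: userId };
}

console.log(suffixFilter("acme.com")._id.test("bob@acme.com"));    // true
console.log(suffixFilter("acme.com")._id.test("bob@notacme.com")); // false
console.log(toEmailDoc(" Bob@Acme.co.uk", "u1"));
// { _id: 'bob@acme.co.uk', domain: 'acme.co.uk', verifiedBy: 'u1' }
```

With the second shape, a lookup by domain would be an equality filter such as { domain: "acme.co.uk" }, at the cost of the extra field and its index.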

Answer 1

The first option should actually work pretty well. According to the $regex docs:

[...] Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. [...]

A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. [...]

Experiment

Let's see how this behaves on a collection of about 800k documents, roughly 25% of which have an email. The example query analysed is {email: /^gmail/}.

Without an index:

db.users.find({email: /^gmail/}).explain('executionStats').executionStats
// ...
//    "nReturned" : 2208,
//    "executionTimeMillis" : 250,
//    "totalKeysExamined" : 0,
//    "totalDocsExamined" : 202720,
// ...

With an {email: 1} index:

db.users.find({email: /^gmail/}).explain('executionStats').executionStats
// ...
//    "nReturned" : 2208,
//    "executionTimeMillis" : 5,
//    "totalKeysExamined" : 2209,
//    "totalDocsExamined" : 2208,
// ...

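As we can see, the index definitely helps, both in execution time and in the number of documents examined (more documents examined means potentially more IO work). Now let's see how it behaves if we drop the prefix anchor and use the query more naively: {email: /gmail/}.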

Without an index:

db.users.find({email: /gmail/}).explain('executionStats').executionStats
// ...
//    "nReturned" : 2217,
//    "executionTimeMillis" : 327,
//    "totalKeysExamined" : 0,
//    "totalDocsExamined" : 202720,
// ...

With an {email: 1} index:

db.users.find({email: /gmail/}).explain('executionStats').executionStats
// ...
//    "nReturned" : 2217,
//    "executionTimeMillis" : 210,
//    "totalKeysExamined" : 200616,
//    "totalDocsExamined" : 2217,
// ...

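In the end, the index helps a lot, especially for prefix queries. The prefix query looks fast enough to keep everything in a single field as-is. A separate field would probably make better use of the index (try it!), but I think 5 ms is good enough.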
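One way to read the four explain() runs side by side is to divide the work done (index keys plus documents examined) by the number of results returned. The snippet below is only an illustrative calculation on the figures quoted in this answer; the workPerResult metric is an ad hoc choice, not something MongoDB reports.

```javascript
// Figures copied from the explain('executionStats') outputs in this answer.
const runs = [
  { label: "/^gmail/ no index",   nReturned: 2208, totalKeysExamined: 0,      totalDocsExamined: 202720 },
  { label: "/^gmail/ with index", nReturned: 2208, totalKeysExamined: 2209,   totalDocsExamined: 2208 },
  { label: "/gmail/ no index",    nReturned: 2217, totalKeysExamined: 0,      totalDocsExamined: 202720 },
  { label: "/gmail/ with index",  nReturned: 2217, totalKeysExamined: 200616, totalDocsExamined: 2217 },
];

// Ad hoc metric: index keys plus documents touched per returned document.
function workPerResult(stats) {
  return (stats.totalKeysExamined + stats.totalDocsExamined) / stats.nReturned;
}

for (const run of runs) {
  console.log(run.label + ": " + workPerResult(run).toFixed(2));
}
// /^gmail/ no index: 91.81
// /^gmail/ with index: 2.00
// /gmail/ no index: 91.44
// /gmail/ with index: 91.49
```

By this measure the unanchored query with an index touches about as many entries as a full collection scan, only they are index keys rather than documents, which fits the smaller drop in execution time.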

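As always, I strongly recommend testing on your own data and measuring how it performs, since the characteristics of the data can affect performance.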
