Store time related data in ElasticSearch

Question

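I want to store time-related data in ElasticSearch. Each document has a start time and an end time that define when it is relevant (for example, an open ticket). I then want to be able to run queries such as: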

Show all tickets that were open between January 5 and February 2.
Show facets on multiple fields for all tickets that were open during some time range.

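Since I will have monthly indices, documents that span more than one month will be stored in multiple indices.
For example, a ticket that is open from January to April needs to be stored in all 4 monthly indices.
I would like to know whether there is a simple way to later run aggregations across the indices that count each ticket only once (for faceting and so on).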

For example, if I have the following tickets:

Ticket A is open during 1/1/2021-17/1/2021
Ticket B is open during 15/1/2021-16/2/2021
Ticket C is open during 12/1/2021-16/3/2021

We will have the following documents/indices:

Index January:  
  {id: A, StartDate: 1/1/2021, EndDate: 17/1/2021}
  {id: B, StartDate: 15/1/2021}
  {id: C, StartDate: 12/1/2021}
Index February:  
  {id: B, StartDate: 15/1/2021, EndDate: 16/2/2021}
  {id: C, StartDate: 12/1/2021}
Index March:  
  {id: C, StartDate: 12/1/2021, EndDate: 16/3/2021}

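Any input on the best way to implement this kind of functionality with ElasticSearch is welcome. If there are other databases better suited to large-scale, fast aggregations, I would be happy to hear about alternatives as well.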

Answer 1

This is fairly straightforward if two conditions are met:

Every document has exactly one StartDate value.
Every document that has an EndDate also has a StartDate.

So, if we set up an index template that covers all our future monthly indices:

PUT _index_template/monthly-indices
{
  "index_patterns": [
    "index-2021-*"
  ],
  "template": {
    "mappings": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "StartDate": {
          "type": "date",
          "format": "d/M/yyyy"
        },
        "EndDate": {
          "type": "date",
          "format": "d/M/yyyy"
        }
      }
    }
  }
}

then we can add the sample documents from your question:

POST index-2021-01/_doc/A
{
  "id": "A",
  "StartDate": "1/1/2021",
  "EndDate": "17/1/2021"
}

POST index-2021-01/_doc/B
{
  "id": "B",
  "StartDate": "15/1/2021"
}

POST index-2021-02/_doc/B
{
  "id": "B",
  "StartDate": "15/1/2021",
  "EndDate": "16/2/2021"
}

POST index-2021-02/_doc/C
{
  "id": "C",
  "StartDate": "12/2/2021"
}

POST index-2021-03/_doc/C
{
  "id": "C",
  "StartDate": "12/2/2021",
  "EndDate": "16/3/2021"
}

Afterwards, the question "Which tickets were open between 1/1/2021 and 31/3/2021?" can be answered with:

POST index-2021-*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "StartDate": {
              "gte": "1/1/2021"
            }
          }
        },
        {
          "range": {
            "EndDate": {
              "lte": "31/3/2021"
            }
          }
        }
      ]
    }
  }
}

This returns documents A, B, and C, but obviously only the entries that have both a start date and an end date.
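
Note that the query above matches tickets that were both opened and closed inside the range. If "open between" is instead read as "open at any point during the range", the condition becomes an interval overlap. Here is a minimal sketch in Python, assuming the mapping from the template above. The helper name `open_during_query` is hypothetical, and the result is just a plain dict that could be sent as the body of `index-2021-*/_search`:

```python
import json

def open_during_query(range_start: str, range_end: str) -> dict:
    """Build a bool query for tickets open at any point in [range_start, range_end].

    Dates are strings in the d/M/yyyy format used by the index template.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # The ticket was opened on or before the end of the range ...
                    {"range": {"StartDate": {"lte": range_end}}},
                    # ... and it was closed on/after the start of the range,
                    # or it has no EndDate at all (this copy is still open).
                    {
                        "bool": {
                            "should": [
                                {"range": {"EndDate": {"gte": range_start}}},
                                {"bool": {"must_not": {"exists": {"field": "EndDate"}}}},
                            ],
                            "minimum_should_match": 1,
                        }
                    },
                ]
            }
        }
    }

body = open_during_query("5/1/2021", "2/2/2021")
print(json.dumps(body, indent=2))
```

Because a ticket may be stored in several monthly indices, the hits can still contain the same id more than once, and an older copy without an EndDate will keep matching even after the ticket has been closed in a later index.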

Moving on, a simple "facet" statistic representing "how many tickets there were per month" can be built with this scripted date_histogram aggregation:

POST index-2021-*/_search
{
  "size": 0,
  "aggs": {
    "open_count_by_months": {
      "date_histogram": {
        "calendar_interval": "month",
        "format": "MMMM yyyy",
        "script": """
          def start = doc['StartDate'].value;
          def end = doc['EndDate'].size() == 0 ? null : doc['EndDate'].value;
          
          if (end == null) {
            return start
          }
          
          return end
        """
      }
    }
  }
}

yielding

"aggregations" : {
  "open_count_by_months" : {
    "buckets" : [
      {
        "key_as_string" : "January 2021",
        "key" : 1609459200000,
        "doc_count" : 2
      },
      {
        "key_as_string" : "February 2021",
        "key" : 1612137600000,
        "doc_count" : 2
      },
      {
        "key_as_string" : "March 2021",
        "key" : 1614556800000,
        "doc_count" : 1
      }
    ]
  }
}
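
To see where these counts come from, the bucketing rule of the script (use EndDate when present, otherwise StartDate) can be replayed offline. This is only an illustrative Python sketch over the five sample documents indexed above. The names `DOCS`, `parse`, and `bucket_key` are made up for this example, and the buckets are keyed as `yyyy-MM` rather than with month names:

```python
from collections import Counter
from datetime import datetime

# The five sample documents indexed above (one dict per stored copy).
DOCS = [
    {"id": "A", "StartDate": "1/1/2021", "EndDate": "17/1/2021"},
    {"id": "B", "StartDate": "15/1/2021"},
    {"id": "B", "StartDate": "15/1/2021", "EndDate": "16/2/2021"},
    {"id": "C", "StartDate": "12/2/2021"},
    {"id": "C", "StartDate": "12/2/2021", "EndDate": "16/3/2021"},
]

def parse(value: str) -> datetime:
    # Same day/month/year layout as the "d/M/yyyy" mapping format.
    return datetime.strptime(value, "%d/%m/%Y")

def bucket_key(doc: dict) -> str:
    # Mirrors the script: use EndDate when present, otherwise StartDate.
    chosen = doc.get("EndDate") or doc["StartDate"]
    return parse(chosen).strftime("%Y-%m")

counts = Counter(bucket_key(doc) for doc in DOCS)
print(dict(counts))  # {'2021-01': 2, '2021-02': 2, '2021-03': 1}
```

The counts are per stored copy, not per distinct ticket: ticket B contributes once to January (its open copy) and once to February (its closed copy).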

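To sum up, if you manage to keep reliable control over the contents of documents that are "spread" across multiple indices, you'll be fine.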
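
One way to picture that control on the client side: when hits from several monthly indices come back, keep a single copy per ticket id, preferring the copy that carries an EndDate. This is a minimal Python sketch, assuming the hits have already been reduced to their `_source` dicts. The function name `latest_copy_per_ticket` is hypothetical:

```python
from typing import Dict, List

def latest_copy_per_ticket(sources: List[dict]) -> Dict[str, dict]:
    """Keep one document per ticket id, preferring a copy that has an EndDate."""
    best: Dict[str, dict] = {}
    for doc in sources:
        current = best.get(doc["id"])
        # Take the first copy seen, or replace an open copy with a closed one.
        if current is None or ("EndDate" in doc and "EndDate" not in current):
            best[doc["id"]] = doc
    return best

sources = [
    {"id": "A", "StartDate": "1/1/2021", "EndDate": "17/1/2021"},
    {"id": "B", "StartDate": "15/1/2021"},
    {"id": "B", "StartDate": "15/1/2021", "EndDate": "16/2/2021"},
    {"id": "C", "StartDate": "12/2/2021"},
    {"id": "C", "StartDate": "12/2/2021", "EndDate": "16/3/2021"},
]
unique = latest_copy_per_ticket(sources)
print(sorted(unique))          # ['A', 'B', 'C']
print(unique["B"]["EndDate"])  # 16/2/2021
```

If only a distinct count is needed, a `cardinality` aggregation on the `id` keyword field gives an approximate number of unique tickets on the server side, without pulling the hits back.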

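However, if you just let a cronjob sync the tickets every day and build your documents so that they act like messages "announcing" the current ticket status, i.e.: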

{
  "ticket_id": "A",
  "status": "OPEN",
  "timestamp": 1613428738570
}

it would introduce some nasty complexities that I discussed a while ago.
