How to optimize datetime comparisons in MySQL in a WHERE clause

Question

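Context

I have a large table of "documents" that is updated by an external source. When I notice that a document's update is more recent than my last touchpoint, I need to process that document. I'm having some serious performance issues, though.

Example code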

select count(*) from documents;

gets me back 212,494,397 documents in 1 min 15.24 sec.

select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);

which is approximately the actual query, gets me 55,988,860 in 14 min 36.23 sec.

select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;

notably takes about 15 minutes as well. (This was surprising to me.)

Question

How do I perform the updated_at > last_indexed_at in a more reasonable time?

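Details

I'm fairly sure my query is somehow not sargable. Unfortunately, I can't find what about this query prevents it from being evaluated on a row-independent basis.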

select count(*) 
from documents 
where last_indexed_at is null or updated_at > last_indexed_at;

does no better.

Neither does

select count( distinct( id ) ) 
from documents 
where last_indexed_at is null or updated_at > last_indexed_at limit 1;

Nor does

select count( distinct( id ) ) 
from documents limit 1;

Edit: follow-up with the requested data

This problem involves only one table in a Rails project (thankfully), so we conveniently have the Rails definition for the table.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
  `id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
  `fields` mediumtext COLLATE utf8mb4_unicode_ci,
  `created_at` datetime(6) NOT NULL,
  `updated_at` datetime(6) NOT NULL,
  `last_indexed_at` datetime(6) DEFAULT NULL,
  `deleted_at` datetime(6) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_documents_on_document_type` (`document_type`),
  KEY `index_documents_on_locale` (`locale`),
  KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

SELECT VERSION(); got me 5.7.27-30-log

And probably most important,

explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);

gets me exactly

+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
|  1 | SIMPLE      | documents | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 208793754 |   100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+

Answer 1

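Add a covering index

If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns doesn't matter in this case.)

This assumes both columns are in the same table. If so, the query has to read every row, because comparing two columns of the same row gives an index nothing to seek on. (I don't know whether the term "sargable" covers this case.)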

The INDEX I suggest will be faster because it is "covering". By reading only the index, there is less I/O.
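A minimal sketch of that suggestion against the `documents` table from the question. The index name is an assumption (any unused name works), and whether the optimizer actually chooses the index should be checked with EXPLAIN rather than taken for granted.

```sql
-- Hypothetical index name; both columns the WHERE clause reads are in the index.
ALTER TABLE documents
  ADD INDEX index_documents_on_last_indexed_at_and_updated_at
      (last_indexed_at, updated_at);

-- Inspect the plan. If the index is used as a covering index,
-- the Extra column should mention "Using index".
EXPLAIN
SELECT COUNT(*)
FROM documents
WHERE last_indexed_at IS NULL
   OR updated_at > last_indexed_at;
```

Since `last_indexed_at` is the leftmost column of the new index, the existing single-column `index_documents_on_last_indexed_at` would probably become redundant once this one is in place.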

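The repeated 15 minutes is probably because innodb_buffer_pool_size is not big enough to hold the entire table, so the query is I/O-bound. My INDEX will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So it will be faster, and faster still on subsequent runs.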
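One way to check that guess is to compare the buffer pool size with the size of each index on the table. This is a sketch that assumes InnoDB persistent statistics are enabled (the default in 5.7) and that your account can read the `mysql` schema; the sizes are estimates, not exact figures.

```sql
-- Configured buffer pool size, in MB
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

-- Approximate size of each index on documents, in MB.
-- The 'size' statistic is a page count; PRIMARY is the table data itself.
SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name = 'documents'
  AND stat_name = 'size';
```

If the PRIMARY row is much larger than buffer_pool_mb while the secondary index fits, that would be consistent with the I/O-bound explanation above.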

Slow OR

OR is often a terrible slowdown. However, I don't think it matters here.

If you were to initialize last_indexed_at to some old date (e.g., '2000-01-01'), you could get rid of the COALESCE or the OR.
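Roughly, that could look like the following. It is only a sketch: it assumes no document has an updated_at earlier than 2000-01-01, that the application never writes NULL into last_indexed_at afterwards, and that you can afford a backfill on a table this large. The batch size is an arbitrary choice.

```sql
-- Backfill in batches; repeat this statement until it affects 0 rows.
UPDATE documents
SET last_indexed_at = '2000-01-01'
WHERE last_indexed_at IS NULL
LIMIT 10000;

-- Optionally make the sentinel the default so new rows never hold NULL.
ALTER TABLE documents
  MODIFY last_indexed_at datetime(6) NOT NULL DEFAULT '2000-01-01 00:00:00';

-- With no NULLs left, the predicate is a plain comparison.
SELECT COUNT(*)
FROM documents
WHERE updated_at > last_indexed_at;
```

The final query still compares two columns of the same row, so it benefits from the covering index in the same way as before; the change only removes the NULL handling.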

Another way to clean it up is

SELECT  SUM(last_indexed_at IS NULL) +
        SUM(updated_at > last_indexed_at) AS "Need indexing"
    FROM t;

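This also needs the index. SUM(boolean expression) counts the rows where the expression is true: true adds 1, false adds 0, and NULL is skipped. Note that SUM returns NULL when every input is NULL, so wrap each SUM in COALESCE(..., 0) if that can happen.

Meanwhile, I don't see COUNT(DISTINCT id) being any different from COUNT(*), since id is the primary key. And the pair of SUMs should give you the same value.

Again, I am relying on "covering" as the trick.

"More than ..." trick

In some situations you don't need the exact number, especially when it is "more than some threshold".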

SELECT 1 FROM tbl WHERE ... LIMIT 1000,1;

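If that comes back with "1", there are more than 1000 rows (the query skips 1000 matches and returns the next one). If it comes back empty (no row returned), there are not.

That still has to touch about 1000 rows (hopefully in the index), but that is better than touching a million.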
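Applied to the table in the question, the trick might look like this. The threshold of 1000 is just an example value.

```sql
-- Returns one row if more than 1000 documents need indexing,
-- and an empty result otherwise.
SELECT 1
FROM documents
WHERE last_indexed_at IS NULL
   OR updated_at > last_indexed_at
LIMIT 1000, 1;
```

The early exit only helps when enough matching rows exist. If fewer rows match than the threshold, the query still has to scan everything before it can return empty.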

Answer 2

If you are using a recent MySQL version (5.7+), you can add a generated column containing your search expression to the table, then index that column.

ALTER TABLE t 
 ADD COLUMN needs_indexing TINYINT 
  GENERATED ALWAYS AS 
     (CASE WHEN last_indexed_at IS NULL THEN 1
           WHEN updated_at > last_indexed_at THEN 1
           ELSE 0 END) VIRTUAL;
ALTER TABLE t 
  ADD INDEX needs_indexing (needs_indexing);

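Because the column is VIRTUAL, this uses disk space for the index, but not in the table itself.

Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items that match your criterion.

But: you don't have to count all the items to know that some of them need reindexing. As you have discovered, doing COUNT(*) on a large InnoDB table is very expensive. You can do this: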

SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;

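You will get back 1 or 0 from this query very quickly. 1 means you have at least one row that matches your criterion.

And, of course, your indexing job can do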

SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;

or whatever makes sense. That is also fast.
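For an indexing job that works in batches, something along these lines may be enough. It uses the same placeholder table name `t` as the statements above, and the batch size of 1000 is an assumption.

```sql
-- Confirm the lookup is served by the needs_indexing index.
EXPLAIN
SELECT id FROM t WHERE needs_indexing = 1 LIMIT 1000;

-- Fetch one batch of primary keys for the indexing job.
SELECT id FROM t WHERE needs_indexing = 1 LIMIT 1000;
```

Selecting only id keeps the query inside the secondary index, because InnoDB stores the primary key in every secondary index entry.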

Answer 3

Oh! MySQL 5.7 introduced Generated Columns, which give us a way of indexing expressions!

If you do this:

ALTER TABLE documents
  ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
  ADD INDEX index_documents_on_dirty(dirty);

...and change the query to:

SELECT COUNT(*) FROM documents WHERE dirty;

...what results do you get?

Hopefully we will have moved the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE) from read time to write time.
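To see whether that worked, it may help to look at the plan and to compare two spellings of the predicate. This is a sketch: which form the 5.7 optimizer handles better is something to measure on your data, not something this example asserts.

```sql
-- Plan for the bare boolean predicate
EXPLAIN
SELECT COUNT(*) FROM documents WHERE dirty;

-- Plan for an explicit equality, which may allow a direct index lookup
EXPLAIN
SELECT COUNT(*) FROM documents WHERE dirty = 1;
```

In both plans, index_documents_on_dirty in the key column and "Using index" in Extra would indicate that the count is answered from the index alone.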
