MySQL query using GROUP BY is extremely slow

Question

I have a database with the following schema:

CREATE TABLE IF NOT EXISTS `sessions` (
  `starttime` datetime NOT NULL,
  `ip` varchar(15) NOT NULL default '',
  `country_name` varchar(45) default '',
  `country_iso_code` varchar(2) default '',
  `org` varchar(128) default '',
  KEY (`ip`),
  KEY (`starttime`),
  KEY (`country_name`)
);

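(The actual table contains more columns; I have included only the ones I query.) The engine is InnoDB.

As you can see, there are 3 indexes: on ip, starttime, and country_name.

The table is pretty big; it contains about 1.5 million rows. I am running various queries on it, trying to extract one month's worth of information (August 2018 in the examples below).

A query like this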

SELECT
  UNIX_TIMESTAMP(starttime) as time_sec,
  country_iso_code AS metric,
  COUNT(country_iso_code) AS value
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY metric;

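is rather slow but bearable (tens of seconds), despite there being no index on country_iso_code.

(Ignore the first item in the SELECT; I know it seems pointless, but it is required by the tool that consumes the query result. Likewise, ignore the use of FROM_UNIXTIME() instead of a date string; that part of the query is auto-generated and I have no control over it.)

However, a query like this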

SELECT
  country_name AS Country,
  COUNT(country_name) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;

is unbearably slow. I let it run for about half an hour and gave up without getting any result.

Results from EXPLAIN:

+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys                      | key          | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,country_name | country_name | 138     | NULL | 14771687 |    35.81 | Using where |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+

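What exactly is the problem? Should I index something else? Perhaps a composite index on (starttime, country_name)? I have read this guide, but maybe I misunderstood it?

Here are some other queries that are just as slow and probably suffer from the same problem:

Query #2: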

SELECT
  ip AS IP,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771780 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Query #3:

SELECT
  org AS Organization,
  COUNT(org) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Organization;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys             | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,org | org  | 387     | NULL | 14771800 |    35.81 | Using where |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+

Query #4:

SELECT
  ip AS IP,
  country_name AS Country,
  city_name AS City,
  org AS Organization,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771914 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Answer 1

In general, queries of the form

  SELECT column, COUNT(column)
    FROM tbl
   WHERE datestamp >= a AND datestamp <= b
   GROUP BY column

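perform best when the table has a compound index on (datestamp, column). Why? They can be satisfied by an index scan, rather than by reading all the rows of the table.

In other words, the first relevant row for the query can be located by random access into the index (to the first matching value of datestamp). MySQL can then read the index sequentially, counting the various values of column, until it hits the last relevant row. There is no need to read the actual table; the query can be satisfied from the index alone. That is what makes it faster.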

ALTER TABLE tbl ADD INDEX date_col (datestamp, column);

creates the index for you.
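Applied to the sessions table from the question, this might look like the sketch below. The index names are made up for illustration, and whether the optimizer actually picks the new indexes depends on the MySQL version and the table statistics, so it is worth re-running EXPLAIN afterwards.

```sql
-- Hypothetical index names; one composite index per grouped column,
-- each starting with the range column (starttime).
ALTER TABLE sessions
  ADD INDEX starttime_country (starttime, country_name),
  ADD INDEX starttime_ip (starttime, ip),
  ADD INDEX starttime_org (starttime, org);

-- Re-check the plan. If the composite index is used as a covering index,
-- `key` should name it and `Extra` should include "Using index".
EXPLAIN
SELECT
  country_name AS Country,
  COUNT(country_name) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;
```

Each extra index costs disk space and slows down inserts a little, so it makes sense to add only the ones whose queries you actually run.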

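Notice two things. One: single-column indexes do not necessarily help the performance of aggregate queries.

Two: without seeing the entire query, it is hard to guess the right index for making it an index scan. Simplified queries often lead to oversimplified indexes.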
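Query #4 from the question illustrates the second point. It also selects country_name, city_name and org, so an index on (starttime, ip) alone would not cover it, and MySQL would still have to visit the table rows. A rough sketch of a covering index follows; the index name is hypothetical, and since the definition of city_name is not shown in the question, I am assuming the combined key stays within InnoDB's index length limit.

```sql
-- Hypothetical index name. Covers every column that Query #4 reads,
-- with the range column first and the grouped column second.
-- Assumption: the total key length fits within InnoDB's index size limit.
ALTER TABLE sessions
  ADD INDEX starttime_ip_details (starttime, ip, country_name, city_name, org);
```

An index this wide is nearly a second copy of those columns, so it is a trade-off rather than an obvious win.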

Answer 2

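Better yet...

Note that you have no PRIMARY KEY; that is naughty. Having a PK does not inherently improve performance, but having a PK that starts with starttime will. Let's do this: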

CREATE TABLE IF NOT EXISTS `sessions` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- note
  `starttime` datetime NOT NULL,
  `ip` varchar(39) CHARACTER SET ascii NOT NULL default '',  -- note
  `country_name` varchar(45) default '',
  `country_iso_code` char(2) CHARACTER SET ascii default '',  -- note
  `org` varchar(128) default '',
  PRIMARY KEY(starttime, id), -- in this order
  INDEX(id),                  -- to keep AUTO_INCREMENT happy
  -- The rest are unnecessary for the queries in question:
  KEY (`ip`),
  KEY (`starttime`),
  KEY (`country_name`)
) ENGINE=InnoDB;        -- just in case you are accidentally getting MyISAM

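Why? This takes advantage of the "clustering" of the PK with the data. That way, only the part of the table that falls within the time range is scanned, there is no bouncing back and forth between the index and the data, and you do not need a lot of indexes to handle all the cases efficiently.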
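If the table already exists and is populated, the same layout can probably be reached with a single ALTER rather than a re-create. This is only a sketch: it assumes the table currently has no PRIMARY KEY and no column named id (the question says there are more columns than shown).

```sql
-- Assumptions: no existing PRIMARY KEY, no existing column called `id`.
-- This rebuilds the whole table in starttime order, so expect it to take
-- a while on a table of this size.
ALTER TABLE sessions
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (starttime, id),   -- clusters the rows by starttime
  ADD INDEX (id);                    -- AUTO_INCREMENT column must be indexed
```

Once the rows are clustered by starttime, the old single-column index on starttime is redundant and could be dropped.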

IPv6 needs up to 39 bytes. Note that VARCHAR will not let you do any range (CIDR) tests. I can discuss that further if you like.
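To give an idea of what a range test could look like: one common approach is to keep the address in binary form, using MySQL's INET6_ATON() and INET6_NTOA(). The sketch below assumes a hypothetical VARBINARY(16) column named ip_bin that has been filled with INET6_ATON(ip); nothing like it exists in the schema from the question.

```sql
-- Assumption: `ip_bin` is a hypothetical VARBINARY(16) column populated
-- with INET6_ATON(ip). IPv4 addresses are stored as 4 bytes and IPv6
-- addresses as 16 bytes, hence the LENGTH() check.
-- Counts sessions per address inside the IPv4 block 192.168.0.0/16.
SELECT
  INET6_NTOA(ip_bin) AS IP,
  COUNT(*) AS Attacks
FROM
  sessions
WHERE
  LENGTH(ip_bin) = 4 AND
  ip_bin BETWEEN INET6_ATON('192.168.0.0') AND INET6_ATON('192.168.255.255')
GROUP BY ip_bin;
```

The CIDR block is expressed as its first and last address, which turns the test into an ordinary byte-wise range comparison.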
