What is the performance difference in MySQL relational division (IN AND instead of IN OR) implementations?
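Because MySQL has no built-in relational division operator, programmers must implement it themselves. Two main example implementations can be found in this answer here.
For posterity, I'll list them below: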
Using GROUP BY/HAVING
SELECT t.documentid
FROM TABLE t
WHERE t.termid IN (1,2,3)
GROUP BY t.documentid
HAVING COUNT(DISTINCT t.termid) = 3
The caveat is that you have to use HAVING COUNT(DISTINCT because
duplicates of termid being 2 for the same documentid would be a false
positive. And the COUNT has to equal the number of termid values in
the IN clause.
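To make the caveat concrete, here is a minimal sketch on a made-up table. The name doc_term_demo and its rows are assumptions for illustration only, not part of the quoted answer. Document 20 has only terms 1 and 2, but term 2 is stored twice:

```sql
-- Hypothetical table and data, for illustration only.
DROP TABLE IF EXISTS doc_term_demo;
CREATE TABLE doc_term_demo (
    documentid INT NOT NULL,
    termid     INT NOT NULL
);

INSERT INTO doc_term_demo (documentid, termid) VALUES
    (10, 1), (10, 2), (10, 3),   -- document 10 has all three terms
    (20, 1), (20, 2), (20, 2);   -- document 20 lacks term 3 but repeats term 2

-- COUNT(*) counts rows, so document 20 slips through as a false positive.
SELECT documentid
FROM doc_term_demo
WHERE termid IN (1, 2, 3)
GROUP BY documentid
HAVING COUNT(*) = 3;                 -- returns 10 and 20

-- COUNT(DISTINCT termid) counts different terms, so only document 10 qualifies.
SELECT documentid
FROM doc_term_demo
WHERE termid IN (1, 2, 3)
GROUP BY documentid
HAVING COUNT(DISTINCT termid) = 3;   -- returns 10
```

If (documentid, termid) is guaranteed unique, for example by a primary key, the two forms return the same rows, but DISTINCT is the safer default.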
Using JOINs
SELECT t.documentid
FROM TABLE t
JOIN TABLE x ON x.documentid = t.documentid
AND x.termid = 1
JOIN TABLE y ON y.documentid = t.documentid
AND y.termid = 2
JOIN TABLE z ON z.documentid = t.documentid
AND z.termid = 3
But this one can be a pain for handling criteria that change a lot.
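A rough sketch of why changing criteria are more awkward with joins: going from three terms to four only touches two literals in the GROUP BY/HAVING form, while the JOIN form needs another self-join. The table doc_term_joins and its rows are hypothetical, and the joins are written on documentid, which is what matching terms within one document requires.

```sql
-- Hypothetical table and data, for illustration only.
DROP TABLE IF EXISTS doc_term_joins;
CREATE TABLE doc_term_joins (
    documentid INT NOT NULL,
    termid     INT NOT NULL
);

INSERT INTO doc_term_joins (documentid, termid) VALUES
    (10, 1), (10, 2), (10, 3), (10, 4),   -- has all four terms
    (20, 1), (20, 2), (20, 3);            -- missing term 4

-- GROUP BY/HAVING: extend the IN list and bump the count.
SELECT documentid
FROM doc_term_joins
WHERE termid IN (1, 2, 3, 4)
GROUP BY documentid
HAVING COUNT(DISTINCT termid) = 4;        -- returns 10

-- JOIN: one more self-join for every additional term.
SELECT a.documentid
FROM doc_term_joins a
JOIN doc_term_joins b ON b.documentid = a.documentid AND b.termid = 2
JOIN doc_term_joins c ON c.documentid = a.documentid AND c.termid = 3
JOIN doc_term_joins d ON d.documentid = a.documentid AND d.termid = 4
WHERE a.termid = 1;                       -- returns 10
```

In practice the JOIN text would usually be generated by application code, one join per requested term.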
Which of these two implementation techniques offers the best performance?
I made a couple of improvements in the JOIN version; see below.
For speed, I vote for the JOIN approach. Here is how I came to that decision:
HAVING, version 1
mysql> FLUSH STATUS;
mysql> SELECT city
-> FROM us_vch200
-> WHERE state IN ('IL', 'MO', 'PA')
-> GROUP BY city
-> HAVING count(DISTINCT state) >= 3;
+-------------+
| city |
+-------------+
| Springfield |
| Washington |
+-------------+
mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_external_lock | 2 |
| Handler_read_first | 1 |
| Handler_read_key | 2 |
| Handler_read_last | 1 |
| Handler_read_next | 4175 | -- full index scan
(etc)
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+
| 1 | SIMPLE | us_vch200 | range | state_city,city_state | city_state | 769 | NULL | 4176 | Using where; Using index for group-by (scanning) |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+
The 'Extra' column shows that the optimizer decided to tackle the GROUP BY and use INDEX(city, state), even though INDEX(state, city) might make sense.
HAVING, version 2
Switching to INDEX(state, city) yields:
mysql> FLUSH STATUS;
mysql> SELECT city
-> FROM us_vch200 IGNORE INDEX(city_state)
-> WHERE state IN ('IL', 'MO', 'PA')
-> GROUP BY city
-> HAVING count(DISTINCT state) >= 3;
+-------------+
| city |
+-------------+
| Springfield |
| Washington |
+-------------+
mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_external_lock | 2 |
| Handler_read_key | 401 |
| Handler_read_next | 398 |
| Handler_read_rnd | 398 |
(etc)
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+
| 1 | SIMPLE | us_vch200 | range | state_city,city_state | state_city | 2 | NULL | 397 | Using where; Using index; Using filesort |
+----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+
JOIN
mysql> SELECT x.city
-> FROM us_vch200 x
-> JOIN us_vch200 y ON y.city= x.city AND y.state = 'MO'
-> JOIN us_vch200 z ON z.city= x.city AND z.state = 'PA'
-> WHERE x.state = 'IL';
+-------------+
| city |
+-------------+
| Springfield |
| Washington |
+-------------+
2 rows in set (0.00 sec)
mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_external_lock | 6 |
| Handler_read_key | 86 |
| Handler_read_next | 87 |
(etc)
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+
| 1 | SIMPLE | y | ref | state_city,city_state | state_city | 2 | const | 81 | Using where; Using index |
| 1 | SIMPLE | z | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index |
| 1 | SIMPLE | x | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index |
+----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+
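Only INDEX(state, city) is needed.
The Handler counts are the smallest for this formulation, so I deduce that it is the fastest.
Notice how the optimizer decided which table to start with (y, the 'MO' lookup), probably because of these row counts: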
+-------+----------+
| state | COUNT(*) |
+-------+----------+
| IL | 221 |
| MO | 81 | -- smallest
| PA | 96 |
+-------+----------+
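The comparison method used in this answer can be repeated on other data with the same few statements. This is only a sketch of the pattern: the table handler_demo and its rows are hypothetical stand-ins, and FLUSH STATUS may need extra privileges on some servers.

```sql
-- Hypothetical table and data; substitute your own table and candidate queries.
DROP TABLE IF EXISTS handler_demo;
CREATE TABLE handler_demo (
    documentid INT NOT NULL,
    termid     INT NOT NULL,
    INDEX termid_documentid (termid, documentid)
);

INSERT INTO handler_demo (documentid, termid) VALUES
    (10, 1), (10, 2), (10, 3),
    (20, 1), (20, 2);

-- 1. Reset this session's status counters.
FLUSH STATUS;

-- 2. Run one candidate query.
SELECT documentid
FROM handler_demo
WHERE termid IN (1, 2, 3)
GROUP BY documentid
HAVING COUNT(DISTINCT termid) = 3;

-- 3. Read the row-access counters that the query just produced.
SHOW SESSION STATUS LIKE 'Handler%';

-- 4. Inspect the execution plan for the same query.
EXPLAIN
SELECT documentid
FROM handler_demo
WHERE termid IN (1, 2, 3)
GROUP BY documentid
HAVING COUNT(DISTINCT termid) = 3;
```

Repeat steps 1 to 4 for each formulation and compare the Handler_read_* totals; lower totals mean fewer index or row touches, which usually, though not always, tracks elapsed time.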
Conclusion
The JOIN (without the unnecessary t table) is probably the fastest. It also needs this composite index: INDEX(state, city).
Translating back to your use case:
city --> documentid
state --> termid
Caveat: YMMV, because the distribution of values for documentid and termid may be quite different from the test case I used.
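Applying that mapping, the JOIN formulation and its index might look like the sketch below. The table name doc_term_example, the index name, and the sample rows are assumptions, since the question only shows a placeholder table.

```sql
-- Hypothetical table; the composite index mirrors INDEX(state, city).
DROP TABLE IF EXISTS doc_term_example;
CREATE TABLE doc_term_example (
    documentid INT NOT NULL,
    termid     INT NOT NULL,
    INDEX termid_documentid (termid, documentid)
);

INSERT INTO doc_term_example (documentid, termid) VALUES
    (10, 1), (10, 2), (10, 3),   -- has terms 1, 2 and 3
    (20, 1), (20, 2);            -- missing term 3

-- One alias per required term; no separate unfiltered "t" alias.
SELECT x.documentid
FROM doc_term_example x
JOIN doc_term_example y ON y.documentid = x.documentid AND y.termid = 2
JOIN doc_term_example z ON z.documentid = x.documentid AND z.termid = 3
WHERE x.termid = 1;              -- returns 10
```

If the same (documentid, termid) pair can appear more than once, this form returns repeated rows; SELECT DISTINCT x.documentid, or a unique key on the pair, avoids that.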