Optimize MySQL Intersection Query For Large Data
This is my table structure:
CREATE TABLE `instagram_user_followers_mapping` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`instagram_user_id` varchar(20) NOT NULL,
`instagram_profile_id` varchar(20) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `instagram_unique_user_follower_mapping` (`instagram_user_id`,`instagram_profile_id`),
KEY `instagram_user_followers_mapping_created_at_index` (`created_at`),
KEY `instagram_user_followers_mapping_updated_at_index` (`updated_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED
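I have more than 100 million rows in this table. When I try to get the common followers between two or more "instagram_user_id" values, it works fine for profiles with fewer than 20,000 rows in the table. But for profiles with more than 2 million rows, it runs very slowly. I want this data to be displayed in real time for analytics and reporting. The end user may select any combination of profiles, so creating summary tables is not a good option here.
The query I use to get the intersection is: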
select instagram_profile_id, count(*) as myCount
from instagram_user_followers_mapping
where instagram_user_id IN ('1142282','346115','663620','985530')
group by instagram_profile_id HAVING myCount >= 4
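The IN clause is a bit special. Try this query to solve your problem. I changed count(*) to count(id) and replaced the IN clause with equality comparisons in the WHERE clause.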
select instagram_profile_id, count(id) as myCount
from instagram_user_followers_mapping
where instagram_user_id = '1142282' or instagram_user_id = '346115' or instagram_user_id = '663620' or instagram_user_id = '985530'
group by instagram_profile_id HAVING myCount >= 4
The 'IN' vs 'OR' should not be an issue. The query interpreter should consider them to be the same (an EXPLAIN should demonstrate this).
Actually a copy and paste of an EXPLAIN on that query would be very useful...
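For reference, a minimal sketch of the EXPLAIN being asked for, using the query from the question unchanged. What the plan actually shows depends on the server version and table statistics, so the comments below only list what I would look for, not what you will necessarily see.

```sql
-- Prefix the original query with EXPLAIN to see the chosen plan.
-- Columns worth checking in the output:
--   key   : whether instagram_unique_user_follower_mapping is used
--   rows  : the estimated number of rows examined
--   Extra : "Using temporary" and/or "Using filesort" would point at the
--           GROUP BY as the expensive step
EXPLAIN
SELECT instagram_profile_id, COUNT(*) AS myCount
FROM instagram_user_followers_mapping
WHERE instagram_user_id IN ('1142282','346115','663620','985530')
GROUP BY instagram_profile_id
HAVING myCount >= 4;
```

Running the same EXPLAIN on the OR version should let you confirm whether the two forms really get the same plan.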
Since we are dealing with a fairly large number of rows here and your indexes look sufficient, I would look at two things. First is the overall database configuration (making sure enough RAM is allocated to innodb_buffer_pool_size, etc.). The second (and more likely) problem is the GROUP BY being very slow. Try increasing the sort-buffer-related parameters, and have a look here for more ideas:
https://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html
https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html
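As a rough sketch of what "sort buffer type parameters" could mean in practice: the statements below inspect the settings that usually matter for a GROUP BY that sorts or builds a temporary table, then raise them for the current session only. The sizes are illustrative assumptions, not recommendations; pick values that fit your available memory and measure before and after.

```sql
-- Inspect the current values.
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Raise them for this session only (example sizes, assumed for illustration).
SET SESSION sort_buffer_size    = 8 * 1024 * 1024;    -- 8 MB
SET SESSION tmp_table_size      = 256 * 1024 * 1024;  -- 256 MB
SET SESSION max_heap_table_size = 256 * 1024 * 1024;  -- 256 MB
```

On MySQL 5.7, an in-memory temporary table is limited by the smaller of tmp_table_size and max_heap_table_size, which is why both are raised together here.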
Also, if you can, try running each 'WHERE instagram_user_id =' as a separate query.
In general this is not the sort of thing MySQL does wicked fast but with a bit of work you can probably get it to work for you. You might need to get a bit creative on the application side, depending on how fast you need this to be.
This should run faster, but it requires constructing the query:
select instagram_profile_id
from instagram_user_followers_mapping AS t
WHERE instagram_user_id = '1142282'
AND EXISTS
(
SELECT *
FROM instagram_user_followers_mapping
WHERE instagram_profile_id = t.instagram_profile_id
AND instagram_user_id = '346115'
)
AND EXISTS
(
SELECT *
FROM instagram_user_followers_mapping
WHERE instagram_profile_id = t.instagram_profile_id
AND instagram_user_id = '663620'
)
AND EXISTS
(
SELECT *
FROM instagram_user_followers_mapping
WHERE instagram_profile_id = t.instagram_profile_id
AND instagram_user_id = '985530'
);
This formulation avoids the filesort and avoids gathering all the user_ids for a given profile_id (or vice versa).
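One thing that may help when constructing the query: the outer SELECT reads every row for its instagram_user_id, while each EXISTS is a single lookup on the unique (instagram_user_id, instagram_profile_id) index. So it is probably cheapest to put the account with the fewest followers in the outer WHERE. A sketch of how to find it, assuming you do not already store follower counts somewhere:

```sql
-- Count followers per selected account. The account with the smallest
-- follower_count is the cheapest candidate for the outer
-- WHERE instagram_user_id = '...' condition; the rest go into EXISTS.
SELECT instagram_user_id, COUNT(*) AS follower_count
FROM instagram_user_followers_mapping
WHERE instagram_user_id IN ('1142282','346115','663620','985530')
GROUP BY instagram_user_id
ORDER BY follower_count ASC;
```

This count scans the index entries for all selected accounts, so whether it pays off is something to measure; if the application already knows the follower counts, use those instead.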
Is innodb_buffer_pool_size larger than the index size?
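A quick way to check, sketched under the assumption that the table lives in the currently selected database. The sizes in information_schema are estimates.

```sql
-- Configured buffer pool size, in bytes.
SELECT @@innodb_buffer_pool_size AS buffer_pool_bytes;

-- Approximate on-disk size of the table. For InnoDB, data_length is the
-- clustered (PRIMARY KEY) index and index_length covers the secondary
-- indexes, including the unique (instagram_user_id, instagram_profile_id) key.
SELECT table_name,
       data_length  AS data_bytes,
       index_length AS index_bytes
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'instagram_user_followers_mapping';
```

Because the table uses ROW_FORMAT=COMPRESSED, these figures reflect compressed pages, and the buffer pool may hold both compressed and uncompressed copies of a page, so the memory actually needed can be larger than the numbers suggest.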