Optimize MySQL Intersection Query For Large Data

Question

Here is my table structure:

CREATE TABLE `instagram_user_followers_mapping` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`instagram_user_id` varchar(20) NOT NULL,
`instagram_profile_id` varchar(20) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `instagram_unique_user_follower_mapping` (`instagram_user_id`,`instagram_profile_id`),
KEY `instagram_user_followers_mapping_created_at_index` (`created_at`),
KEY `instagram_user_followers_mapping_updated_at_index` (`updated_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED

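I have more than 100 million rows in this table. When I try to get the common followers between two or more "instagram_user_id" values, it works fine for profiles with fewer than 20,000 rows in the table. But for profiles with more than 2 million rows, it runs very slowly. I want to display this data in real time for analytics and reporting. End users may select any combination of profiles, so creating summary tables is not a great option here.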

The query I am using to get the intersection is:

select instagram_profile_id, count(*) as myCount 
from instagram_user_followers_mapping 
where instagram_user_id IN ('1142282','346115','663620','985530') 
group by instagram_profile_id HAVING myCount >= 4

Answer 1

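The IN clause is a bit special. Using this query may solve your problem. I changed count(*) to count(id) and replaced the IN list with equality comparisons in the where clause.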

select instagram_profile_id, count(id) as myCount 
from instagram_user_followers_mapping 
where instagram_user_id = '1142282' or instagram_user_id = '346115' or instagram_user_id = '663620' or instagram_user_id = '985530'
group by instagram_profile_id HAVING myCount >= 4

Answer 2

The 'IN' vs 'OR' should not be an issue. The query interpreter should consider them to be the same (an EXPLAIN should demonstrate this).

Actually a copy and paste of an EXPLAIN on that query would be very useful...

Since we are dealing with a reasonably significant number of rows here, and since your indices look sufficient, I would look at two things. The first is the overall db config (making sure enough RAM is given to the InnoDB buffer pool, etc). The second (and more likely) problem is the GROUP BY being very slow. Try increasing the sort buffer type parameters, and have a look here for more ideas: https://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html

Also, if you can, try running each 'WHERE instagram_user_id =' as a separate query.

In general this is not the sort of thing MySQL does wicked fast but with a bit of work you can probably get it to work for you. You might need to get a bit creative on the application side, depending on how fast you need this to be.
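
One way to read the suggestion of running each 'WHERE instagram_user_id =' lookup separately is to intersect the results on the application side. Here is a minimal Python sketch of that idea. `fetch_followers` is a hypothetical callable (not something from the question) that would run `SELECT instagram_profile_id FROM instagram_user_followers_mapping WHERE instagram_user_id = %s` through whatever driver you use and return the IDs. Whether this beats the GROUP BY depends on your data sizes and available memory, so measure before relying on it.

```python
from typing import Callable, Iterable, Sequence, Set


def common_followers(
    user_ids: Sequence[str],
    fetch_followers: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Intersect follower sets fetched one user at a time.

    fetch_followers is a hypothetical helper: given one instagram_user_id,
    it returns that user's instagram_profile_id values (one query per user).
    """
    if not user_ids:
        return set()

    # Start from the first user's followers, then narrow the set down.
    common = set(fetch_followers(user_ids[0]))
    for user_id in user_ids[1:]:
        if not common:
            break  # nothing left to intersect, skip the remaining queries
        common.intersection_update(fetch_followers(user_id))
    return common


if __name__ == "__main__":
    # In-memory stand-in for the database, for illustration only.
    fake_table = {
        "1142282": ["a", "b", "c"],
        "346115": ["b", "c", "d"],
        "663620": ["c", "b"],
    }
    result = common_followers(list(fake_table), lambda uid: fake_table[uid])
    print(sorted(result))
```

If you know roughly how many followers each user has, passing the smallest one first should keep the working set small, since every later step can only shrink it.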

Answer 3

This should run faster, but the query needs to be constructed:

select  instagram_profile_id
    from  instagram_user_followers_mapping AS t
    WHERE  instagram_user_id = '1142282'
      AND  EXISTS
        (
        SELECT  *
            FROM  instagram_user_followers_mapping
            WHERE  instagram_profile_id = t.instagram_profile_id
              AND  instagram_user_id = '346115' 
        )
      AND  EXISTS 
        (
        SELECT  *
            FROM  instagram_user_followers_mapping
            WHERE  instagram_profile_id = t.instagram_profile_id
              AND  instagram_user_id = '663620' 
        )
      AND  EXISTS 
        (
        SELECT  *
            FROM  instagram_user_followers_mapping
            WHERE  instagram_profile_id = t.instagram_profile_id
              AND  instagram_user_id = '985530' 
        );

This formulation avoids a filesort and avoids gathering all the user_ids for a given profile_id (or vice versa).

Is innodb_buffer_pool_size bigger than the index size?
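
Since the EXISTS query has to be built for each combination of users, here is a minimal Python sketch of generating it. The function name `build_intersection_query` is made up for illustration, and the `%s` placeholder style is an assumption based on what common MySQL drivers such as PyMySQL and mysqlclient accept. It uses `SELECT 1` inside EXISTS, which behaves the same as `SELECT *` there.

```python
from typing import List, Sequence, Tuple

TABLE = "instagram_user_followers_mapping"


def build_intersection_query(user_ids: Sequence[str]) -> Tuple[str, List[str]]:
    """Return (sql, params) for the EXISTS-based intersection query."""
    if not user_ids:
        raise ValueError("at least one instagram_user_id is required")

    # The first user drives the outer scan; each further user adds one EXISTS.
    parts = [
        f"SELECT t.instagram_profile_id FROM {TABLE} AS t "
        "WHERE t.instagram_user_id = %s"
    ]
    for _ in user_ids[1:]:
        parts.append(
            f"AND EXISTS (SELECT 1 FROM {TABLE} AS x "
            "WHERE x.instagram_profile_id = t.instagram_profile_id "
            "AND x.instagram_user_id = %s)"
        )
    # Values are passed separately so the driver can escape them.
    return "\n".join(parts), list(user_ids)


if __name__ == "__main__":
    sql, params = build_intersection_query(
        ["1142282", "346115", "663620", "985530"]
    )
    print(sql)
    print(params)
```

With a DB-API cursor you would then call something like `cursor.execute(sql, params)`. Putting the user with the fewest followers first is likely to help, because that user determines how many rows the outer scan has to probe.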
