Speed Up A Large Insert From Select Query With Multiple Joins

Question

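I am trying to denormalize a few MySQL tables into a new table that I can use to speed up some complex queries with a lot of business logic. The problem I am having is that I need to add 2.3 million records to the new table, and to do that I need to pull data from multiple tables and do a few conversions as well. Here is my query (names have been changed):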

INSERT INTO database_name.log_set_logs
(offload_date, vehicle, jurisdiction, baselog_path, path,
 baselog_index_guid, new_location, log_set_name, index_guid) 
 (
select  STR_TO_DATE(logset_logs.offload_date, '%Y.%m.%d') as offload_date,
        logset_logs.vehicle, jurisdiction, baselog_path, path,
        baselog_trees.baselog_index_guid, new_location, logset_logs.log_set_name,
        logset_logs.index_guid
    from  
    (
        SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) as offload_date,
                SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle,
                SUBSTRING_INDEX(path, '/', 9) as baselog_path, index_guid,
                path, log_set_name
            FROM  database_name.baselog_and_amendment_guid_to_path_mappings 
    ) logset_logs
    left join  database_name.log_trees baselog_trees
         ON baselog_trees.original_location = logset_logs.baselog_path
    left join  database_name.baselog_offload_location location
         ON location.baselog_index_guid = baselog_trees.baselog_index_guid);

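The query itself works: I was able to run it with a filter on log_set_name. However, that filter's condition only covers less than 1% of the total records, because one of the values for log_set_name has 2.2 million records in it, which is the majority of the records. So, as far as I can see, there is nothing else I can use to break this query up into smaller chunks. The problem is that the query takes too long to run on the remaining 2.2 million records. It ends up timing out after a few hours, then the transaction is rolled back and nothing is added to the new table for those 2.2 million records. Only about 100,000 records could be processed, and that was because I could add a filter that says log_set_name != 'value with the 2.2 million records'.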

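Is there a way to make this query more performant? Am I trying to do too many joins at once, and should I perhaps populate the row's columns in their own queries? Or is there some way I can page this type of query so that MySQL executes it in batches? I already got rid of all the indexes on the log_set_logs table because I read that those slow down inserts. I also moved my RDS instance up to a db.r4.4xlarge write node. I am also using MySQL Workbench, so I increased all of its timeout values to their maximums, making them all 9s. All three of these steps helped and were necessary to get the 1% of the records into the new table, but they still were not enough to get the 2.2 million records in without timing out. I appreciate any insights, as I am not adept at doing this type of bulk insert from a select.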

CREATE TABLE `log_set_logs` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `purged` tinyint(1) NOT NULL DEFAUL,
  `baselog_path` text,
  `baselog_index_guid` varchar(36) DEFAULT NULL,
  `new_location` text,
  `offload_date` date NOT NULL,
  `jurisdiction` varchar(20) DEFAULT NULL,
  `vehicle` varchar(20) DEFAULT NULL,
  `index_guid` varchar(36) NOT NULL,
  `path` text NOT NULL,
  `log_set_name` varchar(60) NOT NULL,
  `protected_by_retention_condition_1` tinyint(1) NOT NULL DEFAULT '1',
  `protected_by_retention_condition_2` tinyint(1) NOT NULL DEFAULT '1',
  `protected_by_retention_condition_3` tinyint(1) NOT NULL DEFAULT '1',
  `protected_by_retention_condition_4` tinyint(1) NOT NULL DEFAULT '1',
  `general_comments_about_this_log` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1736707 DEFAULT CHARSET=latin1


CREATE TABLE `baselog_and_amendment_guid_to_path_mappings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `path` text NOT NULL,
  `index_guid` varchar(36) NOT NULL,
  `log_set_name` varchar(60) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `log_set_name_index` (`log_set_name`),
  KEY `path_index` (`path`(42))
) ENGINE=InnoDB AUTO_INCREMENT=2387821 DEFAULT CHARSET=latin1

...

CREATE TABLE `baselog_offload_location` (
  `baselog_index_guid` varchar(36) NOT NULL,
  `jurisdiction` varchar(20) NOT NULL,
  KEY `baselog_index` (`baselog_index_guid`),
  KEY `jurisdiction` (`jurisdiction`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1


CREATE TABLE `log_trees` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `baselog_index_guid` varchar(36) DEFAULT NULL,
  `original_location` text NOT NULL, -- This is what I have to join everything on, and since it's text I cannot index it, and the largest value is above 255 characters, so I cannot change it to a varchar and then index it either.
  `new_location` text,
  `distcp_returncode` int(11) DEFAULT NULL,
  `distcp_job_id` text,
  `distcp_stdout` text,
  `distcp_stderr` text,
  `validation_attempt` int(11) NOT NULL DEFAULT '0',
  `validation_result` tinyint(1) NOT NULL DEFAULT '0',
  `archived` tinyint(1) NOT NULL DEFAULT '0',
  `archived_at` timestamp NULL DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `dir_exists` tinyint(1) NOT NULL DEFAULT '0',
  `random_guid` tinyint(1) NOT NULL DEFAULT '0',
  `offload_date` date NOT NULL,
  `vehicle` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `baselog_index_guid` (`baselog_index_guid`)
) ENGINE=InnoDB AUTO_INCREMENT=1028617 DEFAULT CHARSET=latin1

Answer 1

baselog_offload_location has no PRIMARY KEY; what's up with that?
GUIDs/UUIDs can be terribly inefficient. A partial solution is to convert them to BINARY(16) to shrink them. More details: http://mysql.rjweb.org/doc.php/uuid (MySQL 8.0 has similar functions.)
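As a rough sketch of the BINARY(16) idea, something like the following could work. The column name `index_guid_bin` is made up for illustration, and the statements assume the stored GUIDs are standard 36-character hexadecimal strings with dashes. `UNHEX(REPLACE(...))` works on older versions as well; `UUID_TO_BIN()` and `BIN_TO_UUID()` exist only from MySQL 8.0 on.

```sql
-- Hypothetical packed copy of index_guid: 16 bytes instead of 36 characters.
ALTER TABLE database_name.baselog_and_amendment_guid_to_path_mappings
    ADD COLUMN index_guid_bin BINARY(16) NULL;

-- Strip the dashes, then pack the 32 hex digits into 16 bytes.
-- "WHERE id > 0" touches every row but keeps Workbench's safe-update mode happy.
UPDATE database_name.baselog_and_amendment_guid_to_path_mappings
   SET index_guid_bin = UNHEX(REPLACE(index_guid, '-', ''))
 WHERE id > 0;

-- MySQL 8.0 only: the built-in functions do the same packing and unpacking.
SELECT index_guid,
       UUID_TO_BIN(index_guid)              AS packed,
       BIN_TO_UUID(UUID_TO_BIN(index_guid)) AS round_trip
  FROM database_name.baselog_and_amendment_guid_to_path_mappings
 LIMIT 5;
```

The gain comes mostly from smaller index entries on the join columns, so it only pays off if the columns used in the JOINs are converted on both sides.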
It would probably be more efficient if you had a separate (optionally redundant) column for vehicle rather than needing to do
```
  SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle
```
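One way to get such a column, as a sketch only: add a plain column to the source table and fill it once. The column name `vehicle` on the mappings table is an assumption (the posted DDL does not have it), and VARCHAR(20) is chosen to match `log_set_logs.vehicle`.

```sql
-- Hypothetical redundant column holding the 8th path segment.
ALTER TABLE database_name.baselog_and_amendment_guid_to_path_mappings
    ADD COLUMN vehicle VARCHAR(20) NULL;

-- One-time backfill; new rows would need the same expression at insert time.
-- "WHERE id > 0" covers all rows and satisfies Workbench's safe-update mode.
UPDATE database_name.baselog_and_amendment_guid_to_path_mappings
   SET vehicle = SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1)
 WHERE id > 0;
```

The same pattern would apply to offload_date and baselog_path, which the query also derives from path.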
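Why JOIN baselog_offload_location? There seems to be no reference to columns in that table. If there is, be sure to qualify them so we know where they come from. Preferably use short aliases.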
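For illustration, here is how the SELECT might look with every column qualified. This assumes the unqualified columns resolve the way the posted DDL suggests (jurisdiction from baselog_offload_location, new_location from log_trees); the aliases m, t and loc are arbitrary.

```sql
SELECT  STR_TO_DATE(m.offload_date, '%Y.%m.%d') AS offload_date,
        m.vehicle, loc.jurisdiction, m.baselog_path, m.path,
        t.baselog_index_guid, t.new_location, m.log_set_name, m.index_guid
    FROM (
        -- m: one row per mapping, with the path split into its parts
        SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) AS offload_date,
                SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) AS vehicle,
                SUBSTRING_INDEX(path, '/', 9) AS baselog_path,
                index_guid, path, log_set_name
            FROM database_name.baselog_and_amendment_guid_to_path_mappings
    ) AS m
    LEFT JOIN database_name.log_trees AS t
         ON t.original_location = m.baselog_path
    LEFT JOIN database_name.baselog_offload_location AS loc
         ON loc.baselog_index_guid = t.baselog_index_guid;
```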
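The lack of an index on baselog_index_guid may be critical to performance.
Please provide EXPLAIN SELECT ... for the SELECT in your INSERT and for the original (slow) query.
SELECT MAX(LENGTH(original_location)) FROM .. -- to see whether it really is too big to index. What version of MySQL are you using? The limit was increased recently.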
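Spelled out against the posted tables, the checks could look like this. The index name `original_location_prefix` and the prefix length of 255 are assumptions for the sketch; the existing `path_index` on `path`(42) shows that a TEXT column can be indexed when a prefix length is given.

```sql
-- Which server version is this? The maximum index key length depends on it.
SELECT VERSION();

-- How long is the longest join value really?
SELECT MAX(LENGTH(original_location)) AS max_len
  FROM database_name.log_trees;

-- A prefix index covers only the first N characters of the TEXT column.
-- With latin1, 255 characters are 255 bytes.
ALTER TABLE database_name.log_trees
    ADD INDEX original_location_prefix (original_location(255));
```

A prefix index can still narrow the join considerably even if some values are longer than the prefix, since the server compares the full value for the rows the index returns.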
For the item above, we can talk about using a 'hash'.
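A minimal sketch of what a hash could look like here, assuming MD5 is acceptable as the hash function; the column and index names are invented for the example. The join looks rows up through the short indexed hash and then compares the full text to rule out collisions.

```sql
-- Hypothetical 16-byte hash of the long TEXT join column, plus an index on it.
ALTER TABLE database_name.log_trees
    ADD COLUMN original_location_md5 BINARY(16) NULL,
    ADD INDEX original_location_md5_idx (original_location_md5);

-- Backfill the hash ("WHERE id > 0" satisfies Workbench's safe-update mode).
UPDATE database_name.log_trees
   SET original_location_md5 = UNHEX(MD5(original_location))
 WHERE id > 0;

-- Join through the hash, then confirm with the full value.
SELECT  m.id, t.baselog_index_guid, t.new_location
    FROM (
        SELECT id, SUBSTRING_INDEX(path, '/', 9) AS baselog_path
            FROM database_name.baselog_and_amendment_guid_to_path_mappings
    ) AS m
    LEFT JOIN database_name.log_trees AS t
         ON  t.original_location_md5 = UNHEX(MD5(m.baselog_path))
         AND t.original_location     = m.baselog_path;
```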
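"Paging the query": I call it "chunking". See http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks. That is about deleting, but it can be adapted to INSERT .. SELECT, since what you want is to "chunk" the select. If you go with chunking, Javier's comment becomes moot. Your code would chunk the selects, and hence batch the inserts: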
```
  Loop:
      INSERT .. SELECT .. -- of up to 1000 rows (see link)
  End loop
```
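As one possible concrete form of that loop, here is a sketch written as a stored procedure. It walks the source table in fixed ranges of its `id` primary key, which is a simplification of the approach in the linked article. The procedure name and the `chunk_size` parameter are made up, the column sources follow the posted DDL, and it assumes autocommit is on so that each INSERT commits on its own.

```sql
DELIMITER //

CREATE PROCEDURE database_name.fill_log_set_logs_in_chunks(IN chunk_size INT)
BEGIN
    DECLARE last_id INT DEFAULT 0;   -- upper bound of the previous chunk
    DECLARE max_id  INT;             -- highest id in the source table

    SELECT MAX(id) INTO max_id
      FROM database_name.baselog_and_amendment_guid_to_path_mappings;

    -- If the source table is empty, max_id is NULL and the loop never runs.
    WHILE last_id < max_id DO
        INSERT INTO database_name.log_set_logs
            (offload_date, vehicle, jurisdiction, baselog_path, path,
             baselog_index_guid, new_location, log_set_name, index_guid)
        SELECT  STR_TO_DATE(m.offload_date, '%Y.%m.%d'),
                m.vehicle, loc.jurisdiction, m.baselog_path, m.path,
                t.baselog_index_guid, t.new_location, m.log_set_name, m.index_guid
            FROM (
                SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) AS offload_date,
                        SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) AS vehicle,
                        SUBSTRING_INDEX(path, '/', 9) AS baselog_path,
                        index_guid, path, log_set_name
                    FROM database_name.baselog_and_amendment_guid_to_path_mappings
                    -- Only this chunk of source rows, located via the PRIMARY KEY.
                    WHERE id > last_id AND id <= last_id + chunk_size
            ) AS m
            LEFT JOIN database_name.log_trees AS t
                 ON t.original_location = m.baselog_path
            LEFT JOIN database_name.baselog_offload_location AS loc
                 ON loc.baselog_index_guid = t.baselog_index_guid;

        SET last_id = last_id + chunk_size;
    END WHILE;
END //

DELIMITER ;

-- Copy the rows in chunks of up to 1000 source ids each.
CALL database_name.fill_log_set_logs_in_chunks(1000);
```

Because every chunk commits separately, a timeout only loses the chunk in flight. Rerunning the procedure from the start would insert the earlier chunks again, so a restart should either empty the target first or begin from the last id that was copied.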

mysql

query-optimization

mysql-workbench

amazon-rds