Optimizing SQL query with multiple joins and grouping (Postgres 9.3)
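I have browsed some other posts and managed to make my query run a bit faster. However, I'm at a loss as to how to optimize it further. I'm going to use it on a website where the query is executed on page load, but 5.5 seconds is far too long to wait for something that should be much simpler. The largest table has about 4,000,000 rows and the other ones have about 400,000 each.
Table structure
match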
id BIGINT PRIMARY KEY,
region TEXT,
matchType TEXT,
matchVersion TEXT
team
matchid BIGINT REFERENCES match(id),
id INTEGER,
PRIMARY KEY(matchid, id),
winner TEXT
champion
id INTEGER PRIMARY KEY,
version TEXT,
name TEXT
item
id INTEGER PRIMARY KEY,
name TEXT
participant
PRIMARY KEY(matchid, id),
id INTEGER NOT NULL,
matchid BIGINT REFERENCES match(id),
championid INTEGER REFERENCES champion(id),
teamid INTEGER,
FOREIGN KEY (matchid, teamid) REFERENCES team(matchid, id),
magicDamageDealtToChampions REAL,
damageDealtToChampions REAL,
item0 TEXT,
item1 TEXT,
item2 TEXT,
item3 TEXT,
item4 TEXT,
item5 TEXT,
highestAchievedSeasonTier TEXT
Query
select champion.name,
sum(case when participant.item0 = '3285' then 1::int8 else 0::int8 end) as it0,
sum(case when participant.item1 = '3285' then 1::int8 else 0::int8 end) as it1,
sum(case when participant.item2 = '3285' then 1::int8 else 0::int8 end) as it2,
sum(case when participant.item3 = '3285' then 1::int8 else 0::int8 end) as it3,
sum(case when participant.item4 = '3285' then 1::int8 else 0::int8 end) as it4,
sum(case when participant.item5 = '3285' then 1::int8 else 0::int8 end) as it5
from participant
left join champion
on champion.id = participant.championid
left join team
on team.matchid = participant.matchid and team.id = participant.teamid
left join match
on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;
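Output of EXPLAIN ANALYZE: http://explain.depesz.com/s/ZYX
What I have done so far
I have created separate indexes on match.region and participant.championid, and a partial index on team where winner = 'True' (since that is all I am interested in). Note that enable_seqscan = on, because the query is extremely slow when it is off. Essentially, the result I am aiming for looks like this: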
Champion |item0 | item1 | ... | item5
champ_name | num | num1 | ... | num5
...
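Since I'm still a beginner at database design, I wouldn't be surprised if there is a flaw in my overall table design. Still, I'm leaning towards the query itself being plain inefficient. I've played with both inner joins and left joins, with no significant difference. Also, match.id needs to be bigint (or something larger than integer, since integer is too small).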
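I would try using
count(*) filter (where item0 = '3285') as it0
for your counts instead of sum. Note that the aggregate FILTER clause requires Postgres 9.4 or later.
Also, why are you left joining the last 2 tables and then filtering on them in the where clause? That defeats the purpose of a left join; a regular inner join is faster.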
select champion.name,
count(*) filter (where participant.item0 = '3285') as it0,
count(*) filter (where participant.item1 = '3285') as it1,
count(*) filter (where participant.item2 = '3285') as it2,
count(*) filter (where participant.item3 = '3285') as it3,
count(*) filter (where participant.item4 = '3285') as it4,
count(*) filter (where participant.item5 = '3285') as it5
from participant
join champion on champion.id = participant.championid
join team on team.matchid = participant.matchid and team.id = participant.teamid
join match on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;
Database design
I suggest:
CREATE TABLE matchversion (
matchversion_id int PRIMARY KEY
, matchversion text UNIQUE NOT NULL
);
CREATE TABLE matchtype (
matchtype_id int PRIMARY KEY
, matchtype text UNIQUE NOT NULL
);
CREATE TABLE region (
region_id int PRIMARY KEY
, region text NOT NULL
);
CREATE TABLE match (
match_id bigint PRIMARY KEY
, region_id int REFERENCES region
, matchtype_id int REFERENCES matchtype
, matchversion_id int REFERENCES matchversion
);
CREATE TABLE team (
match_id bigint REFERENCES match
, team_id integer -- better name !
, winner boolean -- ?!
, PRIMARY KEY(match_id, team_id)
);
CREATE TABLE champion (
champion_id int PRIMARY KEY
, version text
, name text
);
CREATE TABLE participant (
participant_id serial PRIMARY KEY -- use proper name !
, champion_id int NOT NULL REFERENCES champion
, match_id bigint NOT NULL REFERENCES match -- this FK might be redundant
, team_id int
, magic_damage_dealt_to_champions real
, damage_dealt_to_champions real
, item0 text -- or integer ??
, item1 text
, item2 text
, item3 text
, item4 text
, item5 text
, highest_achieved_season_tier text -- integer ??
, FOREIGN KEY (match_id, team_id) REFERENCES team
);
More normalization gives you smaller tables and indexes and faster access. Create lookup tables for matchversion, matchtype and region, and only write a small integer ID in match.
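A minimal sketch of how the lookup tables might be filled from the existing data. It assumes the original table has been renamed to match_old (a hypothetical name, not from the question) and still carries the text columns region, matchtype and matchversion:

```sql
-- Assumption: match_old is the original match table (id, region, matchtype, matchversion).
-- row_number() hands out small surrogate IDs, one per distinct value.
INSERT INTO matchversion (matchversion_id, matchversion)
SELECT row_number() OVER (ORDER BY matchversion), matchversion
FROM  (SELECT DISTINCT matchversion FROM match_old WHERE matchversion IS NOT NULL) sub;

INSERT INTO matchtype (matchtype_id, matchtype)
SELECT row_number() OVER (ORDER BY matchtype), matchtype
FROM  (SELECT DISTINCT matchtype FROM match_old WHERE matchtype IS NOT NULL) sub;

INSERT INTO region (region_id, region)
SELECT row_number() OVER (ORDER BY region), region
FROM  (SELECT DISTINCT region FROM match_old WHERE region IS NOT NULL) sub;

-- Rewrite every match row with the integer IDs instead of the text values.
-- LEFT JOIN keeps matches whose text value was NULL.
INSERT INTO match (match_id, region_id, matchtype_id, matchversion_id)
SELECT o.id, r.region_id, mt.matchtype_id, mv.matchversion_id
FROM   match_old o
LEFT   JOIN region       r  USING (region)
LEFT   JOIN matchtype    mt USING (matchtype)
LEFT   JOIN matchversion mv USING (matchversion);
```

The lookup tables stay tiny, so the extra joins are cheap, while each row in match shrinks from three text values to three 4-byte integers.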
It seems the columns participant.item0 .. item5 and highestAchievedSeasonTier could be integer, but are defined as text?
The column team.winner seems to be boolean, but is defined as text.
I also changed the order of columns to improve efficiency. Details:
- Calculating and saving space in PostgreSQL
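If you keep the existing tables, the type changes suggested above could be applied in place. This is only a sketch under two assumptions that the question does not confirm: winner holds nothing but 'True' and 'False', and every non-null item value is a string of digits (an empty string would make the cast fail):

```sql
-- The existing partial index on team (WHERE winner = 'True') compares winner
-- to text and may have to be dropped before the type change.
-- Its name is not given in the question, so it is not spelled out here.

-- Assumption: winner is either 'True' or 'False'.
ALTER TABLE team
   ALTER COLUMN winner TYPE boolean USING (winner = 'True');

-- Assumption: all item values are numeric strings or NULL.
ALTER TABLE participant
   ALTER COLUMN item0 TYPE integer USING item0::integer
 , ALTER COLUMN item1 TYPE integer USING item1::integer
 , ALTER COLUMN item2 TYPE integer USING item2::integer
 , ALTER COLUMN item3 TYPE integer USING item3::integer
 , ALTER COLUMN item4 TYPE integer USING item4::integer
 , ALTER COLUMN item5 TYPE integer USING item5::integer;
```

Both statements rewrite the whole table under an exclusive lock, so they are best run while nothing else is using the tables. Afterwards the predicates become WHERE winner and item0 = 3285 (no quotes).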
Query
Based on the above modifications, and for Postgres 9.3:
SELECT c.name, *
FROM (
SELECT p.champion_id
, count(p.item0 = '3285' OR NULL) AS it0
, count(p.item1 = '3285' OR NULL) AS it1
, count(p.item2 = '3285' OR NULL) AS it2
, count(p.item3 = '3285' OR NULL) AS it3
, count(p.item4 = '3285' OR NULL) AS it4
, count(p.item5 = '3285' OR NULL) AS it5
FROM matchversion mv
CROSS JOIN matchtype mt
JOIN match m USING (matchtype_id, matchversion_id)
JOIN team t USING (match_id)
JOIN participant p USING (match_id, team_id)
WHERE mv.matchversion = '5.14'
AND mt.matchtype = 'RANKED_SOLO_5x5'
AND t.winner = 'True' -- should be boolean
GROUP BY p.champion_id
) p
JOIN champion c USING (champion_id); -- probably just JOIN ?
Since champion.name is not defined UNIQUE, it is probably just wrong to GROUP BY it. It is also inefficient. Use participant.championid instead (and join to champion later if you need the name in the result).
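All instances of LEFT JOIN are pointless, since you have predicates on the left-joined tables anyway and/or use the column in GROUP BY. A WHERE condition on a column of a left-joined table discards the NULL-extended rows, so the join behaves like a plain [INNER] JOIN.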
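A small self-contained illustration of that point, using made-up rows in CTEs named p and t (stand-ins for participant and team, not the real tables):

```sql
WITH p(id, teamid) AS (
   VALUES (1, 10), (2, 20), (3, NULL)     -- participant 3 has no team
)
, t(id, winner) AS (
   VALUES (10, 'True'), (20, 'False')
)
SELECT p.id
FROM   p
LEFT   JOIN t ON t.id = p.teamid
WHERE  t.winner = 'True';                 -- NULL-extended row for p.id = 3 is filtered out here
```

This returns only id = 1, exactly what the same query with JOIN instead of LEFT JOIN returns. The left join only adds the row for participant 3 so that the WHERE clause can remove it again.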
Parentheses around AND-ed WHERE conditions are not needed.
In Postgres 9.4 or later you can use the new aggregate FILTER syntax instead. Details and alternatives:
- How can I simplify this game statistics query?
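To see that the counting idioms used in this thread agree, here is a sketch over four made-up values (the inline VALUES list is an assumption for illustration, not data from the question):

```sql
SELECT sum(CASE WHEN item0 = '3285' THEN 1 ELSE 0 END) AS via_sum_case
     , count(item0 = '3285' OR NULL)                   AS via_count_or_null
     , count(*) FILTER (WHERE item0 = '3285')          AS via_filter  -- Postgres 9.4+; drop this line on 9.3
FROM  (VALUES ('3285'), ('1001'), (NULL), ('3285')) AS t(item0);
```

All three columns come out as 2. The count(... OR NULL) form works because FALSE OR NULL evaluates to NULL, and count() ignores NULL, so only rows where the comparison is TRUE are counted.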
Indexes
The partial index on team you already have should look like this to allow index-only scans:
CREATE INDEX on team (matchid, id) WHERE winner -- boolean
But from what I see, you might just add a winner column to participant and drop the table team completely (unless there is more to it).
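A minimal sketch of that denormalization against the original column names, assuming team.winner is still text holding 'True' or 'False':

```sql
-- Copy the team result onto every participant row.
ALTER TABLE participant ADD COLUMN winner boolean;

UPDATE participant p
SET    winner = (t.winner = 'True')
FROM   team t
WHERE  t.matchid = p.matchid
AND    t.id      = p.teamid;
```

With that in place the query no longer has to join team at all; WHERE p.winner replaces team.winner = 'True'. The foreign key from participant to team would have to be dropped before team itself can go. Keep in mind that the UPDATE writes a new version of every row, so a VACUUM afterwards is advisable.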
Also, that index is not going to help much because (telling from your query plan) the table has 800k rows, half of which qualify:
rows=399999 ... Filter: (winner = 'True'::text) ... Rows Removed by Filter: 399999
This index on match will help a little more (later), when you have more distinct match types and match versions:
CREATE INDEX on match (matchtype_id, matchversion_id, match_id);
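Still, while 100k of 400k rows qualify, the index is only useful for an index-only scan. Otherwise, a sequential scan will be faster. An index typically pays off when selecting about 5 % of the table or less.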
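One way to check whether the planner actually picks an index-only scan on that index, sketched against the suggested schema. The two ID values are hypothetical placeholders for whatever IDs '5.14' and 'RANKED_SOLO_5x5' get in the lookup tables:

```sql
VACUUM ANALYZE match;   -- index-only scans rely on an up-to-date visibility map

EXPLAIN (ANALYZE, BUFFERS)
SELECT m.match_id
FROM   match m
WHERE  m.matchtype_id    = 1    -- hypothetical ID for 'RANKED_SOLO_5x5'
AND    m.matchversion_id = 1;   -- hypothetical ID for '5.14'
```

In the output, look for an "Index Only Scan" node and a low "Heap Fetches" count. If you see a "Seq Scan" instead, the planner estimated that reading the whole table is cheaper for this share of rows.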
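Your main problem is that you are obviously running a test case with a hardly realistic data distribution. With more selective predicates, indexes will be used more readily.
Aside
Make sure you have configured basic Postgres settings like random_page_cost or work_mem etc.
enable_seqscan = on goes without saying. It is only turned off for debugging, or locally as a desperate measure of last resort.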
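A quick way to inspect these settings and try different values for the current session only. The numbers below are illustrative assumptions, not recommendations; sensible values depend on your hardware and workload:

```sql
SHOW random_page_cost;
SHOW work_mem;
SHOW enable_seqscan;

SET work_mem = '64MB';        -- illustrative value, applies to this session only
SET random_page_cost = 2;     -- illustrative value, applies to this session only
RESET enable_seqscan;         -- back to the configured default
```

Re-run EXPLAIN ANALYZE after each change to see whether the plan or the timing moves. Once you have settled on values, put them in postgresql.conf.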