Retrieve data from SQL Server and concatenate results over rows based on grouping
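I have been working on a problem for a few days and have finally arrived at a solution that works for me. In case the solution is useful to others, I am asking the question and answering it myself.
I have read-only access to a large SQL Server database containing over 1 million records. Some of the tables in the database are linked by many-to-many relationships via lookup tables. To simplify, the tables can be illustrated as below: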
table names
|-----------|
| id | name |
|----|------|
| 1  | dave |
| 2  | phil |
| 3  | john |
| 4  | pete |
|-----------|

table foods
|---------------|
| id | food     |
|----|----------|
| 1  | beef     |
| 2  | tomatoes |
| 3  | bacon    |
| 4  | cheese   |
| 5  | apples   |
|---------------|

table clothes
|---------------|
| id | clothes  |
|----|----------|
| 1  | trousers |
| 2  | shorts   |
| 3  | shirt    |
| 4  | socks    |
| 5  | jumper   |
| 6  | jacket   |
|---------------|

table food_relationships
|--------------------------|
| id | names_id | foods_id |
|----|----------|----------|
| 1  | 1        | 1        |
| 2  | 1        | 3        |
| 3  | 1        | 4        |
| 4  | 2        | 2        |
| 5  | 2        | 3        |
| 6  | 2        | 4        |
| 7  | 2        | 5        |
| 8  | 3        | 3        |
| 9  | 3        | 5        |
| 10 | 4        | 1        |
| 11 | 4        | 2        |
| 12 | 4        | 3        |
| 13 | 4        | 5        |
|--------------------------|

table clothes_relationships
|----------------------------|
| id | names_id | clothes_id |
|----|----------|------------|
| 1  | 1        | 1          |
| 2  | 1        | 3          |
| 3  | 1        | 4          |
| 4  | 2        | 2          |
| 5  | 2        | 3          |
| 6  | 2        | 4          |
| 7  | 3        | 1          |
| 8  | 3        | 3          |
| 9  | 3        | 5          |
| 10 | 4        | 2          |
| 11 | 4        | 4          |
| 12 | 4        | 5          |
|----------------------------|
The tables can be recreated with the following SQL (adapted from a MySQL database, so it may need some minor tweaking to work in SQL Server):
CREATE TABLE `clothes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`clothes` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `clothes` (`id`, `clothes`)
VALUES
(1,'trousers'),
(2,'shorts'),
(3,'shirt'),
(4,'socks'),
(5,'jumper'),
(6,'jacket');
CREATE TABLE `clothes_relationships` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`names_id` int(11) DEFAULT NULL,
`clothes_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `clothes_relationships` (`id`, `names_id`, `clothes_id`)
VALUES
(1,1,1),
(2,1,3),
(3,1,4),
(4,2,2),
(5,2,3),
(6,2,4),
(7,3,1),
(8,3,3),
(9,3,5),
(10,4,2),
(11,4,4),
(12,4,5);
CREATE TABLE `food_relationships` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`names_id` int(11) DEFAULT NULL,
`foods_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `food_relationships` (`id`, `names_id`, `foods_id`)
VALUES
(1,1,1),
(2,1,3),
(3,1,4),
(4,2,2),
(5,2,3),
(6,2,4),
(7,2,5),
(8,3,3),
(9,3,5),
(10,4,1),
(11,4,2),
(12,4,3),
(13,4,5);
CREATE TABLE `foods` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`food` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `foods` (`id`, `food`)
VALUES
(1,'beef'),
(2,'tomatoes'),
(3,'bacon'),
(4,'cheese'),
(5,'apples');
CREATE TABLE `names` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `names` (`id`, `name`)
VALUES
(1,'dave'),
(2,'phil'),
(3,'john'),
(4,'pete');
I want to query the database and somehow get the following output:
|-------------------------------------------------------------|
| name | food                         | clothes               |
|------|------------------------------|-----------------------|
| dave | beef,cheese,bacon            | trousers,socks,shirt  |
| john | apples,bacon                 | jumper,shirt,trousers |
| pete | beef,apples,bacon,tomatoes   | shorts,jumper,socks   |
| phil | bacon,tomatoes,apples,cheese | shirt,shorts,socks    |
|-------------------------------------------------------------|
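However, running a SELECT query that joins the 'names' table to one or both of the other tables (via the corresponding lookup tables) produces multiple rows per name. For example: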
SELECT
names.name,
foods.food
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id;
...produces the following result set:
|-----------------|
| name | food |
|------|----------|
| dave | beef |
| dave | bacon |
| dave | cheese |
| phil | tomatoes |
| phil | bacon |
| phil | cheese |
| phil | apples |
| john | bacon |
| john | apples |
| pete | beef |
| pete | tomatoes |
| pete | bacon |
| pete | apples |
|-----------------|
The problem is compounded if the SELECT query returns data from both tables:
SELECT
names.name,
foods.food,
clothes.clothes
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id;
|-----------------------------|
| name | food | clothes |
|------|----------|-----------|
| dave | beef | trousers |
| dave | beef | shirt |
| dave | beef | socks |
| dave | bacon | trousers |
| dave | bacon | shirt |
| dave | bacon | socks |
| dave | cheese | trousers |
| dave | cheese | shirt |
| dave | cheese | socks |
| phil | tomatoes | shorts |
| phil | tomatoes | shirt |
| phil | tomatoes | socks |
| phil | bacon | shorts |
| phil | bacon | shirt |
| phil | bacon | socks |
| phil | cheese | shorts |
| phil | cheese | shirt |
| phil | cheese | socks |
| phil | apples | shorts |
| phil | apples | shirt |
| phil | apples | socks |
| ...
| etc.
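The row explosion above is a per-person cross product: every food row for a name is paired with every clothes row for that name. As a rough illustration only (it uses small pandas DataFrames built from the sample data rather than a live SQL Server connection, and the variable names are my own), the counts can be checked like this:

```python
import pandas as pd

# Sample data copied from the tables above (lookup ids only; names are illustrative).
names = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["dave", "phil", "john", "pete"]})
food_rel = pd.DataFrame({
    "names_id": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4],
    "foods_id": [1, 3, 4, 2, 3, 4, 5, 3, 5, 1, 2, 3, 5],
})
clothes_rel = pd.DataFrame({
    "names_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "clothes_id": [1, 3, 4, 2, 3, 4, 1, 3, 5, 2, 4, 5],
})

# pandas equivalent of the two LEFT JOINs onto the relationship tables.
joined = (
    names.merge(food_rel, how="left", left_on="id", right_on="names_id")
         .drop(columns="names_id")
         .merge(clothes_rel, how="left", left_on="id", right_on="names_id")
         .drop(columns="names_id")
)

# Rows per name equal (number of foods) x (number of clothes) for that name.
summary = pd.DataFrame({
    "foods": joined.groupby("name")["foods_id"].nunique(),
    "clothes": joined.groupby("name")["clothes_id"].nunique(),
    "joined_rows": joined.groupby("name").size(),
})
print(summary)
print("total rows:", len(joined))
```

With the sample data, dave has 3 foods and 3 clothes and so appears on 9 rows; the four people together produce 39 rows.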
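The question is: how can I query the SQL Server database to retrieve all of the data, but process it so that there is only one row per person?
If the database were MySQL, the solution would be relatively easy, because MySQL has a GROUP_CONCAT function that concatenates rows. So, for one of the tables, I could use: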
SELECT
names.name,
GROUP_CONCAT(foods.food)
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
GROUP BY (names.name);
...giving:
name food
dave beef,cheese,bacon
john apples,bacon
pete beef,apples,bacon,tomatoes
phil bacon,tomatoes,apples,cheese
To get the equivalent data from both the 'foods' and 'clothes' tables, I could use something like:
SELECT
temp_foods_table.name AS 'name',
temp_foods_table.food AS 'food',
temp_clothes_table.clothes AS 'clothes'
FROM
(
SELECT
names.name,
GROUP_CONCAT(foods.food) AS 'food'
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
GROUP BY (names.name)
) AS temp_foods_table
LEFT JOIN
(
SELECT
names.name,
GROUP_CONCAT(clothes.clothes) AS 'clothes'
FROM
names
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id
GROUP BY (names.name)
) AS temp_clothes_table
ON temp_foods_table.name = temp_clothes_table.name;
...which gives the following result:
name food clothes
dave beef,cheese,bacon trousers,socks,shirt
john apples,bacon jumper,shirt,trousers
pete beef,apples,bacon,tomatoes shorts,jumper,socks
phil bacon,tomatoes,apples,cheese shirt,shorts,socks
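For anyone who wants to try the two-subquery pattern without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (an assumption made to keep the example runnable); it happens to offer a group_concat aggregate similar to MySQL's, and none of this applies to SQL Server. SQLite does not guarantee the order of the concatenated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate the sample tables in SQLite syntax (types simplified).
conn.executescript("""
CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE foods (id INTEGER PRIMARY KEY, food TEXT);
CREATE TABLE clothes (id INTEGER PRIMARY KEY, clothes TEXT);
CREATE TABLE food_relationships (id INTEGER PRIMARY KEY, names_id INTEGER, foods_id INTEGER);
CREATE TABLE clothes_relationships (id INTEGER PRIMARY KEY, names_id INTEGER, clothes_id INTEGER);
INSERT INTO names VALUES (1,'dave'),(2,'phil'),(3,'john'),(4,'pete');
INSERT INTO foods VALUES (1,'beef'),(2,'tomatoes'),(3,'bacon'),(4,'cheese'),(5,'apples');
INSERT INTO clothes VALUES (1,'trousers'),(2,'shorts'),(3,'shirt'),(4,'socks'),(5,'jumper'),(6,'jacket');
INSERT INTO food_relationships VALUES
  (1,1,1),(2,1,3),(3,1,4),(4,2,2),(5,2,3),(6,2,4),(7,2,5),
  (8,3,3),(9,3,5),(10,4,1),(11,4,2),(12,4,3),(13,4,5);
INSERT INTO clothes_relationships VALUES
  (1,1,1),(2,1,3),(3,1,4),(4,2,2),(5,2,3),(6,2,4),
  (7,3,1),(8,3,3),(9,3,5),(10,4,2),(11,4,4),(12,4,5);
""")

# Aggregate each many-to-many side separately, then join the two
# one-row-per-name results, so no cross product is ever formed.
query = """
SELECT f.name, f.food, c.clothes
FROM (
    SELECT names.name AS name, group_concat(foods.food) AS food
    FROM names
    LEFT JOIN food_relationships ON names.id = food_relationships.names_id
    LEFT JOIN foods ON food_relationships.foods_id = foods.id
    GROUP BY names.name
) AS f
LEFT JOIN (
    SELECT names.name AS name, group_concat(clothes.clothes) AS clothes
    FROM names
    LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
    LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id
    GROUP BY names.name
) AS c ON f.name = c.name
ORDER BY f.name
"""
rows = conn.execute(query).fetchall()
conn.close()

# Map each name to its (food string, clothes string) pair.
result = {name: (food, clothes) for name, food, clothes in rows}
for name, (food, clothes) in result.items():
    print(name, food, clothes, sep=" | ")
```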
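However, the situation in SQL Server seems to be much less straightforward. There are a few solutions suggested online for a single table, which include using common table expressions or FOR XML PATH. However, all of the solutions seem to have drawbacks and give the distinct impression that they are workarounds rather than specifically designed features. Each of the suggested solutions has some weakness (for example, the FOR XML PATH solution assumes that the text is XML, so special characters in the text can cause problems). Additionally, some commenters expressed concern that such workarounds are based on undocumented or deprecated features and so may not be reliable in the long term.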
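To make the special-character concern concrete, here is a small Python analogy (not SQL Server itself): it applies standard XML escaping to made-up values, which is roughly the kind of change that treating the text as XML introduces. The sample strings are invented for illustration.

```python
from xml.sax.saxutils import escape

# Made-up values containing characters that are special in XML.
values = ["fish & chips", "salt <low>", "bacon"]

# Plain concatenation keeps the text as it is.
naive = ",".join(values)

# Escaping each value as XML text changes &, < and > into entities.
xml_style = ",".join(escape(v) for v in values)

print(naive)      # fish & chips,salt <low>,bacon
print(xml_style)  # fish &amp; chips,salt &lt;low&gt;,bacon
```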
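Consequently, I decided not to tie myself up in SQL knots and instead to process the data post-retrieval using Python and Pandas. I always transfer the data to a Pandas dataframe for plotting and analysis anyway, so this was not a major inconvenience. To concatenate the data over multiple columns, I used groupby(). However, because there are two many-to-many tables, there are duplicates in each column, so the final concatenated string includes all of those duplicates. To keep only unique values, I used Python sets (which, by definition, can contain only unique values). The only potential downside of this method is that the order of the strings is not maintained but, for my situation, that is not an issue. The final Python solution looks something like this: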
Import the necessary libraries:
>>> import pandas as pd
>>> import pymssql
>>> import getpass
Enter the details required to connect to the database:
>>> myServer = input("Enter server address: ")
>>> myUser = input("Enter username: ")
>>> myPwd = getpass.getpass("Enter password: ")
Create the connection:
>>> myConnection = pymssql.connect(server=myServer, user=myUser, password=myPwd, port='1433')
Define the query to retrieve the necessary data:
>>> myQuery = """SELECT
names.name,
foods.food,
clothes.clothes
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id """
Run the query, put the results in a dataframe and close the connection:
>>> tempDF = pd.read_sql(myQuery, con=myConnection)
>>> myConnection.close()
Concatenate the strings over multiple rows and remove the duplicates:
>>> tempDF = tempDF.groupby('name').agg(lambda col: ','.join(set(col)))
Print the final dataframe:
>>> print(tempDF)
name food clothes
dave beef,bacon,cheese socks,trousers,shirt
john bacon,apples jumper,trousers,shirt
pete tomatoes,beef,bacon,apples socks,jumper,shorts
phil tomatoes,bacon,cheese,apples socks,shorts,shirt
To me, this solution is far more intuitive than trying to do all of the data processing within the SQL query. Hopefully this helps someone else.
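If the order of the values does matter, one possible variation on the set-based aggregation is to de-duplicate with dict.fromkeys, which keeps first-seen order, and to drop the NULLs that a LEFT JOIN can produce before joining. This is only a sketch: the helper name join_unique and the small dataframe (including a person with no clothes rows) are made up for illustration.

```python
import pandas as pd

def join_unique(col):
    """Join the distinct non-null values of a column, keeping first-seen order."""
    return ",".join(dict.fromkeys(col.dropna()))

# Hypothetical joined rows; john has no clothes, as a LEFT JOIN might return.
tempDF = pd.DataFrame({
    "name": ["dave", "dave", "dave", "dave", "john", "john"],
    "food": ["beef", "beef", "bacon", "bacon", "bacon", "apples"],
    "clothes": ["trousers", "shirt", "trousers", "shirt", None, None],
})

# One row per name, each column reduced to a comma-separated string.
result = tempDF.groupby("name").agg(join_unique)
print(result)
```

A group with no non-null values ends up as an empty string rather than raising an error, which ','.join(set(col)) would do if the column contained a missing value.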
If it is MS SQL Server..
You can use the STUFF function. For example:
DECLARE @Heroes TABLE (
[HeroName] VARCHAR(20)
)
INSERT INTO @Heroes ( [HeroName] )
VALUES ('Superman'),('Batman'),('Ironman'),('Wolverine')
SELECT STUFF((SELECT ',' + [HeroName]
FROM @Heroes
ORDER BY [HeroName]
FOR XML PATH('')), 1, 1, '') AS [Output]
Output
Batman,Ironman,Superman,Wolverine
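To see what the STUFF call contributes in the query above, here is a rough Python imitation of the two steps. The stuff helper is my own approximation of the T-SQL function (it ignores edge cases such as out-of-range positions) and is not a real library call.

```python
def stuff(text, start, length, replacement):
    """Rough Python equivalent of T-SQL STUFF: delete `length` characters
    beginning at 1-based position `start`, then insert `replacement` there."""
    i = start - 1
    return text[:i] + replacement + text[i + length:]

heroes = ["Superman", "Batman", "Ironman", "Wolverine"]

# Step 1: the inner SELECT ... ORDER BY ... FOR XML PATH('') builds one string
# in which every name is prefixed by a comma.
prefixed = "".join("," + h for h in sorted(heroes))

# Step 2: STUFF(..., 1, 1, '') removes the single leading comma.
output = stuff(prefixed, 1, 1, "")

print(prefixed)  # ,Batman,Ironman,Superman,Wolverine
print(output)    # Batman,Ironman,Superman,Wolverine
```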
I think this should answer your question.
Thanks