Joining hierarchical data to another table in R
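I have 2 data frames (drug and class) that I need to join by the last-level ATC classification code, adding 4 additional columns with the corresponding parent levels.
I came up with 2 solutions, but the first one is very verbose and the second one uses MS Access (which I want to avoid).
Also, if I had more levels, the code would get even more verbose than it is now.
Is there a more elegant solution to this problem? How can I perform this kind of self-join in R the way I do in Access?
I'm a beginner in both R and SQL, so any help is appreciated :)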
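Sample data info
drug
- cols: ID, ProductName, level5
- each row is one product (drug) with a unique ID and its level-5 ATC classification code (see Wiki-ATC)
class
- cols: code, className
- code contains all levels of the ATC classification, level 1 to level 5, in the same column
- note: this table is read-only for me
Short explanation of the classification and its levels
drug$level5 holds the level-5 class codes:
Level5 - A10BA02 (metformin). It is a member of Level4 - A10BA (biguanides), Level3 - A10B (antidiabetics, ex. insulins), Level2 - A10 (antidiabetics) and Level1 - A (alimentary tract and metabolism).
Each level is strictly defined by its length (L1 = 1 char, L2 = 3 chars, L3 = 4 chars, L4 = 5 chars, L5 = 7 chars).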
| Level | Code | Name |
|---------|---------|---------------------------------|
| Level5* | A10BA02 | metformin |
| Level4 | A10BA | biguanides |
| Level3 | A10B | antidiabetics, ex. insulins |
| Level2 | A10 | antidiabetics |
| Level1 | A | alimentary tract and metabolism |
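The length rule can be sketched in a few lines of base R. This is only an illustration: `level_widths` and the helper `atc_parents()` are hypothetical names that do not appear in the question.

```r
# Prefix length that defines each ATC level (taken from the rule above)
level_widths <- c(L1 = 1, L2 = 3, L3 = 4, L4 = 5, L5 = 7)

# Hypothetical helper: return the code of every level for one level-5 code
atc_parents <- function(code, widths = level_widths) {
  # substr() keeps the first w characters; vapply() keeps the level names
  vapply(widths, function(w) substr(code, 1, w), character(1))
}

parents <- atc_parents("A10BA02")
print(parents)
```

For A10BA02 this gives the same five codes as the table above, so a parent code never has to be stored separately from the level-5 code.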
Sample data

    drug <- data.frame(ID = 1:5,
                       ProductName = c('ABC', 'CDE', 'FGH', 'IJK', 'LMN'),
                       level5 = c('A10BA02', 'C01BA02', 'C03CA01', 'C03CA03', 'C01BA02'),
                       stringsAsFactors = FALSE)

    class <- data.frame(code = c('A', 'A10', 'A10B', 'A10BA', 'A10BA02', 'C', 'C01', 'C01B', 'C01BA',
                                 'C01BA02', 'C03', 'C03C', 'C03CA', 'C03CA01', 'C03CA03', 'C07', 'C07A',
                                 'C07AA', 'C07AA03'),
                        className = c('Alimentary tract and metabolism',
                                      'Antidiabetics', 'Antidiabetics, except insulins',
                                      'Biguanides', 'Metformin', 'Cardiovascular system',
                                      'Cardiacs', 'Antiarythmics, grp I and III',
                                      'Antiarythmics, grp IA', 'Procainamide', 'Diuretics',
                                      'Diuretics strong', 'Sulfonamides', 'Furosemide',
                                      'Piretanide', 'Betablockers', 'Betablockers',
                                      'Non-selective betablockers', 'Pindolol'),
                        stringsAsFactors = FALSE)

    # print
    drug
    head(class, 8)
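Goal
I want to left join class onto the drug data frame, with the resulting df as shown below.
The resulting table should have one additional column for each level from 1 to 5.
The aim is to build a filter hierarchy in which you first filter products by level 1, then by level 2, and so on.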
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| ID | ProductName | L1 | L2 | L3 | L4 | L5 |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| 1 | ABC | A - Alimentary tract and metabolism | A10 - Antidiabetics | A10B - Antidiabetics, except insulins | A10BA - Biguanides | A10BA02 - Metformin |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| 2 | CDE | C - Cardiovascular system | C01 - Cardiacs | C01B - Antiarythmics, grp I and III | C01BA - Antiarythmics, grp IA | C01BA02 - Procainamide |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
...
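My dirty solution N.1 using only R
I came up with a solution that is not pretty and very verbose: I substr() drug$level5 down to each level, then perform left_join() for each level and unite() the columns afterwards.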

    library(tidyr)
    library(dplyr)

    sol1 <- drug %>%
      mutate(level1 = substr(level5, 1, 1),
             level2 = substr(level5, 1, 3),
             level3 = substr(level5, 1, 4),
             level4 = substr(level5, 1, 5)) %>%
      left_join(class, by = c('level1' = 'code')) %>%
      left_join(class, by = c('level2' = 'code')) %>%
      left_join(class, by = c('level3' = 'code')) %>%
      left_join(class, by = c('level4' = 'code')) %>%
      left_join(class, by = c('level5' = 'code')) %>%
      select(ID:level4,
             level1name = className.x,
             level2name = className.y,
             level3name = className.x.x,
             level4name = className.y.y,
             level5name = className
      ) %>%
      unite(L1, level1, level1name, sep = ' - ') %>%
      unite(L2, level2, level2name, sep = ' - ') %>%
      unite(L3, level3, level3name, sep = ' - ') %>%
      unite(L4, level4, level4name, sep = ' - ') %>%
      unite(L5, level5, level5name, sep = ' - ')

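My solution N.2 using an Access self-join
Another solution is to use a self-join in MS Access to reshape the class table, creating additional columns for each level, and then simply join this table onto the drug df in R.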

    -- sqlReshapedTable
    SELECT A.code AS L5,
           A.className AS className,
           L1.code + ' ' + L1.className AS L1,
           L2.code + ' ' + L2.className AS L2,
           L3.code + ' ' + L3.className AS L3,
           L4.code + ' ' + L4.className AS L4
    FROM
    (((class AS A
    INNER JOIN class AS L1 ON L1.code = LEFT(A.code, 1))
    INNER JOIN class AS L2 ON L2.code = LEFT(A.code, 3))
    INNER JOIN class AS L3 ON L3.code = LEFT(A.code, 4))
    INNER JOIN class AS L4 ON L4.code = LEFT(A.code, 5);

    # the query aliases A.code as L5, so that is the join key on the right side
    sol2 <- drug %>%
      left_join(sqlReshapedTable, by = c('level5' = 'L5'))

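The Access self-join above can be approximated in base R with one merge() per parent level. This is a minimal sketch under a few assumptions: it uses only a subset of the sample class table, it uses ' - ' as the separator to match the goal table, and the names `reshaped`, `parent`, `widths`, `key` and `label` are illustrative rather than taken from the question.

```r
# Subset of the sample class table (two level-5 codes and their parents)
class <- data.frame(
  code = c("A", "A10", "A10B", "A10BA", "A10BA02",
           "C", "C01", "C01B", "C01BA", "C01BA02"),
  className = c("Alimentary tract and metabolism", "Antidiabetics",
                "Antidiabetics, except insulins", "Biguanides", "Metformin",
                "Cardiovascular system", "Cardiacs",
                "Antiarythmics, grp I and III", "Antiarythmics, grp IA",
                "Procainamide"),
  stringsAsFactors = FALSE
)

# Left side of the self-join: the level-5 rows only (codes of length 7)
reshaped <- class[nchar(class$code) == 7, ]
names(reshaped) <- c("L5", "className")

# Right side of the self-join: every code with its "code - name" label
parent <- data.frame(key = class$code,
                     label = paste(class$code, class$className, sep = " - "),
                     stringsAsFactors = FALSE)

# One join per parent level; the widths follow the length rule
widths <- c(L1 = 1, L2 = 3, L3 = 4, L4 = 5)
for (lvl in names(widths)) {
  # Join key: the level-5 code truncated to this level's width
  reshaped$key <- substr(reshaped$L5, 1, widths[[lvl]])
  reshaped <- merge(reshaped, parent, by = "key", all.x = TRUE)
  # Rename the joined label to the level name and drop the temporary key
  names(reshaped)[names(reshaped) == "label"] <- lvl
  reshaped$key <- NULL
}
reshaped <- reshaped[order(reshaped$L5), ]
print(reshaped)
```

The result plays the role of sqlReshapedTable and could then be joined to drug on level5 = L5. Supporting another level would only require one more entry in `widths`.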
Thank you very much for your help!
Maybe not the best solution, but it seems to work for your case (I refer to your data frame class as dclass):

    library(tidyverse)

    drug %>%
      group_by(ID, ProductName) %>%
      summarise(
        # cut level5 down to the prefix of each level (1, 3, 4, 5 and 7 chars)
        code = list(map_chr(c(1, 3:5, 7), ~ gsub(sprintf('(^.{%s}).*', .x), '\\1', level5)))
      ) %>%
      unnest(code) %>%
      left_join(dclass, by = 'code') %>%
      rename_all(tolower) %>%
      mutate(
        # level number = count of letters plus digit groups in the code
        key = paste('L', str_count(code, '\\D|\\d+'), sep = ''),
        val = paste(code, classname, sep = ' - '),
        classname = NULL,
        code = NULL
      ) %>%
      spread(key, val) %>%
      ungroup() %>%
      arrange(L5) %>%
      rename('ID' = id, 'Product Name' = productname)

Output:
# A tibble: 5 x 7
# ID `Product Name` L1 L2 L3 L4 L5
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1 ABC A - Alimentary trac… A10 - Antid… A10B - Antidiabetics… A10BA - Biguanid… A10BA02 - Me…
#2 2 CDE C - Cardiovascular … C01 - Cardi… C01B - Antiarythmics… C01BA - Antiaryt… C01BA02 - Pr…
#3 5 LMN C - Cardiovascular … C01 - Cardi… C01B - Antiarythmics… C01BA - Antiaryt… C01BA02 - Pr…
#4 3 FGH C - Cardiovascular … C03 - Diure… C03C - Diuretics str… C03CA - Sulfonam… C03CA01 - Fu…
#5 4 IJK C - Cardiovascular … C03 - Diure… C03C - Diuretics str… C03CA - Sulfonam… C03CA03 - Pi…
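For comparison, the same reshaping can be sketched in base R with a lookup via match() in place of the joins. This is an alternative sketch, not part of the answer above: it uses a subset of the sample data, and `widths`, `labels` and `result` are illustrative names.

```r
# Subset of the sample data from the question
drug <- data.frame(ID = 1:2,
                   ProductName = c("ABC", "CDE"),
                   level5 = c("A10BA02", "C01BA02"),
                   stringsAsFactors = FALSE)
dclass <- data.frame(
  code = c("A", "A10", "A10B", "A10BA", "A10BA02",
           "C", "C01", "C01B", "C01BA", "C01BA02"),
  className = c("Alimentary tract and metabolism", "Antidiabetics",
                "Antidiabetics, except insulins", "Biguanides", "Metformin",
                "Cardiovascular system", "Cardiacs",
                "Antiarythmics, grp I and III", "Antiarythmics, grp IA",
                "Procainamide"),
  stringsAsFactors = FALSE
)

# Prefix length of each level, following the length rule in the question
widths <- c(L1 = 1, L2 = 3, L3 = 4, L4 = 5, L5 = 7)

# One "code - name" column per level: truncate the code, then look up its name
labels <- lapply(widths, function(w) {
  code <- substr(drug$level5, 1, w)
  paste(code, dclass$className[match(code, dclass$code)], sep = " - ")
})

result <- cbind(drug[c("ID", "ProductName")],
                as.data.frame(labels, stringsAsFactors = FALSE))
print(result)
```

Because the levels live in a single named vector, adding a level means adding one element to `widths` rather than another join. Note that a code missing from dclass would show up as "code - NA" in this sketch.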