How to retrieve a value in one data frame by matching a string within an entire column of another data frame?
Suppose I have a data frame, df1, that looks like this:
> df1
probe OMIM
1 1565034_s_at 601464
2 201000_at 601065 /// 613287 /// 616339
3 204565_at 615652
4 205355_at 600301 /// 610006
5 205734_s_at 601464
6 205735_s_at 601464
7 206527_at 137150 /// 613163
8 209173_at 606358
9 209459_s_at 137150 /// 613163
10 209460_at 137150 /// 613163
11 215465_at
12 223864_at 610856
13 224742_at 612674 /// 613599
And a second data frame, df2:
> df2
platprobe symbol
1 1565034_s_at,205734_s_at,242078_at,205735_s_at AFF3
2 201000_at AARS
3 201884_at DNALI1
4 202779_s_at PLK1
5 204565_at ACOT13
6 205355_at,226030_at ACADSB
7 205808_at,207284_s_at,209135_at,210896_s_at LIMCH1
8 206164_at,206165_s_at,206166_s_at,217528_at SLC7A8
9 206527_at,209459_s_at,209460_at ABAT
10 209173_at,228969_at AGR2
11 215465_at ABCA12
12 221024_s_at TMEM144
13 223864_at ANKRD30A
14 224742_at,228123_s_at,228124_at ABHD12
15 225421_at,225431_x_at GALNT7
16 226120_at PSAT1
17 228241_at AGR3
I want to add a new column, df1$symbol, to df1 based on matches between df1$probe and df2$platprobe. The result should look like this:
> df1
probe OMIM symbol
1 1565034_s_at 601464 AFF3
2 201000_at 601065 /// 613287 /// 616339 AARS
3 204565_at 615652 ACOT13
4 205355_at 600301 /// 610006 ACADSB
5 205734_s_at 601464 AFF3
6 205735_s_at 601464 AFF3
7 206527_at 137150 /// 613163 ABAT
8 209173_at 606358 AGR2
9 209459_s_at 137150 /// 613163 ABAT
10 209460_at 137150 /// 613163 ABAT
11 215465_at ABCA12
12 223864_at 610856 ANKRD30A
13 224742_at 612674 /// 613599 ABHD12
The challenging part for me is that df2$platprobe in many cases contains several probe IDs besides the one found in df1$probe. So, if I try:
# This retrieves only exact matches (rows where df2$platprobe holds a single value, such as the one for ABCA12):
df1$symbol <- df2$symbol[df2$platprobe %in% df1$probe]
# And using 'grepl' doesn't work either:
# (I used 'unlist' and 'strsplit' because I thought that breaking all the values
# of df2$platprobe into a single object might work. But it doesn't.)
df1$symbol <- df2$symbol[grepl(df1$probe, unlist(strsplit(paste(df2$platprobe, sep=",", collapse=","), ",")))]
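Any help is greatly appreciated.
PS: If anyone has a better idea for the question title, suggestions are very welcome.
Update
Thanks, @Anoushiravan R. Sorry for not providing reproducible data frames up front. Here they are: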
df1 <- data.frame(probe=c("1565034_s_at", "201000_at", "204565_at",
"205355_at", "205734_s_at", "205735_s_at", "206527_at", "209173_at",
"209459_s_at", "209460_at", "215465_at", "223864_at", "224742_at"
), OMIM = c("601464", "601065 /// 613287 /// 616339", "615652",
"600301 /// 610006", "601464", "601464", "137150 /// 613163",
"606358", "137150 /// 613163", "137150 /// 613163", "", "610856",
"612674 /// 613599"))
df2 <- data.frame(platprobe = c(
"1565034_s_at, 205734_s_at, 205735_s_at, 227198_at, 242078_at, 243967_at",
"201000_at", "201884_at", "202779_s_at", "204565_at", "205355_at,226030_at",
"205808_at, 207284_s_at, 209135_at, 210896_s_at, 224996_at, 225008_at, 242037_at",
"206164_at, 206165_s_at, 206166_s_at, 217528_at",
"206527_at, 209459_s_at,209460_at", "209173_at, 228969_at", "215465_at",
"221024_s_at", "223864_at", "224742_at, 228123_s_at, 228124_at",
"225421_at,225431_x_at", "226120_at", "228241_at"),
symbol = c("AFF3", "AARS", "DNALI1", "PLK1", "ACOT13", "ACADSB", "LIMCH1",
"SLC7A8", "ABAT", "AGR2", "ABCA12", "TMEM144", "ANKRD30A", "ABHD12", "GALNT7",
"PSAT1", "AGR3"))
You can use the following solution:
library(dplyr)
library(stringr)
library(purrr)
df1 %>%
mutate(symbol = map_chr(probe, ~ df2$symbol[which(str_detect(df2$platprobe, .x))]))
probe OMIM symbol
1 1565034_s_at 601464 AFF3
2 201000_at 601065 /// 613287 /// 616339 AARS
3 204565_at 615652 ACOT13
4 205355_at 600301 /// 610006 ACADSB
5 205734_s_at 601464 AFF3
6 205735_s_at 601464 AFF3
7 206527_at 137150 /// 613163 ABAT
8 209173_at 606358 AGR2
9 209459_s_at 137150 /// 613163 ABAT
10 209460_at 137150 /// 613163 ABAT
11 215465_at ABCA12
12 223864_at 610856 ANKRD30A
13 224742_at 612674 /// 613599 ABHD12
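One caveat worth noting: map_chr() expects exactly one value per probe, so it stops with an error if a probe matches no row (or more than one row) of df2. The following is only a minimal sketch of one way to handle that, not part of the answer above. The toy data frames probes and lookup and the helper first_symbol() are hypothetical names made up for illustration; the helper returns NA when nothing matches and otherwise keeps the first hit, which is an assumption about what you would want.

```r
library(dplyr)
library(stringr)
library(purrr)

# Toy data (hypothetical): "999_at" has no match in the lookup table
probes <- data.frame(probe = c("201000_at", "205355_at", "999_at"),
                     stringsAsFactors = FALSE)
lookup <- data.frame(
  platprobe = c("201000_at", "205355_at,226030_at"),
  symbol    = c("AARS", "ACADSB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first matching symbol, or NA when nothing matches.
# fixed() treats the probe ID as a literal string rather than a regex.
first_symbol <- function(p) {
  hits <- lookup$symbol[str_detect(lookup$platprobe, fixed(p))]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

result <- probes %>% mutate(symbol = map_chr(probe, first_symbol))
result
```

Keeping only the first hit silently hides ambiguous probes, so if duplicates are possible in your data you may prefer to inspect them before choosing.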
While the answer above serves the purpose, here is how it can be done without purrr:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>% left_join(df2 %>% separate_rows(platprobe, sep = ',') %>%
mutate(platprobe = str_trim(platprobe)), by = c('probe' = 'platprobe'))
probe OMIM symbol
1 1565034_s_at 601464 AFF3
2 201000_at 601065 /// 613287 /// 616339 AARS
3 204565_at 615652 ACOT13
4 205355_at 600301 /// 610006 ACADSB
5 205734_s_at 601464 AFF3
6 205735_s_at 601464 AFF3
7 206527_at 137150 /// 613163 ABAT
8 209173_at 606358 AGR2
9 209459_s_at 137150 /// 613163 ABAT
10 209460_at 137150 /// 613163 ABAT
11 215465_at ABCA12
12 223864_at 610856 ANKRD30A
13 224742_at 612674 /// 613599 ABHD12
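Another way to approach your problem builds on your own observation that your matching keys are "collapsed" in the second data frame.
{tidyr} has a nice function for splitting nested values into new rows, namely tidyr::separate_rows(). This turns your second df into long format.
Note: separate_rows() can split several columns at once if needed, but here we only use your key, platprobe.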
library(dplyr) # data crunching
library(tidyr) # data manipulation for generating tidy df
# separate the nested column values into rows
# (sep is a regex; ",\\s*" also drops any spaces that follow a comma)
df2 %>% separate_rows(platprobe, sep = ",\\s*")
Inspect the resulting rows:
# A tibble: 33 x 2
platprobe symbol
<chr> <chr>
1 1565034_s_at AFF3
2 205734_s_at AFF3
3 242078_at AFF3
4 205735_s_at AFF3
5 201000_at AARS
...
You can now properly align the matching keys and perform a left_join() to merge the two data frames.
# merge the "long" lookup df2 with df1
df1 %>% left_join(
  df2 %>% separate_rows(platprobe, sep = ",\\s*")  # split on commas, dropping any spaces after them
  , by = c("probe" = "platprobe")  # define matching keys in df1 and df2
)
This gives:
probe symbol
1 1565034_s_at AFF3
2 201000_at AARS
3 204565_at ACOT13
4 205355_at ACADSB
...
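Since the join matches keys exactly, whitespace left over from the split matters. Here is a small sketch on toy data (the data frames probes and lookup are hypothetical, not the ones from the question) comparing a split on a bare comma with a split on a comma followed by optional whitespace:

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): note the space after the comma in platprobe
probes <- data.frame(probe = c("205355_at", "226030_at"),
                     stringsAsFactors = FALSE)
lookup <- data.frame(platprobe = "205355_at, 226030_at", symbol = "ACADSB",
                     stringsAsFactors = FALSE)

# Splitting on a bare comma keeps the leading space (" 226030_at"),
# so the second probe finds no partner in the join
naive <- probes %>%
  left_join(separate_rows(lookup, platprobe, sep = ","),
            by = c("probe" = "platprobe"))

# Splitting on a comma plus optional whitespace yields clean keys
clean <- probes %>%
  left_join(separate_rows(lookup, platprobe, sep = ",\\s*"),
            by = c("probe" = "platprobe"))

naive$symbol
clean$symbol
```

Trimming the key afterwards with stringr::str_trim() should be an equivalent alternative if you prefer to keep sep = ",".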
If you want to match using grep, you can do so via sapply or lapply.
df1$symbol <- df2$symbol[sapply(df1$probe, grep, df2$platprobe)]
df1
# probe OMIM symbol
#1 1565034_s_at 601464 AFF3
#2 201000_at 601065 /// 613287 /// 616339 AARS
#3 204565_at 615652 ACOT13
#4 205355_at 600301 /// 610006 ACADSB
#5 205734_s_at 601464 AFF3
#6 205735_s_at 601464 AFF3
#7 206527_at 137150 /// 613163 ABAT
#8 209173_at 606358 AGR2
#9 209459_s_at 137150 /// 613163 ABAT
#10 209460_at 137150 /// 613163 ABAT
#11 215465_at ABCA12
#12 223864_at 610856 ANKRD30A
#13 224742_at 612674 /// 613599 ABHD12
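Note that the sapply/grep approach relies on every probe matching exactly one row of df2; with zero or several matches sapply returns a list and the indexing fails. As a hedged alternative, here is a base R sketch that expands the collapsed keys with strsplit and then does an exact lookup with match. The toy data frames probes and lookup and the intermediate names ids and long are hypothetical, chosen only for this illustration.

```r
# Toy data (hypothetical): "999_at" has no match in the lookup table
probes <- data.frame(probe = c("205734_s_at", "201000_at", "999_at"),
                     stringsAsFactors = FALSE)
lookup <- data.frame(
  platprobe = c("1565034_s_at, 205734_s_at", "201000_at"),
  symbol    = c("AFF3", "AARS"),
  stringsAsFactors = FALSE
)

# Split each collapsed key into individual probe IDs
# (comma followed by optional whitespace)
ids <- strsplit(lookup$platprobe, ",\\s*")

# One row per probe ID, repeating the symbol for every ID in its group
long <- data.frame(
  probe  = unlist(ids),
  symbol = rep(lookup$symbol, lengths(ids)),
  stringsAsFactors = FALSE
)

# Exact lookup; probes without a match get NA instead of raising an error
probes$symbol <- long$symbol[match(probes$probe, long$probe)]
probes
```

Because match() compares whole strings, a probe ID that merely occurs inside a longer ID is not picked up by accident, which can happen with substring matching via grep.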