Matching words from vectors of strings in R
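I am trying to clean up a database by matching a messy list of site names against an approved list.
For example, the preferred site name might be 'Cotswold Water Park Pit 28', but the site has been entered in the database as 'Pit 28', '28', 'CWP Pit 28' and 'Cotswold 28'.
The data looks like this: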
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
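I am looking for a way to match the words/numbers (clusters of non-space characters) in each element of messy against the words/numbers in each element of approved. Ideally, I would end up with something like this: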
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "CWP Pit 28" NA
[3,] "CWP Pit 28" "14" NA
[4,] "Cotswold 28" NA NA
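The approved elements form the column names, and any element of messy that contains a matching word/number appears in a cell of that column. I know there will be some false matches. That is fine: I can filter them out manually later, and possibly exclude common words such as 'forest' and 'hill' from the pattern matching.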
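By using regex to split each element of messy I have been able to get the result I want with the example data above, but I then end up working with a list derived from the list of site names, and I have had to use nested loops or sapply to match its entries against the approved elements, because functions like grep and grepl only accept a single pattern. Because the database is large, this already takes a very long time when I apply it to the whole thing. What I would really like is a function along the lines of: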
match(any word in approved[1], any word in messy[1])
that either gives me TRUE/FALSE output or, better still, extracts messy[1] when there is a match!
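As a minimal sketch of the kind of function described above (this is not taken from any of the answers below, and shares_word is a hypothetical name), two strings can be compared by splitting both on whitespace and checking whether the word sets intersect. The sketch assumes matching should be case-insensitive:

```r
# Hypothetical helper: TRUE when two strings share at least one
# whitespace-delimited word (compared case-insensitively).
shares_word <- function(a, b) {
  words_a <- strsplit(tolower(trimws(a)), "\\s+")[[1]]
  words_b <- strsplit(tolower(trimws(b)), "\\s+")[[1]]
  length(intersect(words_a, words_b)) > 0
}

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Single comparison, as in the pseudocode above
shares_word(approved[1], messy[1])

# All messy names against all approved names: a logical matrix with
# one row per messy name and one column per approved name.
hits <- outer(messy, approved, Vectorize(shares_word))
dimnames(hits) <- list(messy, approved)
hits
```

This still performs one comparison per pair, so on a large database it is mainly useful for checking results rather than as a fast solution.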
A tidyverse/tidytext solution
First, load the packages and define the two vectors; they are turned into data frames in the next step.
library(tidyverse)
library(tidytext)
## approved site names
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
## messy site names as entered in the database
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
Then use tidytext to split them so that there is one word per row. I like to add an ID whenever the number of rows changes...
## split into words
approved_df <-
tibble(approved = approved) %>%
rownames_to_column('approved_id') %>%
unnest_tokens(words, approved, 'words', drop = FALSE)
approved_df %>% head
# A tibble: 6 x 3
# approved_id approved words
# <chr> <chr> <chr>
# 1 1 Cotswold Water Park Pit 28 cotswold
# 2 1 Cotswold Water Park Pit 28 water
# 3 1 Cotswold Water Park Pit 28 park
# 4 1 Cotswold Water Park Pit 28 pit
# 5 1 Cotswold Water Park Pit 28 28
# 6 2 Cotswold Water Park Pit 14 cotswold
messy_df <-
tibble(messy = messy) %>%
rownames_to_column('messy_id') %>%
unnest_tokens(words, messy, 'words', drop = FALSE)
messy_df %>% head
# # A tibble: 6 x 3
# messy_id messy words
# <chr> <chr> <chr>
# 1 1 Pit 28 pit
# 2 1 Pit 28 28
# 3 2 28 28
# 4 3 CWP Pit 28 cwp
# 5 3 CWP Pit 28 pit
# 6 3 CWP Pit 28 28
Finally, join the two data frames at the word level, count how many words overlap, and assign each messy string the approved name with the largest overlap.
## join the data sets and rank by the number of words in the overlap
messy_df %>% left_join(approved_df) %>%
group_by(messy, messy_id, approved, approved_id) %>%
summarise(n_row = n()) %>%
ungroup %>%
group_by(messy, messy_id) %>%
mutate(approved_rank = rank(desc(n_row))) %>%
ungroup %>%
filter(approved_rank == 1) %>%
arrange(messy_id)
# Joining, by = "words"
# # A tibble: 6 x 6
# messy messy_id approved approved_id n_row approved_rank
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 Pit 28 1 Cotswold Water Park Pit 28 1 2 1
# 2 28 2 Cotswold Water Park Pit 28 1 1 1
# 3 CWP Pit 28 3 Cotswold Water Park Pit 28 1 2 1
# 4 Cotswold 28 4 Cotswold Water Park Pit 28 1 2 1
# 5 14 5 Cotswold Water Park Pit 14 2 1 1
# 6 Robinswood 6 Robinswood Hill 3 1 1
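The same tokenise-then-join idea can be sketched in base R for cases where the extra packages are not available. This is an illustration under assumptions rather than a drop-in replacement: tokenise is a hypothetical helper, words are split on whitespace only, and ties between approved names are all kept.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Hypothetical helper: one row per (name, lower-cased word) pair.
tokenise <- function(x, name) {
  words <- strsplit(tolower(x), "\\s+")
  out <- data.frame(rep(x, lengths(words)), unlist(words),
                    stringsAsFactors = FALSE)
  names(out) <- c(name, "words")
  out
}

# Inner join on the shared word, one row per matching word.
pairs <- merge(tokenise(messy, "messy"), tokenise(approved, "approved"),
               by = "words")
pairs$n <- 1L

# Count overlapping words for every (messy, approved) combination.
overlap <- aggregate(n ~ messy + approved, data = pairs, FUN = sum)

# Keep, for each messy name, the approved name(s) with the largest overlap.
best <- overlap[overlap$n == ave(overlap$n, overlap$messy, FUN = max), ]
best[order(match(best$messy, messy)), ]
```

Because the join is done once on the word column instead of once per pair of names, this kind of approach tends to scale better than nested loops.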
I am not sure whether my attempt below fits your purpose
res <- within(
expand.grid(messy, approved),
matched <- do.call(
function(...) lengths(mapply(intersect, ...)) > 0,
unname(expand.grid(strsplit(messy, " "), strsplit(approved, " ")))
)
)
which gives
Var1 Var2 matched
1 Pit 28 Cotswold Water Park Pit 28 TRUE
2 28 Cotswold Water Park Pit 28 TRUE
3 CWP Pit 28 Cotswold Water Park Pit 28 TRUE
4 Cotswold 28 Cotswold Water Park Pit 28 TRUE
5 14 Cotswold Water Park Pit 28 FALSE
6 Robinswood Cotswold Water Park Pit 28 FALSE
7 Pit 28 Cotswold Water Park Pit 14 TRUE
8 28 Cotswold Water Park Pit 14 FALSE
9 CWP Pit 28 Cotswold Water Park Pit 14 TRUE
10 Cotswold 28 Cotswold Water Park Pit 14 TRUE
11 14 Cotswold Water Park Pit 14 TRUE
12 Robinswood Cotswold Water Park Pit 14 FALSE
13 Pit 28 Robinswood Hill FALSE
14 28 Robinswood Hill FALSE
15 CWP Pit 28 Robinswood Hill FALSE
16 Cotswold 28 Robinswood Hill FALSE
17 14 Robinswood Hill FALSE
18 Robinswood Robinswood Hill TRUE
If you want the output in the form shown in your post, you can play some further tricks on res, for example:
res2 <- do.call(
cbind,
lapply(
u <- with(subset(res, matched), split(Var1, Var2)),
function(x) `length<-`(as.vector(x), max(lengths(u)))
)
)
such that
> res2
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "CWP Pit 28" NA
[3,] "CWP Pit 28" "Cotswold 28" NA
[4,] "Cotswold 28" "14" NA
Maybe you are looking for adist:
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
i <- x == apply(x, 1, min)
y[!i] <- NA
colnames(y) <- approved
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "14" "Robinswood"
#[2,] "28" NA NA
#[3,] "CWP Pit 28" NA NA
#[4,] "Cotswold 28" NA NA
A base R option is:
result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
#$`Cotswold Water Park Pit 14`
#[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
#$`Robinswood Hill`
#[1] "Robinswood"
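The logic here is that we replace each run of whitespace in approved with a pipe (|) symbol and return the elements of messy in which any of the words matches.
To get the output in the same format as shown, we can do: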
sapply(result, `[`, 1:max(lengths(result)))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "Pit 28" "Robinswood"
#[2,] "28" "CWP Pit 28" NA
#[3,] "CWP Pit 28" "Cotswold 28" NA
#[4,] "Cotswold 28" "14" NA
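One caveat with the alternation pattern is that it matches substrings, so '28' would also hit a name such as '128'. A possible refinement, sketched here under assumptions, wraps the alternation in word boundaries and drops common words before building the pattern. The stop_words vector, the make_pattern helper and the two extra messy entries are hypothetical, and the sketch assumes the site names contain no regex metacharacters and that at least one word survives the stop-word filter.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Hypothetical extra entries that should NOT match any approved name
messy_extra <- c(messy, "128", "Box Hill")

# Hypothetical list of common words to ignore when matching
stop_words <- c("hill", "forest")

# Build "\\b(word1|word2|...)\\b" from one approved name, skipping stop words.
make_pattern <- function(x) {
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[!tolower(words) %in% stop_words]
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

result <- lapply(setNames(approved, approved), function(x) {
  grep(make_pattern(x), messy_extra, value = TRUE,
       ignore.case = TRUE, perl = TRUE)
})
result
```

With the boundaries in place, '128' is no longer picked up by '28', and 'Box Hill' is no longer picked up by 'Robinswood Hill' once 'hill' is excluded.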
Here is a highly flexible regex_join solution
library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]
# Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
# 1: 14 <NA> <NA>
# 2: <NA> 28 <NA>
# 3: CWP Pit 28 CWP Pit 28 <NA>
# 4: Cotswold 28 Cotswold 28 <NA>
# 5: Pit 28 Pit 28 <NA>
# 6: <NA> <NA> Robinswood
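Here is one possibility using stringi (faster than stringr, and usually faster than the base R regex operations). This solution returns a list, which should be preferable to a matrix when the lengths vary.
library(stringi)
# turn each messy name into an alternation of its words, e.g. "CWP|Pit|28"
messy_ors <- stri_replace_all_fixed(messy, " ", "|")
# name the input so the resulting list is named by approved site
out <- lapply(setNames(approved, approved), function(x) messy[stri_detect(x, regex = messy_ors)])
out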
$`Cotswold Water Park Pit 28`
[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
$`Cotswold Water Park Pit 14`
[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
$`Robinswood Hill`
[1] "Robinswood"
If you really need a matrix, you can convert the output with something like:
n <- max(lengths(out))
sapply(out, function(x) x[1:n])