R: Replace string when partial match to another column by row
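I want to replace/remove the parts of a string (name) that match other columns (state and city) in my data table.
I managed to identify the rows, e.g. for city, like this:
dt %>% filter(str_detect(name, city))
But I am missing a way to use gsub (or grep) with the row values of the city column.
I know that a rather manual approach, such as storing all city names in a vector and feeding them to gsub, would work, but it would also wrongly remove the "dallas" in row 2. (That would be manageable for states, though, and could be combined with gsub to remove the "of" as well.)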
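To make the pitfall concrete, here is a minimal base R sketch of that manual approach. It uses only the city and name values from the example data; all_cities and global_result are names introduced here for illustration.

```r
# Values copied from the example data in this question.
city <- c("arecibo", "arecibo", "cabo rojo", "new york", "dallas")
name <- c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant",
          "greens new york", "cowboy shoes dallas tx")

# One global pattern built from every city: "arecibo|cabo rojo|new york|dallas"
all_cities <- paste(unique(city), collapse = "|")

# gsub() applies the same pattern to every row, so "dallas" is stripped from
# row 2 even though that row's city is "arecibo".
global_result <- trimws(gsub(all_cities, "", name))
global_result[2]
```

Row 2 loses its leading "dallas" here, which is why the pattern has to be built per row rather than once for the whole column.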
Data and desired output
library(data.table)

dt <- data.table(city = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas"),
                 state = c("pr", "pr", "pr", "ny", "tx"),
                 name = c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"),
                 desired = c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))
Here is one solution, although a gsub-based approach would probably get there faster. Anyway:
library(tidyverse)
dt %>%
  mutate(test = str_remove_all(name, city)) %>%
  mutate(test = str_remove_all(test, paste(" of ", state, sep = ""))) %>%
  mutate(test = str_remove_all(test, state)) %>%
  mutate(test = str_remove_all(test, "^ ")) %>%
  mutate(test = str_remove_all(test, " *$"))
Output:
        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes
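The gsub route mentioned above could look roughly like this. It is only a base R sketch of the same four steps, not a benchmarked replacement: rm_by_row is a hypothetical helper, and mapply pairs each pattern with the string in the same position.

```r
# Values copied from the example data.
city  <- c("arecibo", "arecibo", "cabo rojo", "new york", "dallas")
state <- c("pr", "pr", "pr", "ny", "tx")
name  <- c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant",
           "greens new york", "cowboy shoes dallas tx")

# Hypothetical helper: remove pattern[i] from text[i], one pair per row.
rm_by_row <- function(text, pattern) {
  mapply(function(p, x) gsub(p, "", x), pattern, text, USE.NAMES = FALSE)
}

test <- rm_by_row(name, city)                   # drop the row's own city
test <- rm_by_row(test, paste0(" of ", state))  # drop " of <state>"
test <- rm_by_row(test, state)                  # drop a bare state code
test <- trimws(test)                            # strip leading/trailing spaces
test
```

On the example data this gives the same values as the desired column. Whether it is actually faster than the stringr version would need to be measured.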
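With dplyr, we can use rowwise. First, collapse all the words to remove into a single character element joined by the OR metacharacter (e.g. 'arecibo|pr|of'), then call str_remove_all with that pattern.
Finally, trim the remaining whitespace.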
library(dplyr)
library(stringr)
dt %>%
  rowwise() %>%
  mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|')) %>%
           trimws())
# A tibble: 5 × 5
# Rowwise:
  city      state name                   desired       desired_2
  <chr>     <chr> <chr>                  <chr>         <chr>
1 arecibo   pr    frutas of pr arecibo   frutas        frutas
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant
4 new york  ny    greens new york        greens        greens
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes
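One thing to watch with an OR pattern like 'arecibo|pr|of' is that short alternatives also match inside longer words. The sketch below is a possible variant, not part of the answer above: it wraps each row's alternatives in word boundaries and relies on str_remove_all being vectorised over the pattern, so rowwise is not needed. The first name value is made up to show the problem, and loose, strict_pattern and strict are illustrative names. It assumes the city and state values contain no regex metacharacters.

```r
library(stringr)

city  <- c("arecibo", "arecibo")
state <- c("pr", "pr")
# First value is invented for this example; the second is from the question.
name  <- c("produce of pr arecibo", "dallas frutas of pr")

# Without word boundaries, "pr" is also removed from inside "produce".
loose <- trimws(str_remove_all(name, paste(city, state, "of", sep = "|")))

# Wrap each row's alternatives in \\b(...)\\b so only whole words match.
# paste0() is vectorised, so this builds one pattern per row.
strict_pattern <- paste0("\\b(", city, "|", state, "|of)\\b")

# str_squish() trims the ends and collapses repeated inner spaces.
strict <- str_squish(str_remove_all(name, strict_pattern))

loose   # "oduce"   "dallas frutas"
strict  # "produce" "dallas frutas"
```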
A data.table solution:
# Helper function
subxy <- function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)
dt[, desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]
#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes
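Two details of the helper may matter on other data: sub treats the column value as a regular expression and only removes the first match. If the values should be taken literally and removed everywhere, a variant along these lines might be safer. This is a sketch under that assumption; subxy_fixed and the "st. louis" rows are made up for illustration.

```r
# Hypothetical variant of subxy: literal matching, all occurrences, no names.
subxy_fixed <- function(string, rmv) {
  mapply(function(x, y) gsub(x, "", y, fixed = TRUE), rmv, string, USE.NAMES = FALSE)
}

# Invented rows: the city contains ".", which is a wildcard in a regex.
name <- c("stx louis diner", "st. louis diner st. louis")
city <- c("st. louis", "st. louis")

# Regex + sub(): "." matches the "x", and only the first match is removed.
regex_first <- mapply(function(x, y) sub(x, "", y), city, name, USE.NAMES = FALSE)

# Literal + gsub(): "stx louis" is left alone, both "st. louis" are removed.
literal_all <- subxy_fixed(name, city)

trimws(regex_first)  # "diner"           "diner st. louis"
trimws(literal_all)  # "stx louis diner" "diner"
```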