R: Replace string when partial match to another column by row

I would like to replace/remove those parts of a string (name) that match other columns (state, city) in my data.table.

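I managed to identify the matching rows, e.g. for city, like this: dt %>% filter(str_detect(name, city)). What I am missing is a way to use gsub (or grep) with the row-wise value of the city column.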

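I know that a rather manual approach, such as storing all city names in a vector and feeding them into gsub, would work, but it would also wrongly remove "dallas" from row 2. (That approach would be workable for the states, though, and could be combined with gsub to remove the "of".)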
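To make that pitfall concrete, here is a minimal sketch of the manual approach, using the name values from the example data. The variable names (cities, all_cities, global_result) are only illustrative.

```r
# Names taken from the question's example data
name <- c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant",
          "greens new york", "cowboy shoes dallas tx")
cities <- c("arecibo", "cabo rojo", "new york", "dallas")

# One global pattern built from every city: "arecibo|cabo rojo|new york|dallas"
all_cities <- paste(cities, collapse = "|")

# gsub() applies the same pattern to every element, regardless of the row's own city
global_result <- trimws(gsub(all_cities, "", name))
global_result[2]
# "frutas of pr": "dallas" is removed although row 2's city is "arecibo"
```

Because the pattern is not tied to the row, any city that appears anywhere in name is stripped, which is why a row-wise match is needed.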


Data and desired output

library(data.table)

dt <- data.table(city = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas"),
state = c("pr", "pr", "pr", "ny", "tx"),
name = c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"),
desired = c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))

Here is one solution, although it could probably be achieved faster with a gsub approach. Anyway:

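library(tidyverse)

# str_remove_all() is vectorised over both string and pattern,
# so each name is matched against the city/state of its own row
dt %>%
  mutate(test = str_remove_all(name, city)) %>%                       # the row's city
  mutate(test = str_remove_all(test, paste(" of ", state, sep = ""))) %>%  # " of <state>"
  mutate(test = str_remove_all(test, state)) %>%                      # any remaining state
  mutate(test = str_remove_all(test, "^ ")) %>%                       # leading space
  mutate(test = str_remove_all(test, " *$"))                          # trailing spaces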

Output:

        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

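With dplyr, we can use rowwise. First, collapse all the words to be removed into a single character element joined by the OR metacharacter (e.g. 'arecibo|pr|of'), then call str_remove_all with that pattern. Finally, trim the leftover leading and trailing whitespace with trimws.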

library(dplyr)
library(stringr)

dt %>%
    rowwise()%>%
    mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|'))%>%
               trimws())

# A tibble: 5 × 5
# Rowwise: 
  city      state name                   desired       desired_2    
  <chr>     <chr> <chr>                  <chr>         <chr>        
1 arecibo   pr    frutas of pr arecibo   frutas        frutas       
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant  
4 new york  ny    greens new york        greens        greens       
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes 
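One caveat: the alternation pattern matches substrings, so short tokens such as 'pr' or 'of' would also be removed from inside longer words. A possible variation, sketched below on made-up data (df, pattern and cleaned are hypothetical names, not part of the question), wraps the alternatives in \\b word boundaries and uses str_squish to collapse the inner whitespace as well. It assumes the city and state values contain no regex metacharacters.

```r
library(dplyr)
library(stringr)

# Made-up data in which the short tokens also occur inside longer words
df <- data.frame(
  city  = c("arecibo", "new york"),
  state = c("pr", "ny"),
  name  = c("express office of pr arecibo", "sunny greens new york ny")
)

res <- df %>%
  rowwise() %>%
  mutate(
    # \\b anchors each alternative at word boundaries,
    # e.g. "\\b(arecibo|pr|of)\\b" for the first row
    pattern = paste0("\\b(", paste(c(city, state, "of"), collapse = "|"), ")\\b"),
    # str_squish() trims both ends and collapses repeated inner spaces
    cleaned = str_squish(str_remove_all(name, pattern))
  ) %>%
  ungroup()

res$cleaned
# "express office" "sunny greens"
```

Without the boundaries, the same rows would lose the "pr" in "express", the "of" in "office" and the "ny" in "sunny".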

A data.table solution:

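# Helper function: for each element, remove the first match of rmv[i] from string[i]
# (rmv is treated as a regular expression; a length-1 rmv is recycled)
subxy <- function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)

# The native pipe |> requires R >= 4.1
dt[, desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]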

#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes
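Note that sub treats the values to remove as regular expressions and only replaces the first match. If the city or state values could contain characters such as "." or "+", or occur more than once, a literal variant may be safer. The following is only a sketch: remove_fixed is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: remove every literal occurrence of rmv[i] from string[i]
remove_fixed <- function(string, rmv) {
  unname(mapply(function(pattern, text) gsub(pattern, "", text, fixed = TRUE),
                rmv, string))
}

# Made-up values containing regex metacharacters and repeated matches
name <- c("st. louis bbq st. louis", "a+b foods a+b")
city <- c("st. louis", "a+b")

result <- trimws(remove_fixed(name, city))
result
# "bbq" "foods"
```

With fixed = TRUE the pattern is matched character for character, so "a+b" is removed as written instead of being read as "one or more a, then b".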