Keep the middle words in a phrase separated by dashes in R using gsub

I have the following:

x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

I want to keep "Paulista", "Mineiro", "Carioca".

I am trying gsub:

y <- gsub("\\$-*","",x)

But it doesn't work.

We can do this with a single call to sub:

x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
[1] "Paulista" "Mineiro"  "Carioca"

The idea is to capture whatever occurs between the two dashes in each string.

^.*-\\s+   from the start, consume everything up to and including the first dash and the whitespace after it
(.*?)      then match and capture everything up until the whitespace before the second dash
\\s+-.*$   consume everything after and including the second dash
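
To look at what the capture group holds without replacing anything, one option is base R's regexec together with regmatches. This is only a sketch: the names pattern, m, and middle are illustrative, and perl = TRUE is an assumption added here so the lazy quantifier follows PCRE semantics.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

# Same pattern as in the sub() call; backslashes are doubled inside an R string.
pattern <- "^.*-\\s+(.*?)\\s+-.*$"

# regexec() records the full match and each capture group;
# regmatches() turns those positions into the matched text.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 of each result is the whole match, element 2 is the first group.
# A string that does not match yields NA here.
middle <- vapply(m, function(parts) parts[2], character(1))
middle
# [1] "Paulista" "Mineiro"  "Carioca"
```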

Two quick approaches:

x <- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

The first is a standard sub solution; if a string has no hyphens, it is returned in full, unmodified.

trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro"  "Carioca" 
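
As a quick check of the "returned unmodified" behavior, here is a small sketch using two made-up inputs ("No hyphen here" and "A - B" are illustrative, not from the question). Because the pattern needs two hyphens to match, a string with only one hyphen is left alone as well.

```r
# The second and third strings are hypothetical test inputs.
x2 <- c(" Sao Paulo - Paulista - SP", "No hyphen here", "A - B")

# sub() only rewrites strings the pattern matches; the rest pass through as-is.
res <- trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x2))
res
# "Paulista", then "No hyphen here" and "A - B" unchanged
```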

Inside sub:

"^[^-]*-([^-]*)-.*$"
 ^                   beginning of each string, avoids mid-string matches
  [^-]*              matches 0 or more non-hyphen characters
       -             literal hyphen
        ([^-]*)      matches and stores 0 or more non-hyphen characters
               -     literal hyphen
                .*   0 or more of anything (incl hyphens)
                  $  end of each string

"\\1"               replace everything that matches with the stored substring

The next one splits each string on "-" into a list, then indexes the second element. If a string has no hyphens, this fails with a "subscript out of bounds" error.

trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro"  "Carioca" 

Example call to strsplit:

strsplit(x[[1]], "-")
# [[1]]
# [1] " Sao Paulo " " Paulista "  " SP"        

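... so the second element is " Paulista " (with extra leading/trailing whitespace). The surrounding sapply always grabs the second element, which is what raises the error when a string contains no hyphen.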

Both solutions use trimws to strip leading and trailing whitespace.
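
If the strsplit approach has to tolerate strings without a hyphen, one possible variation is to check the number of pieces before indexing. This is a sketch rather than part of the answer above: returning NA for such strings is an assumption about the desired behavior, and the names parts and second are illustrative.

```r
# "No hyphen here" is a hypothetical input added to trigger the edge case.
x3 <- c(" Sao Paulo - Paulista - SP", "No hyphen here")

# fixed = TRUE treats "-" as a literal string rather than a regex.
parts <- strsplit(x3, "-", fixed = TRUE)

# Take the second piece only when it exists; otherwise return NA
# instead of failing with "subscript out of bounds".
second <- vapply(
  parts,
  function(p) if (length(p) >= 2) trimws(p[2]) else NA_character_,
  character(1)
)
second
# [1] "Paulista" NA
```

vapply is used here instead of sapply so the result is always a character vector, even for empty input.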