Keep the middle words in a phrase separated by dashes in R using gsub

Question

I have the following:

x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

I want to keep "Paulista", "Mineiro", "Carioca".

I am trying gsub:

y <- gsub("\\$-*","",x)

But it does not work.

Answer 1

We can do this with a single call to sub:

x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
[1] "Paulista" "Mineiro"  "Carioca"

The idea is to capture whatever occurs between the two dashes in each string.

^.*-\s+   from the start, consume everything up to and including the first dash
(.*?)      then match and capture everything up until the second dash
\s+-.*$   consume everything after and including the second dash
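
As a rough check of the breakdown above, the sketch below runs the same pattern with perl = TRUE (an assumption on my part, chosen so the lazy quantifier follows PCRE rules) and uses regexec with regmatches to look at the capture group on its own. The variable names are illustrative only.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

# Note the doubled backslashes: R string literals need "\\s" to produce the regex \s
pattern <- "^.*-\\s+(.*?)\\s+-.*$"

# Replace the whole match with the first capture group
middle <- sub(pattern, "\\1", x, perl = TRUE)
print(middle)

# regexec() + regmatches() give the full match followed by each capture group
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))
captured <- vapply(parts, function(p) p[2], character(1))
print(captured)
```

Both middle and captured should hold the same three words; the second form can be handy when the pattern has more than one group.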

Answer 2

Two quick approaches:

x <- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

The first is a standard sub solution; if any string has no hyphens, it returns that string unmodified.

trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro"  "Carioca"

Inside the sub call:

"^[^-]*-([^-]*)-.*$"
 ^                   beginning of each string, avoids mid-string matches
  [^-]*              matches 0 or more non-hyphen characters
       -             literal hyphen
        ([^-]*)      matches and stores 0 or more non-hyphen characters
               -     literal hyphen
                .*   0 or more of anything (incl hyphens)
                  $  end of each string

"\\1"               replace everything that matches with the stored substring

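To illustrate the fallback behavior of the sub approach, here is a small sketch with one extra input that has no hyphens. The value "Bahia" and the variable names are made up for illustration. If you would rather get NA than the untouched string, one option is to check for a match with grepl first.

```r
x2 <- c("Minas Gerais - Mineiro - MG", "Bahia")
pattern <- "^[^-]*-([^-]*)-.*$"

# Strings that do not match the pattern pass through sub() unchanged
res <- trimws(sub(pattern, "\\1", x2))
print(res)

# Optional: mark the non-matching inputs as missing instead
res_na <- res
res_na[!grepl(pattern, x2)] <- NA_character_
print(res_na)
```
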
The next splits each string on "-" into a list, then indexes the second element. If any string has no hyphens, this fails with a "subscript out of bounds" error.

trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro"  "Carioca"

A sample call to strsplit:

strsplit(x[[1]], "-")
# [[1]]
# [1] " Sao Paulo " " Paulista "  " SP"

... so the second element is Paulista (with extra leading/trailing spaces). The surrounding sapply always grabs the second element (which raises an error when a string has no hyphen).

Both solutions use trimws to remove leading and trailing whitespace.
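
If the "subscript out of bounds" error is a concern, a guarded variant of the strsplit approach might look like the sketch below. The helper second_piece and the input "No hyphen here" are hypothetical, added only for illustration; vapply is used in place of sapply so the result is always a character vector.

```r
x3 <- c(" Sao Paulo - Paulista - SP",
        "Minas Gerais - Mineiro - MG",
        "No hyphen here")

# Hypothetical helper: return the second piece, or NA if there is none
second_piece <- function(pieces) {
  if (length(pieces) >= 2) pieces[[2]] else NA_character_
}

# fixed = TRUE treats "-" as a literal string rather than a regex
pieces <- strsplit(x3, "-", fixed = TRUE)
result <- trimws(vapply(pieces, second_piece, character(1)))
print(result)
```

Strings without a hyphen come back as NA here rather than stopping the whole call.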
