Keep the middle words in a phrase separated by dashes in R using gsub
I have the following:
x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")
I want to keep "Paulista", "Mineiro", "Carioca".
I am trying gsub:
y <- gsub("\\$-*","",x)
But it doesn't work.
We can do this with a single call to sub:
x <- c(" Sao Paulo - Paulista - SP",
"Minas Gerais - Mineiro - MG",
"Rio de Janeiro - Carioca -RJ")
sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
[1] "Paulista" "Mineiro" "Carioca"
The idea is to capture whatever occurs between the two dashes in each string.
^.*-\s+ from the start, consume everything up to and including the first dash
(.*?) then match and capture everything up until the second dash
\s+-.*$ consume everything after and including the second dash
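As a rough cross-check of the breakdown above, the same pattern can be used to extract the capture group directly instead of substituting. This is only a sketch: the fourth input string and the variable names pattern, m and middle are made up for illustration, and perl = TRUE is an assumption added here so that the lazy (.*?) is handled by PCRE.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ",
       "no dashes at all")  # hypothetical extra input with no match

pattern <- "^.*-\\s+(.*?)\\s+-.*$"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# element 1 is the full match, element 2 is the first capture group;
# strings that do not match yield character(0), so indexing [2] gives NA
middle <- vapply(m, function(hit) hit[2], character(1))
middle
# [1] "Paulista" "Mineiro"  "Carioca"  NA
```

Unlike sub, which hands back an unmatched string unchanged, this variant marks non-matching inputs with NA.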
Two quick approaches:
x <- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")
The first is a standard sub solution; if any string has no hyphens, it is returned in full, unmodified.
trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro" "Carioca"
Inside the sub:
"^[^-]*-([^-]*)-.*$"
^ beginning of each string, avoids mid-string matches
[^-]* matches 0 or more non-hyphen characters
- literal hyphen
([^-]*) matches and stores 0 or more non-hyphen characters
- literal hyphen
.* 0 or more of anything (incl hyphens)
$ end of each string
"\\1" replace everything that matches with the stored substring
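To illustrate the "returned unmodified" behaviour, here is a small sketch. The second input string and the names pattern and middle are hypothetical, and replacing unmatched results with NA is an optional extra step, not part of the answer above.

```r
x <- c(" Sao Paulo - Paulista - SP", "No hyphens here")  # second string is made up
pattern <- "^[^-]*-([^-]*)-.*$"

middle <- trimws(sub(pattern, "\\1", x))
middle
# [1] "Paulista"        "No hyphens here"

# optional: flag inputs that never matched, instead of keeping them as-is
middle[!grepl(pattern, x)] <- NA_character_
middle
# [1] "Paulista" NA
```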
The next one splits each string on "-" into a list, then indexes the second element. If any string has no hyphens, this fails with a subscript out of bounds error.
trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro" "Carioca"
An example call to strsplit:
strsplit(x[[1]], "-")
# [[1]]
# [1] " Sao Paulo " " Paulista " " SP"
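... so the second element is Paulista (with extra leading/trailing whitespace). The surrounding sapply always grabs the second element (which is what raises the error when a string does not match).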
Both solutions use trimws to strip leading and trailing whitespace.
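If the out-of-bounds error is a concern, one possible variation (a sketch, not part of the answer above) is to index with single brackets, which gives NA for a missing element instead of failing. The third input string and the names parts, second and result are hypothetical.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Rio de Janeiro - Carioca -RJ",
       "No hyphens here")  # made-up input with nothing to split on

# fixed = TRUE treats "-" as a literal string rather than a regex
parts <- strsplit(x, "-", fixed = TRUE)

# p[2] returns NA when there is no second piece, whereas p[[2]] would error
second <- vapply(parts, function(p) p[2], character(1))

result <- trimws(second)
result
# [1] "Paulista" "Carioca"  NA
```

vapply is used here instead of sapply only so that the return type is checked; sapply with `[` would behave the same way on this input.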