substring() with () in the vector

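I want to split this column into 2 columns, using the first parenthesis as the separator. I used word(x, 2, sep = "(") but I got an error. I know R does not like a parenthesis as sep. I want to use "(" as the sep because the data are recorded inconsistently: on some rows there is a space between the country and the state, and on others there is not.

How can I solve this? Thanks.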

x <- c("United States (Alabama) ", "United States (California) ", 
"United States (California) ", "United States (California) ", 
"United States (California) ", "United States (Colorado) ", 
"United States (Colorado) ", "United States (Colorado) ", 
"United States(Connecticut) ", "United States(Connecticut) "
)
word(x, 2, sep = "(")

Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN, context=()

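I assume you are using stringr::word? If so, the sep argument is interpreted as a regular expression. You can escape the "(" like this (in an R string literal the backslash itself must be doubled):

word(x, 2, sep = "\\(")

Or you can use the fixed function, which treats the separator as a literal string: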

word(x, 2, sep = fixed("("))
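The same literal-versus-regex distinction exists in base R, so here is a minimal sketch that avoids stringr altogether. It assumes every element contains exactly one "(...)" group, as in the sample data; the names parts, country and state are only illustrative.

```r
x <- c("United States (Alabama) ", "United States(Connecticut) ")

# fixed = TRUE treats "(" as a literal character, not as regex syntax
parts <- strsplit(x, "(", fixed = TRUE)

# First piece is the country, second piece is the state plus ") "
country <- trimws(sapply(parts, `[`, 1))
state   <- trimws(sub(")", "", sapply(parts, `[`, 2), fixed = TRUE))

data.frame(country, state)
```

Because the split is on the parenthesis itself, it does not matter whether a space precedes it; trimws cleans up whatever whitespace is left.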

I think you could try strsplit + gsub as below

# Split each string at "(", strip any remaining parentheses, then trim whitespace
trimws(
  gsub(
    "\\(|\\)",
    "",
    do.call(
      rbind,
      strsplit(x,
        "((?<=\\s)\\()|(?=)\\(",
        perl = TRUE
      )
    )
  )
)

This gives

      [,1]            [,2]
 [1,] "United States" "Alabama"
 [2,] "United States" "California"
 [3,] "United States" "California"
 [4,] "United States" "California" 
 [5,] "United States" "California"
 [6,] "United States" "Colorado"
 [7,] "United States" "Colorado"
 [8,] "United States" "Colorado"
 [9,] "United States" "Connecticut"
[10,] "United States" "Connecticut"

Another approach is to use the regular expression \(([^\)]+)\), which captures the state (assuming that is what you are after).

library(gsubfn)
strapplyc(x, pattern = "\\(([^\\)]+)\\)")

which produces the list

[[1]]
[1] "Alabama"

[[2]]
[1] "California"

[[3]]
[1] "California"

[[4]]
[1] "California"

[[5]]
[1] "California"

[[6]]
[1] "Colorado"

[[7]]
[1] "Colorado"

[[8]]
[1] "Colorado"

[[9]]
[1] "Connecticut"

[[10]]
[1] "Connecticut"
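If installing gsubfn is not an option, the same capture-group idea can be sketched in base R with regexec and regmatches. This is only an illustration on a shortened version of the data; the names m and state are made up for the example, and it returns a character vector rather than a list.

```r
x <- c("United States (Alabama) ", "United States(Connecticut) ")

# regexec() locates the whole match and each capture group;
# regmatches() then extracts them as character vectors
m <- regmatches(x, regexec("\\(([^)]+)\\)", x))

# Element 1 of each result is the full match, element 2 is the first capture group
state <- vapply(m, function(hit) hit[2], character(1))
state
```

Rows without a parenthesised part would come back as NA here, which may or may not be what you want.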

Here is a sub/scan approach.

matrix(trimws(scan(what = character(), text = sub("\\)", "", x), sep = "(")), ncol = 2, byrow = TRUE)
#Read 20 items
#      [,1]            [,2]         
# [1,] "United States" "Alabama"    
# [2,] "United States" "California" 
# [3,] "United States" "California" 
# [4,] "United States" "California" 
# [5,] "United States" "California" 
# [6,] "United States" "Colorado"   
# [7,] "United States" "Colorado"   
# [8,] "United States" "Colorado"   
# [9,] "United States" "Connecticut"
#[10,] "United States" "Connecticut"

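Explanation

The code above uses scan to split each string in two at the sep character. Because sep must be a single character, only one of "(" and ")" can serve as the separator. I chose "(", so the other one, ")", is removed beforehand with sub. The rest is straightforward.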

The sequence of instructions is:

tmp <- sub("\\)", "", x)
tmp <- scan(what = character(), text = tmp, sep = "(")
tmp <- trimws(tmp)
matrix(tmp, ncol = 2, byrow = TRUE)