substring() with () in the vector
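I want to split this column into 2 columns, using the first parenthesis as the separator. I used word(x, 2, sep = "(") but got an error. I know R does not accept a bare parenthesis as sep. I want to use "(" as sep because the data were recorded inconsistently: on some rows there is a space between the country and the state, and on others there is none.
How can I solve this? Thanks.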
x <- c("United States (Alabama) ", "United States (California) ",
"United States (California) ", "United States (California) ",
"United States (California) ", "United States (Colorado) ",
"United States (Colorado) ", "United States (Colorado) ",
"United States(Connecticut) ", "United States(Connecticut) "
)
word(x, 2, sep = "(")
Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, :
  Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN, context=`(`)
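I assume you are using stringr::word? If so, the sep argument is interpreted as a regular expression. You can escape the "(" like this (the backslash has to be doubled inside an R string):
word(x, 2, sep = "\\(")
Or you can use the fixed function: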
word(x, 2, sep = fixed("("))
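The same escaping idea carries over to base R, in case stringr is not available. The following is only a sketch on a shortened copy of the question's data; the names by_escape, by_class, by_fixed, country and state are made up for illustration. It shows three ways of making "(" literal and then builds the two requested columns.

```r
# Shortened sample of the question's data (with and without a space before "(")
x <- c("United States (Alabama) ", "United States(Connecticut) ")

# Three ways to treat "(" as a literal character when splitting
by_escape <- strsplit(x, "\\(")              # backslash escape, doubled in an R string
by_class  <- strsplit(x, "[(]")              # character class
by_fixed  <- strsplit(x, "(", fixed = TRUE)  # no regex interpretation at all

# All three should give the same pieces
same <- identical(by_escape, by_class) && identical(by_class, by_fixed)

# First piece is the country, second piece is the state plus a trailing ")"
country <- trimws(vapply(by_fixed, `[`, character(1), 1))
state   <- trimws(sub(")", "", vapply(by_fixed, `[`, character(1), 2), fixed = TRUE))

result <- data.frame(country, state)
print(result)
```

Note that splitting on "(" alone leaves the closing ")" and surrounding spaces in place, which is why the sketch strips them afterwards.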
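I think you could try strsplit + gsub, as below:
trimws(
  gsub(
    "\\(|\\)",   # drop any remaining parentheses
    "",
    do.call(
      rbind,     # stack the split pieces into a two-column matrix
      strsplit(x,
        "((?<=\\s)\\()|(?=)\\(",
        perl = TRUE
      )
    )
  )
)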
This gives
      [,1]            [,2]
 [1,] "United States" "Alabama"
 [2,] "United States" "California"
 [3,] "United States" "California"
 [4,] "United States" "California"
 [5,] "United States" "California"
 [6,] "United States" "Colorado"
 [7,] "United States" "Colorado"
 [8,] "United States" "Colorado"
 [9,] "United States" "Connecticut"
[10,] "United States" "Connecticut"
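A possibly simpler variant of the split, offered only as a sketch: let the pattern absorb any whitespace in front of the parenthesis, so rows with and without a space are handled by the same expression. The names parts and out are hypothetical, and the data is a shortened copy of the question's vector.

```r
x <- c("United States (Alabama) ", "United States(Connecticut) ")

# Split on optional whitespace followed by a literal "("
parts <- strsplit(x, "\\s*\\(", perl = TRUE)

# Stack into a matrix, then drop the closing ")" and stray spaces
out <- trimws(gsub(")", "", do.call(rbind, parts), fixed = TRUE))
print(out)
```

This assumes every string contains exactly one "("; otherwise rbind would recycle or misalign the pieces.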
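Another approach is to use the regular expression \(([^\)]+)\), which captures the state (assuming that is what you are after).
library(gsubfn)
# the question's data is in x; backslashes are doubled inside the R string
strapplyc(x, pattern = "\\(([^\\)]+)\\)")
This produces a list:
[[1]]
[1] "Alabama"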
[[2]]
[1] "California"
[[3]]
[1] "California"
[[4]]
[1] "California"
[[5]]
[1] "California"
[[6]]
[1] "Colorado"
[[7]]
[1] "Colorado"
[[8]]
[1] "Colorado"
[[9]]
[1] "Connecticut"
[[10]]
[1] "Connecticut"
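If adding gsubfn is not an option, a similar capture can be sketched in base R with regexec and regmatches. This is an illustration on a shortened copy of the data, not the answer's own code; m and state are made-up names.

```r
x <- c("United States (Alabama) ", "United States(Connecticut) ")

# regexec returns the full match plus each capture group per string
m <- regmatches(x, regexec("\\(([^)]+)\\)", x))

# Element 1 is the full match, e.g. "(Alabama)"; element 2 is the captured state
state <- vapply(m, `[`, character(1), 2)
print(state)
```

Strings without a parenthesised part would yield NA here rather than being dropped.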
Here is a sub/scan approach.
matrix(trimws(scan(what = character(), text = sub("\\)", "", x), sep = "(")), ncol = 2, byrow = TRUE)
#Read 20 items
#      [,1]            [,2]
# [1,] "United States" "Alabama"
# [2,] "United States" "California"
# [3,] "United States" "California"
# [4,] "United States" "California"
# [5,] "United States" "California"
# [6,] "United States" "Colorado"
# [7,] "United States" "Colorado"
# [8,] "United States" "Colorado"
# [9,] "United States" "Connecticut"
#[10,] "United States" "Connecticut"
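Explanation
The code above uses scan to split each string in two at the sep character. Since sep must have length 1, one of "(" or ")" has to be chosen. I chose "(", so the other one, ")", is removed beforehand with sub. The rest is simple.
The sequence of instructions is: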
tmp <- sub("\\)", "", x)
tmp <- scan(what = character(), text = tmp, sep = "(")
tmp <- trimws(tmp)
matrix(tmp, ncol = 2, byrow = TRUE)
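The steps above can be wrapped in a small helper. The sketch below is an assumption-laden illustration: split_on_paren and the column names country and state are hypothetical, and the check that each string contains exactly one "(" is an addition, since the byrow matrix only lines up when every string yields two fields.

```r
split_on_paren <- function(x) {
  # Guard (an addition, not in the answer): exactly one "(" per string
  n_open <- lengths(regmatches(x, gregexpr("(", x, fixed = TRUE)))
  stopifnot(all(n_open == 1))

  tmp <- sub(")", "", x, fixed = TRUE)          # remove the closing parenthesis
  tmp <- scan(what = character(), text = tmp,   # split each string at "("
              sep = "(", quote = "", quiet = TRUE)
  matrix(trimws(tmp), ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("country", "state")))
}

res <- split_on_paren(c("United States (Alabama) ", "United States(Connecticut) "))
print(res)
```

Here quiet = TRUE suppresses the "Read N items" message, and quote = "" keeps scan from treating apostrophes or double quotes in the data as quoting characters.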