Parsing a string with multiple brackets

I have a dataset dt with a column "subject" that I need to parse. For example,

ID    subject   

1     USA(Texas)(Austin)
2     USA(California)(Sacramento)

From this, I would like to get the following table:

ID    subject                       Country     State        Capital   

1     USA(Texas)(Austin)            USA         Texas        Austin
2     USA(California)(Sacramento)   USA         California   Sacramento

How can I do this?

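Since there are several sets of brackets to extract data from, the regex needs a lazy (non-greedy) quantifier, .*?, so that the middle group stops at the first closing bracket instead of running on to a later one.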

library(dplyr)
library(tidyr)

# Backslashes must be doubled inside an R string literal: "\\(" matches a literal "("
extract(dt, subject, into = c("Country", "State", "Capital"),
              regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)

#  ID                     subject Country      State    Capital
#1  1          USA(Texas)(Austin)     USA      Texas     Austin
#2  2 USA(California)(Sacramento)     USA California Sacramento
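
To see why the lazy quantifier matters, here is a minimal base R sketch (not part of the answer above; the variable names x, greedy and lazy are made up for illustration). It applies a greedy and a lazy pattern to one of the sample strings and keeps the first match of each.

```r
x <- "USA(Texas)(Austin)"

# Greedy: .* runs to the LAST closing bracket in the string
greedy <- regmatches(x, regexpr("\\(.*\\)", x, perl = TRUE))

# Lazy: .*? stops at the FIRST closing bracket
lazy <- regmatches(x, regexpr("\\(.*?\\)", x, perl = TRUE))

print(greedy)  # "(Texas)(Austin)"
print(lazy)    # "(Texas)"
```

The greedy match swallows both bracketed parts at once, which is what the .*? in the extract call guards against.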

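Another option, with a simpler regex, is to replace the parentheses with spaces using gsub and then call separate with its sep argument set to whitespace.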

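dt %>%
  # Turn every "(" and ")" into a space, then drop the trailing space
  mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
  # Split on runs of whitespace; the backslash is doubled inside the R string
  separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")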
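
As a rough illustration of what each step in that pipeline does, the sketch below walks a single string through the same transformations in base R. The names x, spaced, trimmed and pieces are hypothetical, and strsplit stands in for separate here.

```r
x <- "USA(Texas)(Austin)"

# Each bracket becomes one space, so ")(" leaves two spaces and ")" a trailing one
spaced <- gsub("[()]", " ", x)

# trimws removes the trailing space
trimmed <- trimws(spaced)

# Splitting on one or more whitespace characters absorbs the double space
pieces <- strsplit(trimmed, "\\s+")[[1]]

print(spaced)   # "USA Texas  Austin "
print(pieces)   # "USA" "Texas" "Austin"
```

Note that this approach assumes the values themselves contain no spaces: a state such as "New York" would be split into two pieces.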

Data

dt <- structure(list(ID = 1:2, subject = structure(2:1, 
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"), 
class = "factor")), class = "data.frame", row.names = c(NA, -2L))