Parsing a string with multiple brackets
I have a dataset dt with a column "subject" that I need to parse. For example:
ID subject
1 USA(Texas)(Austin)
2 USA(California)(Sacramento)
I would like to end up with the following table:
ID subject Country State Capital
1 USA(Texas)(Austin) USA Texas Austin
2 USA(California)(Sacramento) USA California Sacramento
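How can I do this?
Because each string contains several bracketed groups, the quantifier inside the brackets should be lazy, so that a group stops at its own closing bracket instead of running on to a later one. tidyr's extract can then pull the three pieces out in one step.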
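library(dplyr)
library(tidyr)
# Backslashes are doubled because the pattern is written as an R string literal.
# remove = FALSE keeps the original subject column alongside the new ones.
extract(dt, subject, into = c("Country", "State", "Capital"),
        regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)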
# ID subject Country State Capital
#1 1 USA(Texas)(Austin) USA Texas Austin
#2 2 USA(California)(Sacramento) USA California Sacramento
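To make the greedy versus lazy distinction concrete, here is a minimal base R sketch on one of the example strings. In the three-group pattern above, the literal `)(` between the groups also constrains the match, so the difference is easier to see on a simpler single-group pattern. The variable names (`x`, `greedy`, `lazy`, `inside`) are only illustrative.

```r
# Greedy vs. lazy matching on one of the example strings (base R only).
x <- "USA(Texas)(Austin)"

# Greedy: .* runs on to the last closing bracket, swallowing both groups.
greedy <- regmatches(x, regexpr("\\(.*\\)", x, perl = TRUE))

# Lazy: .*? stops at the first closing bracket that lets the match succeed.
lazy <- regmatches(x, regexpr("\\(.*?\\)", x, perl = TRUE))

# Alternative: a negated character class avoids the lazy quantifier entirely
# and returns the contents of every bracket pair.
inside <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))[[1]]

print(greedy)  # the whole "(Texas)(Austin)" span
print(lazy)    # only "(Texas)"
print(inside)  # "Texas" and "Austin"
```

The negated class `[^()]*` is often a reasonable substitute for a lazy `.*?` when the delimiters are single characters, since it cannot cross a bracket at all.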
Another option with a simpler regex is to use gsub to replace the parentheses with spaces, and then call separate with whitespace as the sep argument.
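dt %>%
  # Turn every bracket into a space, then trim the trailing space
  mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
  # Split on runs of whitespace; separate() drops subject by default
  separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")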
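One caveat with splitting on whitespace is that a value containing a space would itself be split. The sketch below assumes a hypothetical "New York" row, which is not in the original data, and splits on the brackets themselves using base R only. The names `dt2`, `pieces`, `parts` and `result` are made up for illustration, and the code assumes every row has exactly three parts.

```r
# Hypothetical data: the second row is invented to show a value with a space.
dt2 <- data.frame(ID = 1:2,
                  subject = c("USA(Texas)(Austin)", "USA(New York)(Albany)"),
                  stringsAsFactors = FALSE)

# Split on runs of brackets; as.character() also covers factor columns.
# strsplit() does not return an empty piece for the trailing ")".
pieces <- strsplit(as.character(dt2$subject), "[()]+")

# Stack the pieces row by row (assumes exactly three parts per row).
parts <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(parts) <- c("Country", "State", "Capital")

# Keep the original columns and append the parsed ones.
result <- cbind(dt2, parts, stringsAsFactors = FALSE)
print(result)
```

If some rows might have fewer or more bracket groups, the `do.call(rbind, ...)` step would need a length check first, since it silently recycles short rows.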
Data
dt <- structure(list(ID = 1:2, subject = structure(2:1,
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"),
class = "factor")), class = "data.frame", row.names = c(NA, -2L))