R: extract multiple variables from a column

Question

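I'm new to R, so I apologize if this is unclear.

My data contain 1,000 observations across 3 variable columns: (a) person, (b) vignette, (c) response. The vignette column contains demographic information presented in a paragraph, including age (20, 80), sex (male, female), employment (employed, unemployed, retired), etc. Each person receives a vignette that randomly presents one value for each of these attributes.

(E.g., Person #1 received: A(n) 20 year-old male is unemployed. Person #2 received: A(n) 80 year-old female is retired. Person #3 received: A(n) 20 year-old male is unemployed ... Person #1,000 received: A(n) 20 year-old female is employed.)

I am trying to use tidyr::extract on (b) vignette to pull out the demographic information and create several new variable columns labeled "age", "sex", "employment", etc. So far I have only been able to extract "age", using the following code: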

tidyr::extract(data, vignette, c("age"), "([20:80]+)")

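I want to extract all of the demographic information and create variable columns for (b) age, (c) sex, (d) employment, etc. My goal is to have 1,000 observation rows with multiple variable columns, like this: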

(a) person, (b) age, (c) sex, (d) employment, (e) response
Person #1       20      Male       unemployed     Very Likely
Person #2       80      Female     retired        Somewhat Likely
Person #3       20      Male       unemployed     Very Unlikely
...
Person #1,000  20      Female     employed       Neither Likely nor Unlikely

Example vignette:

structure(list(Response_ID = "R_86Tm81WUuyFBZhH", Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?", Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

Thanks for any guidance or help!

Answer 1

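I wrote some regular expressions to extract your information. In my experience, you will spend a lot of time tuning the regular expressions before the results are satisfactory. For example, you will not correctly extract the employment status from a sentence such as "Neither she nor her boyfriend are employed".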
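To make the pitfall concrete, here is a minimal sketch that applies a keyword pattern for employment status to two sentences. The sentences and the names demo, job_pattern and demo_jobs are made up for illustration; they are not part of the asker's data.

```r
library(tidyr)
library(tibble)

# Hypothetical sentences, invented for this illustration
demo <- tibble(
  Vignette = c(
    "This person is employed.",
    "Neither she nor her boyfriend are employed."
  )
)

# Keyword pattern for employment status:
# (?i) = case-insensitive, (?x) = whitespace inside the pattern is ignored
job_pattern <- "(?ix) (not \\s+ employed|unemployed|employed|jobless)"

demo_jobs <- tidyr::extract(demo, Vignette, "Job", job_pattern, remove = FALSE)

# Both rows yield "employed", although the second sentence states the opposite
print(demo_jobs$Job)
```

A keyword match cannot see the negation carried by "Neither ... nor", so free-text vignettes like this would need either a more elaborate pattern or manual checking.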

library(dplyr)   # %>% and select()
library(tibble)  # add_row()
library(tidyr)   # extract()

# Example row supplied in the question
raw <- structure(list(Response_ID = "R_86Tm81WUuyFBZhH", 
                      Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?", 
                      Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

# Add a second, made-up row to exercise the patterns
raw2 <- raw %>% 
  add_row(Response_ID = "R_xesrew",
          Vignette = "A 22 year-old White boy drinks bleach.  He is unemployed",
          Response = "Unlikely")

# One extract() call per variable. (?ix) makes each pattern case-insensitive
# and ignores whitespace inside the pattern; backslashes must be doubled in R strings.
# remove = FALSE keeps Vignette available for the next call.
rzlt <- raw2 %>% 
  tidyr::extract(Vignette, "Age", "(?ix) (\\d+) \\s* year-old", remove = FALSE) %>% 
  tidyr::extract(Vignette, "Race", "(?ix) (hispanic|white|asian|black|native \\s* american)", remove = FALSE) %>% 
  tidyr::extract(Vignette, "Job", "(?ix) (not \\s+ employed|unemployed|employed|jobless)", remove = FALSE) %>% 
  tidyr::extract(Vignette, "Sex", "(?ix) (female|male|woman|man|boy|girl)", remove = FALSE) %>% 
  select(-Vignette)

which gives

# A tibble: 2 x 6
  Response_ID       Sex   Job        Race     Age   Response   
  <chr>             <chr> <chr>      <chr>    <chr> <chr>      
1 R_86Tm81WUuyFBZhH woman employed   Hispanic 18    Very Likely
2 R_xesrew          boy   unemployed White    22    Unlikely
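If the vignettes always follow the same word order, several variables can also be pulled out in a single extract() call by using one capture group per new column. The following is only a sketch under that assumption: it presumes every vignette contains "<age> year-old <race> <sex>" with single-word race and sex terms (so "Native American" would not fit), and the names vignettes and demographics are hypothetical.

```r
library(tidyr)
library(tibble)

# Shortened copies of the two example vignettes
vignettes <- tibble(
  Response_ID = c("R_86Tm81WUuyFBZhH", "R_xesrew"),
  Vignette = c(
    "A(n) 18 year-old Hispanic woman uses heroin several times a week.",
    "A 22 year-old White boy drinks bleach.  He is unemployed"
  )
)

# Three capture groups fill the three columns named in `into`, in order.
# convert = TRUE runs type conversion on the new columns, so Age becomes integer.
demographics <- tidyr::extract(
  vignettes, Vignette,
  into = c("Age", "Race", "Sex"),
  regex = "(?ix) (\\d+) \\s* year-old \\s+ (\\w+) \\s+ (\\w+)",
  convert = TRUE
)

print(demographics)
```

The one-call form is more compact but more brittle: if one group fails to match, all three columns for that row are NA, whereas separate calls fail independently.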

To save your work:

library(readr)
write_csv(rzlt, "myResponses.csv")

or

library(openxlsx)
openxlsx::write.xlsx(rzlt, "myResponses.xlsx", asTable = TRUE)
