R: extract multiple variables from a column
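I am new to R, so I apologize if this is unclear.
My data contains 1,000 observations of 3 variable columns: (a) person, (b) vignette, (c) response. The vignette column contains demographic information presented in a paragraph, including age (20, 80), sex (male, female), employment (employed, unemployed, retired), etc. Each person received a vignette that randomly presented one value for each of these attributes.
(e.g., Person #1 received: A(n) 20 year-old male is unemployed. Person #2 received: A(n) 80 year-old female is retired. Person #3 received: A(n) 20 year-old male is unemployed ... Person #1,000 received: A(n) 20 year-old female is employed.)
I am trying to use tidyr::extract on (b) vignette to pull out the remaining demographic information and create several new variable columns labeled "age", "sex", "employment", etc. So far, I have only been able to extract "age", using the following code: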
tidyr::extract(data, vignette, c("age"), "([20:80]+)")
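I would like to extract all of the demographic information and create variable columns for (b) age, (c) sex, (d) employment, etc. My goal is to have 1,000 rows of observations with several variable columns, like this: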
(a) person, (b) age, (c) sex, (d) employment (e) response
Person #1 20 Male unemployed Very Likely
Person #2 80 Female retired Somewhat Likely
Person #3 20 Male unemployed Very Unlikely
...
Person #1,000 20 Female employed Neither Likely nor Unlikely
Example vignette:
structure(list(Response_ID = "R_86Tm81WUuyFBZhH", Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?", Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
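Thanks for any guidance or help!
I wrote a few regular expressions to extract your information. Experience shows that you will spend a lot of time tweaking the regular expressions before you get satisfactory results. For example, you will not correctly extract the employment status from a sentence like "Neither she nor her boyfriend are employed".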
library(dplyr)   # %>% and select()
library(tidyr)   # extract()
library(tibble)  # add_row()

raw <- structure(list(Response_ID = "R_86Tm81WUuyFBZhH",
    Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?",
    Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

# Add a second, made-up row so the patterns are tested on more than one vignette
raw2 <- raw %>%
  add_row(Response_ID = "R_xesrew",
          Vignette = "A 22 year-old White boy drinks bleach. He is unemployed",
          Response = "Unlikely")

# (?ix) makes each pattern case-insensitive and ignores whitespace inside it.
# Backslashes must be doubled inside R string literals ("\\d", not "\d").
rzlt <- raw2 %>%
  tidyr::extract(Vignette, "Age", "(?ix) (\\d+) \\s* year-old", remove = FALSE) %>%
  tidyr::extract(Vignette, "Race", "(?ix) (hispanic|white|asian|black|native \\s* american)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Job", "(?ix) (not \\s+ employed|unemployed|employed|jobless)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Sex", "(?ix) (female|male|woman|man|boy|girl)", remove = FALSE) %>%
  select(-Vignette)
This gives:
# A tibble: 2 x 6
Response_ID Sex Job Race Age Response
<chr> <chr> <chr> <chr> <chr> <chr>
1 R_86Tm81WUuyFBZhH woman employed Hispanic 18 Very Likely
2 R_xesrew boy unemployed White 22 Unlikely
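The caveat about sentences like "Neither she nor her boyfriend are employed" can be checked directly. The following is only a minimal sketch in base R, assuming the same Job pattern as in the pipeline; the test sentences are made up for illustration and do not come from the survey data.

```r
# Hypothetical test sentences (not from the real vignettes)
sentences <- c("Neither she nor her boyfriend are employed",
               "He is not employed",
               "She is unemployed")

# Same alternation as the Job pattern; backslashes doubled for the R string
job_pattern <- "(?ix) (not \\s+ employed|unemployed|employed|jobless)"

# regexpr() finds the first match in each string; perl = TRUE enables (?ix)
m <- regexpr(job_pattern, sentences, perl = TRUE)
found <- regmatches(sentences, m)

print(found)
# The first sentence yields "employed" although it says the opposite
```

A pattern of this kind only sees keywords, not negation expressed elsewhere in the sentence, so it is worth tabulating the extracted values (for example with table()) and reading a sample of vignettes against them.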
Save your work:
library(readr)
write_csv(rzlt, "myResponses.csv")
or
library(openxlsx)
openxlsx::write.xlsx(rzlt, "myResponses.xlsx", asTable = TRUE)
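A side note on the pattern in the question: "([20:80]+)" is not a numeric range. Square brackets define a character class, so it matches any run of the characters 2, 0, : and 8, which happens to work when the only ages are 20 and 80. A minimal sketch in base R illustrates the difference; the sample strings and the helper first_match() are hypothetical and exist only for this illustration.

```r
# Hypothetical sample strings
x <- c("A(n) 20 year-old man", "A(n) 35 year-old man", "Call 2:08 today")

# Hypothetical helper: first regex match per string, NA when there is none
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Character class: picks up "20" and "2:08", but misses the age 35
class_result <- first_match("[20:80]+", x)

# Digits followed by "year-old": picks up ages only
age_result <- first_match("\\d+(?=\\s*year-old)", x)

print(class_result)
print(age_result)
```

Anchoring the digits to the words that follow them, as the Age pattern in the answer does, avoids picking up unrelated numbers elsewhere in the vignette.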