How to select the exact matches for a list of variables to append datasets
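I have a separate dataset for each wave. Each wave has its own data file and its own variable-name prefix. I am trying to import and append all the data files, keeping only the subset of variables I need. This is what I am currently doing: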
var_list <- c("pidp", "jbsat", "jbhrs", "jbnssec8_dv", "panssec8_dv", "manssec8_dv", "paedqf", "maedqf", "qfhigh", "age_dv",
"sex_dv", "psu", "strata", "employ", "jbhas", "jboff", "jbsem", "jbstat", "jbterm1", "jbterm2", "pjbptft", "fimnet_dv",
"fimngrs_dv", "fimnlabnet_dv", "seearnnet_dv", "fimnmisc_dv", "fimnprben_dv", "fimninvent_dv", "fimnpen_dv", "fimnsben_dv",
"hhtype_dv", "livesp_dv", "nch14resp", "nmpsp_dv", "tenure_dv", "urban_dv", "jbsat", "health", "sf1", "scghqa",
"scghqb", "scghqc", "scghqd", "scghqe", "scghqf", "scghqg", "scghqi", "scghqj", "scghqh", "scghql", "sclsat1",
"sclsat2", "sclsat3", "sclsat4", "indscus_lw", "indscub_xw")
Then I import the data for the first wave, selecting these variables and removing the wave prefix:
longfile <- read_dta(file = paste0(dir, "ukhls_w1/a_indresp.dta")) %>%
  select(matches(var_list)) %>%
  rename_at(vars(starts_with("a_")), ~str_replace(., "a_", "")) %>% # remove the wave prefix
  mutate(wave = 1)
At this point, I would simply use the following loop:
for (wn in 2:10) {
  wl <- paste0(letters[wn], "_")
  wave_data <- read_dta(paste0(dir, "ukhls_w", wn, "/", wl, "indresp.dta")) %>%
    select(matches(var_list)) %>%
    rename_at(vars(starts_with(wl)), ~str_replace(., wl, "")) %>% # remove the wave prefix
    mutate(wave = wn)
  longfile <- rbind(longfile, wave_data)
}
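However, the problem is that some variable names match more than one column in the files for later waves. For example, the second wave contains a variable called "nxtjbhrs", so it gets included when matching "jbhrs". This produces an error in rbind because the number of columns differs.
How can I select exact matches in this case? Or force the datasets to append?
Thanks for your support!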
select(setdiff(names(.), var_list))
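One possible approach, sketched on toy data rather than the real UKHLS files: matches() treats each string as a regular expression, so "jbhrs" also hits "b_nxtjbhrs". Building the full prefixed names and passing them to any_of() selects exact names only and silently skips variables a wave does not have. bind_rows() then fills columns that are missing in a wave with NA instead of failing like rbind(). The tibbles wave_a and wave_b, their values, and the shortened var_list are made up for illustration; the sketch also assumes pidp carries no wave prefix.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for a wave-2 file; real data would come from read_dta().
wave_b <- tibble(
  pidp       = c(1, 2),
  b_jbhrs    = c(37.5, 40),
  b_nxtjbhrs = c(35, 20),
  b_jbsat    = c(5, 6)
)

# Shortened list for the example; "qfhigh" is deliberately absent from the toy data.
var_list <- c("pidp", "jbhrs", "jbsat", "qfhigh")
wl <- "b_"

# The original approach: regex matching also picks up b_nxtjbhrs.
partial <- wave_b %>% select(matches(var_list))

# Exact names: the bare names (for unprefixed columns such as pidp)
# plus the wave-prefixed versions. any_of() ignores names that do not exist.
wanted <- c(var_list, paste0(wl, var_list))

exact <- wave_b %>%
  select(any_of(wanted)) %>%
  rename_with(~ str_remove(.x, paste0("^", wl))) %>%  # strip the prefix only at the start
  mutate(wave = 2)

# Hypothetical wave-1 result that lacks jbsat, to show how bind_rows() copes.
wave_a <- tibble(pidp = c(1, 2), jbhrs = c(38, 40), wave = 1)

# bind_rows() matches columns by name and fills the gaps with NA.
longfile <- bind_rows(wave_a, exact)
```

In the loop from the question, this would amount to replacing select(matches(var_list)) with select(any_of(c(var_list, paste0(wl, var_list)))) and rbind() with bind_rows(). Anchoring the prefix removal with "^" avoids stripping a "b_" that happens to appear in the middle of a name.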