R: Text separation into new columns, i.e. mutate in R

Question

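I have a very large dataset (189k) with more than 250 variables, where the values were entered using web-based checkboxes. Some of the variables were then combined into a single line, for example medical comorbidities:

This variable has about 1500 combinations of medical conditions like the line above. I would like to mutate it into separate columns, e.g. Col 1: Hypertension (Yes/No), Col 2: Diabetes (Yes/No)... so that the presence or absence of a pre-existing condition can be used as a predictor.

Is there a way to code this in R?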

Answer 1

It looks to me as though your data has one variable holding all of the medical conditions, separated by |, something like:

   Patient                                       Comorbids
1:  D88310                       Diabetes|Obesity (BMI>35)
2:   B9939                                            <NA>
3:   J3923                   Hypertension|Obesity (BMI>35)
4:  H09203 Hypertension|Diabetes|Chronic Pulmonary Disease

Split it with the tstrsplit() function from the data.table package, then use grepl() to score the presence of each condition for each patient:

library(data.table)

# Remove parentheses (they are regex metacharacters and would break grepl() below)
data1[, Comorbids := gsub("\\(|\\)", "", Comorbids)]

# Split the strings into individual values - unique() keeps each condition once
conditions <- unique(unlist(tstrsplit(data1[, Comorbids], "\\|")))
conditions <- conditions[!is.na(conditions)]

# Score the occurrence of each condition and add the result on to the data
data2 <- data.table(data1[, -c("Comorbids")], 
                    sapply(conditions, grepl, data1[, Comorbids]))

Giving:

   Patient Diabetes Hypertension Obesity BMI>35 Chronic Pulmonary Disease
1:  D88310     TRUE        FALSE           TRUE                     FALSE
2:   B9939    FALSE        FALSE          FALSE                     FALSE
3:   J3923    FALSE         TRUE           TRUE                     FALSE
4:  H09203     TRUE         TRUE          FALSE                      TRUE
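
Because grepl() matches substrings, a condition name that is contained in a longer one would be scored for both. A minimal sketch of a stricter variant follows, under some assumptions: the sample data1 is rebuilt by hand from the example above, per_patient and flags are names made up for illustration, and the Yes/No coding is taken from the question. It splits each row on a literal | and tests exact membership, so the parentheses do not need to be removed first.

```r
library(data.table)

# Hypothetical sample data, rebuilt by hand from the example above
data1 <- data.table(
  Patient   = c("D88310", "B9939", "J3923", "H09203"),
  Comorbids = c("Diabetes|Obesity (BMI>35)",
                NA,
                "Hypertension|Obesity (BMI>35)",
                "Hypertension|Diabetes|Chronic Pulmonary Disease")
)

# One character vector of conditions per patient;
# fixed = TRUE treats "|" as a literal character, not as regex alternation
per_patient <- strsplit(data1$Comorbids, "|", fixed = TRUE)

# All distinct conditions, dropping the NA from patients with no entry
conditions <- unique(unlist(per_patient))
conditions <- conditions[!is.na(conditions)]

# Exact membership test per condition: rows are patients, columns are conditions
flags <- sapply(conditions, function(cond) {
  vapply(per_patient, function(p) cond %in% p, logical(1))
})

# Recode TRUE/FALSE as Yes/No and attach the patient identifier
data2 <- data.table(data1[, .(Patient)],
                    as.data.table(ifelse(flags, "Yes", "No")))
```

Column names that contain spaces or symbols need backticks or [[ ]] when referenced, for example data2[["Obesity (BMI>35)"]]. On the full dataset it may be worth checking the conditions vector for stray whitespace or spelling variants before creating the columns.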
