How to classify multiple objects in an R data frame
This is just a small part of the data frame I'm working with:
id drug start stop dose unit route
2010003 Amlodipine 2009-02-04 2009-11-19 1.5 mg Oral
2010003 Amlodipine 2009-11-19 2010-01-11 1.5 mg Oral
2010004 Cefprozil 2004-03-12 2004-03-19 175 mg Oral
2010004 Clobazam 2002-12-30 2003-01-01 5 mg Oral
I have a Stata do-file that shows what I'm trying to do:
replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRILAT*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "FOSINOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "LISINOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "RAMIPRIL*")
replace class = "Acne Medication" if strmatch(upper(drug), "ADAPALENE*")
replace class = "Acne Medication" if strmatch(upper(drug), "ADAPALENE/BENZOYL PEROXIDE*")
replace class = "Acne Medication" if strmatch(upper(drug), "BENZOYL PEROXIDE*")
replace class = "Acne Medication" if strmatch(upper(drug), "BENZOYL PEROXIDE/CLINDAMYCIN*")
replace class = "Acne Medication" if strmatch(upper(drug), "ISOTRETINOIN*")
replace class = "Acne Medication" if strmatch(upper(drug), "ERYTHROMYCIN/TRETINOIN*")
replace class = "Acne Medication/Acute Promyelocytic Leukemia Medication" if strmatch(upper(drug), "TRETINOIN*")
replace class = "Alpha Agonist" if strmatch(upper(drug), "XYLOMETAZOLINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "DOXAZOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PHENOXYBENZAMINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PHENTOLAMINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PRAZOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "TAMSULOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "TERAZOSIN*")
replace class = "Alpha/Beta Blocker" if strmatch(upper(drug), "CARVEDILOL*")
replace class = "Alpha/Beta Blocker" if strmatch(upper(drug), "LABETALOL*")
replace class = "Alpha-1 Agonist" if strmatch(upper(drug), "PHENYLEPHRINE*")
replace class = "Alpha-1 Agonist" if strmatch(upper(drug), "MIDODRINE*")
replace class = "Alpha-2 Agonist" if strmatch(upper(drug), "CLONIDINE*")
replace class = "Alpha-2 Agonist" if strmatch(upper(drug), "DEXMEDETOMIDINE*")
replace class = "Anaesthetic, general" if strmatch(upper(drug), "KETAMINE*")
replace class = "Anaesthetic, general" if strmatch(upper(drug), "THIOPENTAL*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BENZOCAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BUPIVACAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BUPIVACAINE/FENTANYL*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "TETRACAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "XYLOCAINE*")
replace class = "Anaesthetic, local/Antiarrythmic" if strmatch(upper(drug), "LIDOCAINE*")
replace class = "Anaesthetic, local/Antiseptic" if strmatch(upper(drug), "HEXYLRESORCINOL*")
replace class = "Anaesthetic, topical" if strmatch(upper(drug), "LIDOCAINE/PRILOCAINE*")
replace class = "Anaesthetic, topical" if strmatch(upper(drug), "PROPARACAINE*")
replace class = "Analgesic" if strmatch(upper(drug), "ACETAMINOPHEN*")
replace class = "Analgesic" if strmatch(upper(drug), "BELLADONNA & OPIUM SUPPOSITORY*")
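I want to do the same classification in R, but I don't know Stata.
Note that a drug can have more than one class.
Any suggestions and help would be greatly appreciated.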
As a first step, I would import all the drug data from your Stata script (assuming the data is not already available in a clean format):
drug_class_data <- read.table("Desktop/stata_script", header=FALSE, sep='"',stringsAsFactors = FALSE)
drug_class_data <-drug_class_data[,c(2,4)]
colnames(drug_class_data) <- c('Drug_class','Drug')
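Remove the trailing * (used as a wildcard in the Stata script). Note that the backslash has to be doubled inside an R string:
drug_class_data$Drug <- gsub("\\*", "", drug_class_data$Drug)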
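This gives you a data frame with 2 columns ('Drug_class' and 'Drug'), built from the two quoted strings on each line of the Stata script ("ACE Inhibitor" and "CAPTOPRIL*" in the line below):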
replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")
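I would then save this as a file that you can import whenever you need it (I am assuming this data is not already available in this form, since you hard-coded all of these values in the Stata example).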
write.csv(drug_class_data, file = "drug_class_data.csv",row.names=FALSE)
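If splitting on the quote character feels fragile, the same lookup table can be sketched with a regular expression that pulls out the quoted strings directly. This is only an illustration: `stata_lines` and `lookup` are made-up names, and the three lines are a small sample standing in for the real do-file (which you would read with `readLines()`).

```r
# Hypothetical sample of the do-file; in practice: stata_lines <- readLines(<path to do-file>)
stata_lines <- c(
  'replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")',
  'replace class = "Anaesthetic, local" if strmatch(upper(drug), "BUPIVACAINE/FENTANYL*")',
  'replace class = "Analgesic" if strmatch(upper(drug), "BELLADONNA & OPIUM SUPPOSITORY*")'
)

# Pull every double-quoted string out of each line (one list element per line)
quoted <- regmatches(stata_lines, gregexpr('"[^"]*"', stata_lines))

# Strip the surrounding quotes; every line is expected to yield exactly two strings
quoted <- lapply(quoted, function(x) gsub('"', "", x, fixed = TRUE))
stopifnot(all(lengths(quoted) == 2))

# First string is the class, second is the drug pattern (wildcard removed)
lookup <- data.frame(
  Drug_class = vapply(quoted, `[`, character(1), 1),
  Drug       = sub("*", "", vapply(quoted, `[`, character(1), 2), fixed = TRUE),
  stringsAsFactors = FALSE
)
```

The `stopifnot()` check is there so that a do-file line with an unexpected shape fails loudly instead of silently shifting columns.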
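From there it depends on whether you want:
1) Multiple rows per drug, with a single text column that states the drug class explicitly. The number of rows for each drug equals the number of classes it belongs to. This approach has some advantages but results in a lot of duplicated data.
2) A single row per drug with one boolean column per drug class ("ACE Inhibitor", "Acne Medication", etc.), each holding TRUE or FALSE to indicate whether the drug is a member of that class.
Personally, I would lean towards option 2 as a starting point for downstream analysis. (As you mentioned, a drug may be classified into multiple classes, and there also appear to be hierarchies among the drug classes: 'Anaesthetic, local' could be 'Anaesthetic, local/Antiarrythmic', 'Anaesthetic, local/Antiseptic', etc.)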
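To make the two layouts concrete, here is a minimal sketch on made-up data. The data frames `lookup` and `rx`, the helper column `DRUG`, and the second class given to TRETINOIN are all assumptions for illustration, not values from your files.

```r
# Hypothetical lookup table; TRETINOIN is listed under two classes on purpose
lookup <- data.frame(
  Drug_class = c("ACE Inhibitor", "Acne Medication", "Leukemia Medication"),
  Drug       = c("CAPTOPRIL", "TRETINOIN", "TRETINOIN"),
  stringsAsFactors = FALSE
)

# Hypothetical prescriptions
rx <- data.frame(
  id   = c(1, 2, 3),
  drug = c("Captopril", "Tretinoin", "Amlodipine"),
  stringsAsFactors = FALSE
)
rx$DRUG <- toupper(rx$drug)  # match case-insensitively, like upper(drug) in Stata

# Option 1: long format, one row per (prescription, class) pair;
# unmatched drugs are kept with NA in Drug_class
long <- merge(rx, lookup, by.x = "DRUG", by.y = "Drug", all.x = TRUE)

# Option 2: wide format, one TRUE/FALSE column per class
classes <- unique(lookup$Drug_class)
flags <- sapply(classes, function(cl) {
  rx$DRUG %in% lookup$Drug[lookup$Drug_class == cl]
})
wide <- cbind(rx[c("id", "drug")], flags)
```

In the long table Tretinoin appears twice, once per class; in the wide table it stays on one row with two TRUE flags.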
Extract all the unique drug classes from your data frame into a list:
drug_class_list <- unique(drug_class_data[,1])
I would then use the ugly code below to create a new data frame:
create_flat_table <- function(df_drugs, df_classes){
  # Extract the drug classes present in the lookup table
  class_list <- unique(df_classes$Drug_class)
  results <- df_drugs
  # Loop over the classes, adding one TRUE/FALSE column per class
  for(class in class_list){
    class_drugs <- df_classes[df_classes$Drug_class == class, ]
    # Exact, case-insensitive match of the drug name against this class's drugs
    boolean_list <- toupper(df_drugs$drug) %in% class_drugs$Drug
    results <- cbind(results, boolean_list)
  }
  colnames(results) <- c(colnames(df_drugs), class_list)
  return(results)
}
# drug_data is the prescriptions data frame shown in the question
combined_df <- create_flat_table(drug_data, drug_class_data)
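The resulting data frame looks like this:
Note that in this example I altered the data so that at least one drug in your toy dataset matches a class in your abbreviated list of drug classes.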
Assume statascript and DF are as shown reproducibly in the Note at the end. Then extract the class and pattern into translate, and left join DF to it using the glob pattern pat.
translate <- read.table(text = statascript, as.is = TRUE)[c(4, 7)]
names(translate) <- c("class", "pat")
library(sqldf)
sqldf("select DF.*, translate.class
  from DF
  left join translate on upper(DF.drug) glob translate.pat")
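If you would rather avoid the sqldf dependency, roughly the same glob-style left join can be sketched in base R with `glob2rx()` from the utils package, which turns a wildcard pattern into a regular expression. The small `translate` and `DF` tables below are made up for illustration, and `hits`, `rx`, and the helper column `row` are hypothetical names.

```r
# Hypothetical pattern table in the same shape as 'translate' (class, pat)
translate <- data.frame(
  class = c("ACE Inhibitor", "Anaesthetic, local"),
  pat   = c("CAPTOPRIL*", "BUPIVACAINE*"),
  stringsAsFactors = FALSE
)

# Hypothetical prescriptions
DF <- data.frame(
  id   = c(1, 2, 3),
  drug = c("Captopril 25mg", "Bupivacaine", "Amlodipine"),
  stringsAsFactors = FALSE
)

# Convert each wildcard pattern to an anchored regular expression
rx <- glob2rx(translate$pat)

# Build one row per (DF row, matching pattern) pair
hits <- do.call(rbind, lapply(seq_along(rx), function(i) {
  idx <- grep(rx[i], toupper(DF$drug))
  data.frame(row = idx,
             class = rep(translate$class[i], length(idx)),
             stringsAsFactors = FALSE)
}))

# Left join: keep every DF row, with NA where no pattern matched
DF$row <- seq_len(nrow(DF))
result <- merge(DF, hits, by = "row", all.x = TRUE)
result$row <- NULL
```

As with the SQL version, a drug that matches several patterns produces several rows, one per class.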
Note
# just first few lines for illustration
Lines <- '
replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRILAT*")
'
Lines2 <- "
id drug start stop dose unit route
2010003 Amlodipine 2009-02-04 2009-11-19 1.5 mg Oral
2010003 Amlodipine 2009-11-19 2010-01-11 1.5 mg Oral
2010004 Cefprozil 2004-03-12 2004-03-19 175 mg Oral
2010004 Clobazam 2002-12-30 2003-01-01 5 mg Oral"
statascript <- readLines(textConnection(Lines))
DF <- read.table(text = Lines2, header = TRUE, as.is = TRUE)