Compare a list against multiple vectors, break the 'loop', and populate a new column

Question

I am new to R and looking for an answer. Over the past two weeks I have learned a lot by finding answers I could adapt. This time I am truly stuck.

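I want to populate a new variable, Abuse, based on the values in 20+ columns. The values I am looking for have a priority order, so I want

to 'break' out of the search as soon as a value is found,
populate Abuse with a string,
and restart the search at the next 'row'.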

As a SAS programmer, I would write this with a do while loop, and I am trying to learn the advantages of vectors in R.

There are 20+ diag_codes; only a few are included here.

diag_codes <- c("admitting_diagnosis", "princ_diag_code",
                "oth_diag_code_1", "oth_diag_code_2")


non_fall2_flag  <- read.table(header=TRUE, text=
                "admitting_diagnosis princ_diag_code poa_princ_diag_code oth_diag_code_1 poa_oth_diag_code_1 oth_diag_code_2

                          27651   73026   Y   99559   Y   80703
                          99550   99550   Y   85220   Y   591
                          78609   486 Y   99559   Y   1320
                          78039   78609   Y   7707    Y   99550
                          78065   99559   Y   9916    Y   3379
                          99550   99554   Y   3158    Y   1330
                          9941    9941    Y   99559   Y   2760
                          78039   99559   Y   51889   Y   V1505
                          ")

Thanks to @42-, this solution works:

non_fall2_flag$abuse <- apply(non_fall2_flag[diag_codes], 1,
    function(x) if ('99559' %in% x) {"other abuse"} else
                if ('99550' %in% x) {"unspec."} else {""})

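This led me to try a similar task that needs more flexibility, but the commented-out line does not work: the substring comparison against multiple values fails.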

diag_codes <- c("admitting_diagnosis", "princ_diag_code",   
            "oth_diag_code_1",
            "oth_diag_code_2" )




child_data <- read.table(header=TRUE, text=
                       "admitting_diagnosis princ_diag_code poa_princ_diag_code oth_diag_code_1 poa_oth_diag_code_1 oth_diag_code_2

                          27651   73026   Y   99559   Y   80103
                          99550   99550   Y   85220   Y   591
                          78609   486 Y   99559   Y   1320
                          78039   92519   Y   7707    Y   99550
                          78065   99559   Y   9916    Y   3379
                          99550   99554   Y   3158    Y   1330
                          9941    9941    Y   95901   Y   2760
                          78039   99559   Y   80389   Y   V1505
                          ")

child_data$broad <- apply(child_data[, diag_codes], 1,
                          function(x)
                            # if (substr(x,1,3) %in% c('800', '801', '803')) {1} else
                            if (any('9251' == substr(x, 1, 4))) {1} else
                            if (any('95901' == substr(x, 1, 5))) {1} else {0})

Answer 1

You have a few things to unlearn from your SAS days, but first here is a solution:

 non_fall2_flag$abuse <-  apply( non_fall2_flag[diag_codes], 1, 
       function(x) if('99559' %in% x) {"other abuse"} else 
                        if ('99550' %in% x) {"unspec."} else {""} )

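The first thing to unlearn is that R has no implicit row-oriented looping mechanism of the kind you are used to in the DATA step. The second is that ifelse is designed to return a vector, so you should not use <- inside its consequent and alternative expressions. Instead, you supply two vectors and the ifelse machinery chooses between them element by element. Any assignment should happen outside the ifelse call. If you had been working with a single column rather than testing several columns at once, you could have used ifelse.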
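To make the ifelse point concrete, here is a minimal sketch of the single-column case. The vector codes is invented for illustration and is not part of the data above. The two candidate result vectors are passed as arguments, and the only assignment sits outside the call.

```r
# Hypothetical single column of diagnosis codes (not from the question's data)
codes <- c("99559", "99550", "78609")

# ifelse() picks element-wise between the "yes" and "no" vectors;
# nesting a second ifelse() in the "no" slot gives the priority order.
abuse <- ifelse(codes == "99559", "other abuse",
                ifelse(codes == "99550", "unspec.", ""))

print(abuse)
```

Nesting works here because each test looks at one value per row; once several columns have to be checked per row, a row-wise membership test is easier to read.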

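My code uses %in% to apply a membership test to a whole row at once. When apply is called with a second argument of 1, each row in turn is passed to the formal argument of the function in the third position. Another way to handle several columns at once might be mapply, but then you would need to extract the columns individually, which would make the code much bulkier.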
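For comparison, here is a rough sketch of what the mapply route might look like on a hypothetical three-row, two-column data frame (toy is an invented name). Each column has to be passed as its own argument, which is why this becomes unwieldy with 20+ columns.

```r
# Hypothetical toy data with only two of the diagnosis columns
toy <- data.frame(
  admitting_diagnosis = c("99559", "78039", "99550"),
  princ_diag_code     = c("99550", "486",   "1330"),
  stringsAsFactors = FALSE
)

# mapply() walks the columns in parallel, one element from each per call
toy$abuse <- mapply(
  function(a, b) {
    row <- c(a, b)                          # rebuild the "row" by hand
    if ("99559" %in% row) "other abuse"     # highest priority first
    else if ("99550" %in% row) "unspec."
    else ""
  },
  toy$admitting_diagnosis, toy$princ_diag_code,
  USE.NAMES = FALSE
)

print(toy)
```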

I modified your data sample so that at least two rows satisfy your tests, and then this succeeded:

non_fall2_flag$broad <- apply(non_fall2_flag[, diag_codes], 1,
                              function(x)
                                if (any('9251' == substr(x, 1, 4))) {1} else
                                if (any('95901' == substr(x, 1, 5))) {1} else {0})
non_fall2_flag

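Note that the any function collapses a set of logical tests into a single value, whereas your code would have tested only the first value of the vector returned by the comparison.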
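Applied to the commented-out line from the question, the same idea might look like the sketch below. Two things here are assumptions rather than part of the original answer: the columns are read with colClasses = "character" so that codes keep their exact text (apply converts the data frame to a character matrix, and numeric columns can otherwise pick up padding), and the three tests are kept as separate branches even though they all return 1. In R 4.2.0 and later, an if condition longer than one element is an error rather than a warning, so wrapping the test in any is required there.

```r
diag_codes <- c("admitting_diagnosis", "princ_diag_code",
                "oth_diag_code_1", "oth_diag_code_2")

# Assumption: read every column as character so codes are compared as text
child_data <- read.table(header = TRUE, colClasses = "character", text =
"admitting_diagnosis princ_diag_code poa_princ_diag_code oth_diag_code_1 poa_oth_diag_code_1 oth_diag_code_2
27651 73026 Y 99559 Y 80103
99550 99550 Y 85220 Y 591
78609 486   Y 99559 Y 1320
78039 92519 Y 7707  Y 99550
78065 99559 Y 9916  Y 3379
99550 99554 Y 3158  Y 1330
9941  9941  Y 95901 Y 2760
78039 99559 Y 80389 Y V1505
")

child_data$broad <- apply(child_data[, diag_codes], 1, function(x) {
  # any() reduces the per-column tests to one TRUE/FALSE for the row
  if (any(substr(x, 1, 3) %in% c("800", "801", "803"))) 1
  else if (any(substr(x, 1, 4) == "9251")) 1
  else if (any(substr(x, 1, 5) == "95901")) 1
  else 0
})

print(child_data[, c(diag_codes, "broad")])
```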

Answer 2

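If I understand the logic of your question/code correctly:

if '99559' is present, then abuse <- "other abuse"
else if '99550' is present, then abuse <- "unspec."
otherwise abuse <- ""

Here is some concise vectorized code that handles this.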

# put the codes into a matrix for faster processing
myMat <- sapply(non_fall2_flag[, diag_codes],
                function(i) as.integer(gsub("[^0-9]+", "", i)))
# get indicators for both codes
check_1 <- as.integer(rowSums(myMat == 99559) > 0)
check_2 <- as.integer(rowSums(myMat == 99550) > 0)

# fill in variable
non_fall2_flag$abuse <-
                   c("", "unspec.", "other abuse")[pmax(1, 2*check_2, 3*check_1)]

The last line uses the two check vectors to select the string for each row. pmax(1, 2*check_2, 3*check_1) evaluates to 3 when 99559 is present, 2 when only 99550 is present, and 1 otherwise, which follows the priority described above.

This returns

non_fall2_flag
  admitting_diagnosis princ_diag_code poa_princ_diag_code oth_diag_code_1 poa_oth_diag_code_1 oth_diag_code_2       abuse
1               27651           73026                   Y           99559                   Y           80703 other abuse
2               99550           99550                   Y           85220                   Y             591     unspec.
3               78609             486                   Y           99559                   Y            1320 other abuse
4               78039           78609                   Y            7707                   Y           99550     unspec.
5               78065           99559                   Y            9916                   Y            3379 other abuse
6               99550           99554                   Y            3158                   Y            1330     unspec.
7                9941            9941                   Y           99559                   Y            2760 other abuse
8               78039           99559                   Y           51889                   Y           V1505 other abuse
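The indexing trick can be checked on a tiny hand-made case. The vectors below are invented for illustration, and the label order assumes the priority from the question (99559 maps to "other abuse" and outranks 99550).

```r
# Hypothetical indicator vectors for four rows
check_1 <- c(1L, 0L, 1L, 0L)   # 1 if 99559 appears in the row
check_2 <- c(0L, 1L, 1L, 0L)   # 1 if 99550 appears in the row

# Index 3 wins when 99559 is present, 2 when only 99550 is, 1 otherwise
idx <- pmax(1, 2 * check_2, 3 * check_1)

labels <- c("", "unspec.", "other abuse")
result <- labels[idx]

print(idx)
print(result)
```

Row 3 has both codes, and because 3 is larger than 2 the higher-priority label is chosen, which plays the role of the 'break' in a loop.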
