R data.table: how to replace positive values with column names across multiple binary data columns

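I am using R v. 3.2.1 and data.table v 1.9.6. I have a data.table like the example below. It contains some coded binary columns, classed as character with the values "0" and "1", and a vector of string phrases that contain some of the same words as the binary column names. My end goal is to create a word cloud using both the words in the string vector and the positive responses in the binary vectors. To do this, I first need to convert the positive responses in the binary vectors to their column names, and this is where I am stuck.

A similar question has been asked here, but it is not quite the same: that poster starts with a matrix, and the suggested solution does not seem to work for a more complex data set. I also have columns other than my binary columns, so the solution needs to accurately identify my binary columns first.

Here is some example data: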

library(data.table)

id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)
dt <- data.table(df)
dt[, id := as.numeric(id)]

The data looks like this:

   id age apple pear banana                                          favfood
1:  1   5     0    1      0                                i love pear juice
2:  2   1     1    1      1 i eat chinese pears and crab apples every sunday
3:  3  11    NA    1      1                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

So, for each row, the word cloud should count a frequency of 1 for apples if apple==1, or if favfood contains the string "apple", or both, and so on for the other words.
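The counting rule above could be sketched as follows for a single word, in base R. This is only an illustration under stated assumptions: the names flagged, mentioned and apple_count are hypothetical, NA flags are treated as "not ticked", and the text match is a plain substring match (so "apples" also counts as "apple").

```r
# Toy vectors mirroring the example: a binary flag column and a free-text column.
apple   <- c("0", "1", NA, "1", "0")
favfood <- c("i love pear juice",
             "i eat chinese pears and crab apples every sunday",
             "i also like apple tart",
             "i like crab apple juice",
             "i hate most fruit except bananas")

# TRUE where the flag is "1"; NA flags are treated as not ticked (assumption).
flagged <- !is.na(apple) & apple == "1"

# TRUE where the text mentions the word (fixed = TRUE: plain substring match).
mentioned <- grepl("apple", favfood, fixed = TRUE)

# Each row contributes at most 1, whether one or both conditions hold.
apple_count <- as.integer(flagged | mentioned)
apple_count       # 0 1 1 1 0
sum(apple_count)  # 3
```

Row 2 satisfies both conditions but is still counted only once, which is the behaviour described above.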

Here is my attempt (it does not do what I need, but it gets about halfway there):

# First define the logic columns.
# I've done this by name here but in my real data set this won't work because there are too many    
logicols <- c("apple", "pear", "banana")

# Next identify the location of the "1"s within the subset of logic columns:
ones <- which(dt==1 & colnames(dt) %in% logicols, arr.ind=T)

# Lastly, convert the "1"s in the subset to their column names:
dt[ones, ]<-colnames(dt)[ones[,2]]

This gives:

> dt
   id age apple pear banana                                          favfood
1:  1   5     0 pear      0                                i love pear juice
2:  2   1     1 pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA    1 banana                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

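There are two problems with this approach:

(a) Identifying the columns to convert by name is not practical for my real data set because there are too many of them. How can I identify this subset of columns without including other columns that contain 1s but also have other values (in this example "age" contains a 1 but is clearly not a logic column)? I deliberately coded "age" as a character column in the example, because my real data set has character columns that contain 1s but are not logic columns. What sets my logic columns apart is that they are character columns containing only the values 0, 1, or missing (NA).

(b) The index does not pick up all the 1s in the logic columns. Does anyone know why (e.g. the 1 in the second row of the "apple" column is not converted)?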
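Both points can be explored with a small base R sketch, offered as a likely explanation rather than a definitive diagnosis. For (a), a column is flagged when it is character and holds only "0", "1" or NA. For (b), the column test `colnames(dt) %in% logicols` has length 6, while `dt==1` is a 5 x 6 matrix, so `&` appears to recycle the short vector down the columns instead of across them. The toy data.frame and the names is_binary_chr, logicols_auto, col_mask, recycled and by_col are assumptions made for illustration.

```r
# Toy copy of the example as a plain data.frame (assumption: same layout as dt).
df <- data.frame(
  id      = c(1, 2, 3, 4, 5),
  age     = c("5", "1", "11", "20", "21"),
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("a", "b", "c", "d", "e"),  # placeholder text, not used here
  stringsAsFactors = FALSE
)

# (a) Flag character columns whose values are all "0", "1" or NA.
is_binary_chr <- function(x) is.character(x) && all(x %in% c("0", "1", NA))
logicols_auto <- names(df)[vapply(df, is_binary_chr, logical(1))]
logicols_auto   # "apple" "pear" "banana"

# (b) The column test is a length-6 vector, while df == 1 is a 5 x 6 matrix.
# `&` recycles the vector down the columns, which is what matrix() does here:
col_mask <- colnames(df) %in% logicols_auto
recycled <- matrix(col_mask, nrow = nrow(df), ncol = ncol(df),
                   dimnames = list(NULL, colnames(df)))
recycled[2, "apple"]   # FALSE, so the "1" in row 2 of apple is masked out

# Laying the mask out by row lines it up with the columns instead.
by_col <- matrix(col_mask, nrow = nrow(df), ncol = ncol(df), byrow = TRUE)
ones <- which((df == "1") & by_col, arr.ind = TRUE)
nrow(ones)   # 8 positive responses across apple, pear and banana
```

Note that "age" is rejected because it contains values other than "0" and "1", and "id" is rejected because it is not a character column. A character column that is entirely NA would be flagged by this test, so real data may need an extra check.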

Many thanks for your help - I'm sure I'm missing something relatively simple, but I am stuck on this.

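Thanks to @Frank for pointing out that the logic/binary columns should be converted to the correct class with as.logical().

This greatly simplifies identifying the values to change, and the indexing now seems to work properly as well: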

# Starting with the data in its original format:
id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)

# Convert the "0" / "1" character columns to logical with a function:

recode.multi <- function(data, recode.cols, old.var, new.var, format = as.numeric) {
  # function to recode multiple columns
  #
  # Args:        data: a data.frame
  #       recode.cols: a character vector containing the names of those
  #                    columns to recode
  #           old.var: a character vector containing values to be recoded
  #           new.var: a vector containing desired recoded values
  #            format: a function describing the desired format, e.g.
  #                    as.character, as.numeric, as.factor, etc.

  # check old.var and new.var are of equal length
  if (length(old.var) != length(new.var)) {
    stop("'old.var' and 'new.var' are of differing lengths")
  }

  # convert format of selected columns to character
  if (length(recode.cols) == 1) {
    data[, recode.cols] <- as.character(data[, recode.cols])
  } else {
    data[, recode.cols] <- data.frame(lapply(data[, recode.cols], as.character), stringsAsFactors = FALSE)
  }

  # recode old values to new values for selected columns
  for (i in seq_along(old.var)) {
    data[, recode.cols][data[, recode.cols] == old.var[i]] <- new.var[i]
  }

  # convert recoded columns to desired format
  data[, recode.cols] <- sapply(data[, recode.cols], format)

  data
}

df <- recode.multi(data = df, recode.cols = c("apple", "pear", "banana"), old.var = c("0", "1", NA), new.var = c(FALSE, TRUE, NA), format = as.logical)

dt <- data.table(df)
dt[, id := as.numeric(id)]

# Identify the values to swap with column names:
convtoname <- which(dt==TRUE, arr.ind=T)

# Make the swap:
dt[convtoname, ]<-colnames(dt)[convtoname[,2]]

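This gives the desired result for the binary columns, although the id in row 1 is also replaced, since 1 == TRUE in R: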

> dt
   id age apple  pear banana                                          favfood
1: id   5 FALSE  pear  FALSE                                i love pear juice
2:  2   1 apple  pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA  pear banana                           i also like apple tart
4:  4  20 apple FALSE     NA                          i like crab apple juice
5:  5  21 FALSE FALSE banana                 i hate most fruit except bananas
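A possible way to avoid touching non-binary columns such as id is to restrict the replacement to columns whose class is logical and update them one at a time with data.table's set(). This is a minimal sketch, not the method used above. It assumes the binary columns have already been converted to logical, and it chooses to turn FALSE and NA into NA_character_ rather than keeping the string "FALSE", which may or may not suit the word cloud step.

```r
library(data.table)

# Toy data with the binary columns already converted to logical (assumption).
dt <- data.table(
  id     = c(1, 2, 3, 4, 5),
  age    = c("5", "1", "11", "20", "21"),
  apple  = c(FALSE, TRUE, NA, TRUE, FALSE),
  pear   = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  banana = c(FALSE, TRUE, TRUE, NA, TRUE)
)

# Only logical columns are selected, so the numeric id 1 is never compared to TRUE.
logicols <- names(dt)[vapply(dt, is.logical, logical(1))]

for (col in logicols) {
  flag <- dt[[col]]
  # TRUE -> column name; FALSE and NA -> NA_character_ (illustrative choice).
  set(dt, j = col, value = ifelse(flag %in% TRUE, col, NA_character_))
}

dt
```

After the loop, apple, pear and banana are character columns holding either their own column name or NA, while id and age are unchanged.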