Coding data for a neural net

I'm having trouble preparing my data for use in a neural network. My table looks like this:

drug.name    molecular.target         molecular.weight

drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110
                     (...)

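As I found out earlier, to be able to use this data I need to convert it into dummy variables. I don't know how to handle multiple values in a single column. The goal is a matrix like this: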

drug.name molecular.weight  target1  target2 target3(...)

drug1     225               1        0       0
drug2     210               0        1       1 
                          (...)

The data set is very large, so I can't create and fill in the new columns by hand.

Hope you understand me ;) Sebastian

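This is a "hacky" solution, but it seems to work on the small data set I tested. It produces a lot of warnings, and I'm not sure how to make them go away. If anyone can suggest simplifications, I'll update the answer.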

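Note: see the solution under EDIT below, which addresses the change made to the original question. The initial solution, shown first, answers the question as it was originally asked.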

library(dplyr)
library(tidyr)

### Using this input data set
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name    molecular.target         molecular.weight
drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110')
drug_df

##  drug.name        molecular.target molecular.weight
##1     drug1                 target1              225
##2     drug2         target2,target3              210
##3     drug3         target4,target1              120
##4     drug4 target1,target2,target3              110

### Process the input data frame
# Collect every distinct target name; base strsplit() is used so that stringr does not need to be loaded
targetset <- sort(unique(unlist(strsplit(drug_df$molecular.target, ','))))

drug_df_new <-
    drug_df %>%
    separate(molecular.target, targetset, ',')      %>% # Create new target columns
    gather(key, val, -drug.name, -molecular.weight) %>% # Put in long format for further manipulation
    select(-key)                                    %>% # Remove key column, it isn't needed.
    filter(!is.na(val))                             %>% # Only want drug targets, not empty targets
    rename(key = val)                               %>% # "val" will be used as new "key"
    group_by(drug.name)                             %>% # Group to get target count
    mutate(val = n())                               %>% # set val to target count
    spread(key, val, fill = 0)                          # Put in final format

drug_df_new
##  drug.name molecular.weight target1 target2 target3 target4
##      (chr)            (int)   (dbl)   (dbl)   (dbl)   (dbl)
##1     drug1              225       1       0       0       0
##2     drug2              210       0       2       2       0
##3     drug3              120       2       0       0       2
##4     drug4              110       3       3       3       0

EDIT

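The alternative solution below addresses the change made to the original post. It produces the result specified in the edited version of the post. Once it has been confirmed to work on larger data sets, the solution above will be removed.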

drug_df_new <-
     drug_df %>%
     separate(molecular.target, targetset, ',')      %>% # Create new target columns
     gather(key, val, -drug.name, -molecular.weight) %>% # Put in deep format for further manipulation
     mutate(new_val = ifelse(!is.na(val), 1, 0))     %>% # Create the new value 0 or 1
     select(-key)                                    %>% # Remove key column, it isn't needed.
     filter(!is.na(val))                             %>% # Remove lines where no target exists
     spread(val, new_val, fill = 0)                      # Put in wide format.

##  drug.name molecular.weight target1 target2 target3 target4
##1     drug1              225       1       0       0       0
##2     drug2              210       0       1       1       0
##3     drug3              120       1       0       0       1
##4     drug4              110       1       1       1       0
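
As one possible simplification, the same 0/1 encoding can be sketched in base R, which avoids the warnings from separate(). This is only a sketch under the assumption that targets are always separated by a plain comma with no surrounding spaces; the names target_list, dummies, encoded and nn_input are made up for illustration and are not part of the code above.

```r
# Same toy input as above, built directly so the sketch is self-contained
drug_df <- data.frame(
  drug.name        = c("drug1", "drug2", "drug3", "drug4"),
  molecular.target = c("target1", "target2,target3",
                       "target4,target1", "target1,target2,target3"),
  molecular.weight = c(225L, 210L, 120L, 110L),
  stringsAsFactors = FALSE
)

# One character vector of targets per drug (assumes "," is the only separator)
target_list <- strsplit(drug_df$molecular.target, ",", fixed = TRUE)

# Every distinct target, sorted, becomes one dummy column
targetset <- sort(unique(unlist(target_list)))

# For each drug, flag which of the known targets it hits (1) or not (0)
dummies <- do.call(rbind, lapply(target_list, function(targets) {
  as.integer(targetset %in% targets)
}))
colnames(dummies) <- targetset

# Combine the untouched columns with the dummy columns
encoded <- cbind(drug_df[c("drug.name", "molecular.weight")], dummies)

# Numeric matrix without the name column, e.g. as input for a neural net
nn_input <- as.matrix(encoded[, -1])
rownames(nn_input) <- encoded$drug.name
```

Because each row is computed independently with %in%, a drug with a single target needs no special handling, and the number of dummy columns follows from the data rather than being fixed in advance.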