Coding data for neural net
I'm having trouble preparing my data for use in a neural network. My table looks like this:
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110
(...)
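From what I have found so far, to be able to use this data I need to convert it into dummy variables. I don't know how to handle multiple values in one column. The goal is a matrix like this: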
drug.name molecular.weight target1 target2 target3(...)
drug1 225 1 0 0
drug2 210 0 1 1
(...)
The data set is very large, so I can't create and fill in the new columns by hand.
I hope you can understand me ;)
Sebastian
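This is a "hacky" solution, but it seems to work on the small data set I tested. It produces a lot of warnings, and I'm not sure how to make them go away. If anyone can suggest a simplification, I will update the answer.
Note: see the solution under EDIT below, which addresses the changes made to the original question. The initial solution below answers the question as it was first asked.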
library(dplyr)
library(tidyr)
### Using this input data set
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110')
drug_df
## drug.name molecular.target molecular.weight
##1 drug1 target1 225
##2 drug2 target2,target3 210
##3 drug3 target4,target1 120
##4 drug4 target1,target2,target3 110
### Process the input data frame
# All distinct targets, sorted (base R strsplit() splits every entry on commas)
targetset <- sort(unique(unlist(strsplit(drug_df$molecular.target, ",", fixed = TRUE))))
drug_df_new <-
  drug_df %>%
  separate(molecular.target, targetset, ',') %>%        # Create new target columns
  gather(key, val, -drug.name, -molecular.weight) %>%   # Put in deep format for further manipulation
  select(-key) %>%                                      # Remove key column, it isn't needed.
  filter(!is.na(val)) %>%                               # Only want drug targets, not empty targets
  rename(key = val) %>%                                 # "val" will be used as new "key"
  group_by(drug.name) %>%                               # Group to get target count
  mutate(val = n()) %>%                                 # set val to target count
  spread(key, val, fill = 0)                            # Put in final format
drug_df_new
## drug.name molecular.weight target1 target2 target3 target4
## (chr) (int) (dbl) (dbl) (dbl) (dbl)
##1 drug1 225 1 0 0 0
##2 drug2 210 0 2 2 0
##3 drug3 120 2 0 0 2
##4 drug4 110 3 3 3 0
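The values above are counts rather than 0/1 flags because `mutate(val = n())` writes each drug's total number of targets into every one of its target cells. If only presence or absence is wanted, one way to get there from this table is to threshold the target columns. The following is a minimal base R sketch, assuming the output above has been re-entered by hand; the names `counts_df`, `target_cols` and `binary_df` are hypothetical and not part of the original answer.

```r
# Output of the first solution, re-entered by hand for illustration
counts_df <- data.frame(
  drug.name        = c("drug1", "drug2", "drug3", "drug4"),
  molecular.weight = c(225L, 210L, 120L, 110L),
  target1          = c(1, 0, 2, 3),
  target2          = c(0, 2, 0, 3),
  target3          = c(0, 2, 0, 3),
  target4          = c(0, 0, 2, 0),
  stringsAsFactors = FALSE
)

# Every column except the two identifier columns is treated as a target column
target_cols <- setdiff(names(counts_df), c("drug.name", "molecular.weight"))

# Any positive count becomes 1, zero stays 0
binary_df <- counts_df
binary_df[target_cols] <- lapply(counts_df[target_cols],
                                 function(x) as.integer(x > 0))

binary_df
```

This leaves `drug.name` and `molecular.weight` untouched and only rewrites the target columns.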
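EDIT
The alternative solution below addresses the changes made to the original post. It produces the result specified in the edited version of the post. Once it is confirmed that the solution below works on larger data sets, the solution above will be removed.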
drug_df_new <-
  drug_df %>%
  separate(molecular.target, targetset, ',') %>%        # Create new target columns
  gather(key, val, -drug.name, -molecular.weight) %>%   # Put in deep format for further manipulation
  mutate(new_val = ifelse(!is.na(val), 1, 0)) %>%       # Create the new value 0 or 1
  select(-key) %>%                                      # Remove key column, it isn't needed.
  filter(!is.na(val)) %>%                               # Remove lines where no target exists
  spread(val, new_val, fill = 0)                        # Put in wide format, one column per target
drug_df_new
## drug.name molecular.weight target1 target2 target3 target4
##1 drug1 225 1 0 0 0
##2 drug2 210 0 1 1 0
##3 drug3 120 1 0 0 1
##4 drug4 110 1 1 1 0
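As a possible simplification, the same 0/1 table can probably be reached without the `separate()` step, which is where the warnings come from (it is asked for four pieces per row and most rows have fewer). The sketch below is an alternative, not the answer's own method: it assumes a tidyr version that provides `separate_rows()`, and the names `drug_df_wide` and `present` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Same input data set as in the answer
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110')

drug_df_wide <-
  drug_df %>%
  separate_rows(molecular.target, sep = ",") %>%  # One row per (drug, target) pair
  distinct() %>%                                  # Guard against a target listed twice for one drug
  mutate(present = 1) %>%                         # Indicator value for every existing pair
  spread(molecular.target, present, fill = 0)     # One column per target, 0 where the pair is absent

drug_df_wide
```

On the four example rows this should give the same indicator columns as the table above. Because each drug is expanded into one row per target before spreading, no list of target names has to be built beforehand.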