Find unique values in a comma-separated character vector, then one-hot encode them
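Basically, I have a vector of strings in which each element is a comma-separated list. I want to one-hot encode it using the unique values found in those strings. I believe I first have to find the unique values (splitting on commas) to use as columns before one-hot encoding, but I'm not sure. For example, suppose I have the following character vector: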
people_names
Bob,Megan,Mike,Sarah
Mike,Sarah
Megan,Sarah
Bob
I would like to create a one-hot encoded data frame corresponding to this vector, like so:
Bob Megan Mike Sarah
1 1 1 1
0 0 1 1
0 1 0 1
1 0 0 0
Thanks for your help. Much appreciated.
people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")
library(tidyverse)
data.frame(people_names) %>%                    # create a data frame
  mutate(id = row_number(),                     # add row id (needed for reshaping)
         value = 1) %>%                         # add a column of 1s to denote existence
  separate_rows(people_names, sep = ",") %>%    # one row per name, split on commas only
  spread(people_names, value, fill = 0) %>%     # reshape to wide, filling absences with 0
  select(-id)                                   # remove row id
# Bob Megan Mike Sarah
# 1 1 1 1 1
# 2 0 0 1 1
# 3 0 1 0 1
# 4 1 0 0 0
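Note that spread() is superseded in current tidyr releases in favor of pivot_wider(). What follows is a minimal sketch of the same pipeline using it, assuming tidyr 1.0.0 or later is installed. The name one_hot is only for illustration and does not come from the answer above.

```r
library(dplyr)
library(tidyr)

people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")

one_hot <- tibble(people_names) %>%
  mutate(id = row_number(),                      # row id keeps the original elements apart
         value = 1L) %>%                         # 1 marks that a name is present
  separate_rows(people_names, sep = ",") %>%     # one row per (id, name) pair
  pivot_wider(names_from = people_names,         # one column per unique name
              values_from = value,
              values_fill = list(value = 0L)) %>% # absent names become 0
  select(-id)                                    # drop the helper id

one_hot
```

One difference to be aware of: pivot_wider() orders the new columns by first appearance, whereas spread() sorts them. For this data the result is the same because the first element already lists every name in alphabetical order.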
As an alternative, the splitstackshape package has a helper function you may find useful. The output is a matrix.
splitstackshape:::charMat(strsplit(people_names, ","), fill = 0L)
# Bob Megan Mike Sarah
#[1,] 1 1 1 1
#[2,] 0 0 1 1
#[3,] 0 1 0 1
#[4,] 1 0 0 0
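Since charMat is not exported (hence the :::), it may be worth having a version that does not depend on package internals. The following base R sketch follows the idea from the question: split on commas, collect the unique values, then test membership for each element. The names name_list, all_names and one_hot are illustrative, and this is not how charMat itself is implemented.

```r
people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")

# Split each string on literal commas: one character vector of names per element
name_list <- strsplit(people_names, ",", fixed = TRUE)

# Unique names across all elements, sorted to give a stable column order
all_names <- sort(unique(unlist(name_list)))

# For each element, mark which of the unique names it contains (1) or not (0),
# then bind the resulting integer vectors together as rows of a matrix
one_hot <- do.call(rbind, lapply(name_list, function(x) as.integer(all_names %in% x)))
colnames(one_hot) <- all_names

one_hot
```

If a data frame is needed rather than a matrix, as.data.frame(one_hot) converts it.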
From the same package, you can also try cSplit_e:
library(splitstackshape)
out <- cSplit_e(
  data.frame(people_names),
  split.col = "people_names",
  sep = ",",
  mode = "binary",
  type = "character",
  fill = 0L,
  drop = TRUE
)
# remove prefix of column names
(out <- setNames(out, sub("people_names_", "", names(out), fixed = TRUE)))
Data
people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")