Map one Dataframe to a second Dataframe

Question

I have two data frames and want to map one onto the other, producing a binary value: 1 if the combination is present, 0 otherwise.

First DF

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
112.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
113.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
114.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
115.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
116.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Second DF

id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1

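The first row of the second DF's cats column contains 1,7,1. Here 1 and 7 combine to form the column name 1_7, so the binary value 1 goes into that column and 0 into the remaining columns, and the last number (1) goes into the 1_freq column. If a row has more than one category, such as 1,10,1|1,6,2, then 1,10,1 maps to column 1_10 and 1,6,2 maps to column 1_6, and the frequencies of both categories are summed and go into the 1_freq column.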
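The rule described above can be sketched for a single cats string in base R. The helper name parse_cats is hypothetical and does not appear in either answer below; it assumes every category consists of exactly three comma-separated fields.

```r
# Hypothetical helper: turn one `cats` string into the target column names
# and the summed frequency. Assumes each category looks like "prefix,number,freq".
parse_cats <- function(x) {
  # split into categories at "|", then split each category at ","
  parts <- strsplit(strsplit(x, "|", fixed = TRUE)[[1]], ",", fixed = TRUE)
  list(
    cols = vapply(parts, function(p) paste(p[1], p[2], sep = "_"), character(1)),
    freq = sum(as.integer(vapply(parts, `[`, character(1), 3)))
  )
}

res <- parse_cats("1,10,1|1,6,2")
res$cols  # "1_10" "1_6"
res$freq  # 3
```

The columns named in res$cols would receive a 1, and res$freq would go into 1_freq.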

The resulting DF should look like this

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  0     0     0     0     0     0     1     0     0     0     1
112.txt  1     0     1     0     0     0     0     0     0     0     4
113.txt  0     0     0     0     0     1     0     0     0     1     3
114.txt  0     0     0     1     0     0     0     0     0     0     2
115.txt  0     0     0     0     1     0     0     0     0     0     1
116.txt  1     0     0     0     0     0     0     1     0     0     3

I hope the question is clear. Thanks.

Answer 1

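Here is an option using the tidyverse. We expand the rows of the dataset by splitting the 'cats' column at |, then use separate to split 'cats' into two columns at the last comma. Grouped by 'id', we get the sum of the 'freq' column, extract the number at the end of 'cats', convert it to a factor with the levels specified, and create a column of 1s ('val'). Finally, spread reshapes the data to 'wide' format and rename_at prefixes the new column names with '1_'.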

library(tidyverse)
o1 <- df2 %>% 
       # one row per category
       separate_rows(cats, sep = "[|]") %>% 
       # split at the last comma into category and frequency
       separate(cats, into = c('cats', 'freq'), 
           sep=",(?=[^,]+$)", convert = TRUE) %>%
       group_by(id) %>%
       mutate(freq = sum(freq), 
              cats = factor(str_extract(cats, "\\d+$"), levels = 1:10), 
              val = 1)  %>% 
       # reshape to wide format, filling missing cells with 0
       spread(cats, val, fill = 0) %>% 
       rename_at(-1, ~ paste0('1_', .))
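To see what the two regular expressions in the pipeline do, here is a small base R sketch on a single value. It uses strsplit() and regmatches() with perl = TRUE as stand-ins for separate() and str_extract(); that substitution is an assumption made for illustration and is not part of the answer.

```r
x <- "1,10,1"

# Split only at the last comma: the lookahead requires that no further comma
# follows before the end of the string.
pieces <- strsplit(x, ",(?=[^,]+$)", perl = TRUE)[[1]]
pieces  # "1,10" "1"

# Keep the trailing run of digits of the first piece (the category number).
category <- regmatches(pieces[1], regexpr("\\d+$", pieces[1], perl = TRUE))
category  # "10"
```

Note that the backslash has to be doubled inside an R string literal, which is why the pattern is written as "\\d+$".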

Now we assign the values to the columns of the initial dataset ('df1') that have the same names

df1[is.na(df1)] <- 0
df1[names(o1)[-1]] <- o1[-1]
df1
#       id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
#1 111.txt   0   0   0   0   0   0   1   0   0    0      1
#2 112.txt   1   0   1   0   0   0   0   0   0    0      4
#3 113.txt   0   0   0   0   0   1   0   0   0    1      3
#4 114.txt   0   0   0   1   0   0   0   0   0    0      2
#5 115.txt   0   0   0   0   1   0   0   0   0    0      1
#6 116.txt   1   0   0   0   0   0   0   1   0    0      3
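The assignment step matches columns by name, so columns of df1 that do not occur in the reshaped result simply keep their 0. A minimal base R sketch with made-up toy frames (target and partial are hypothetical names, not objects from the answer) may make this clearer. Like the answer, it assumes both frames list the ids in the same row order.

```r
# Toy target with three indicator columns, all NA initially
target <- data.frame(id = c("a", "b"), `1_1` = NA, `1_2` = NA, `1_3` = NA,
                     check.names = FALSE)
# Toy result that contains only some of the columns
partial <- data.frame(id = c("a", "b"), `1_1` = c(1, 0), `1_3` = c(0, 1),
                      check.names = FALSE)

target[is.na(target)] <- 0                 # default every cell to 0
target[names(partial)[-1]] <- partial[-1]  # overwrite the matching columns only
target
```

Column 1_2 is absent from partial, so it stays at 0 in target.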

Data

df1 <- structure(list(id = c("111.txt", "112.txt", "113.txt", "114.txt", 
"115.txt", "116.txt"), `1_1` = c(NA, NA, NA, NA, NA, NA), `1_2` = c(NA, 
NA, NA, NA, NA, NA), `1_3` = c(NA, NA, NA, NA, NA, NA), `1_4` = c(NA, 
NA, NA, NA, NA, NA), `1_5` = c(NA, NA, NA, NA, NA, NA), `1_6` = c(NA, 
NA, NA, NA, NA, NA), `1_7` = c(NA, NA, NA, NA, NA, NA), `1_8` = c(NA, 
NA, NA, NA, NA, NA), `1_9` = c(NA, NA, NA, NA, NA, NA), `1_10` = c(NA, 
NA, NA, NA, NA, NA), `1_freq` = c(NA, NA, NA, NA, NA, NA)),
    class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(id = c("111.cats", "112.cats", "113.cats", "114.cats", 
"115.cats", "116.cats"), cats = c("1,7,1", "1,1,2|1,3,2", "1,10,1|1,6,2", 
"1,4,2", "1,5,1", "1,1,2|1,8,1")), class = "data.frame", row.names = c(NA, 
-6L))

Answer 2

Although the question is tagged with dplyr, I was curious what a data.table answer would look like.

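As df1 is filled with NA except for the id column, and the id columns differ only in the trailing part (txt vs. cats), the answer below suggests creating df1 entirely from the data contained in df2: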

library(data.table)
library(magrittr)
# reshape df2 to long form: one row per category, fields in V2, V3, V4
long <- setDT(df2)[, strsplit(cats, "[|]"), by = id][
  , c(.(id = id), tstrsplit(V1, ","))][
    , V3 := factor(V3, levels = 1:10)]
# reshape back to wide form, join the frequency sums, fix ids and column names
df1 <- dcast(long, id ~ V3, function(x) pmax(1, length(x)), 
             value.var = "V3", drop = FALSE, fill = 0)[
               long[, sum(as.integer(V4)), by = id], on = "id", freq := V1][
                 , id := stringr::str_replace(id, "cats$", "txt")][
                   , setnames(.SD, names(.SD)[-1], paste0("1_", names(.SD)[-1]))]
df1

        id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
1: 111.txt   0   0   0   0   0   0   1   0   0    0      1
2: 112.txt   1   0   1   0   0   0   0   0   0    0      4
3: 113.txt   0   0   0   0   0   1   0   0   0    1      3
4: 114.txt   0   0   0   1   0   0   0   0   0    0      2
5: 115.txt   0   0   0   0   1   0   0   0   0    0      1
6: 116.txt   1   0   0   0   0   0   0   1   0    0      3

Explanation

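After coercion to data.table, df2 is reshaped from the "stringified" wide format to long form by first splitting the cats column at "|" and then splitting the comma-separated parts into separate columns V2 to V4.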
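The long form can also be reproduced in base R, which may help to see what the intermediate table looks like. This is a sketch only: the answer itself uses data.table, the two input rows are taken from the sample data, and the column names V2 to V4 are set by hand to mirror the answer.

```r
ids  <- c("112.cats", "113.cats")
cats <- c("1,1,2|1,3,2", "1,10,1|1,6,2")

# One list element per id, each holding that id's categories
per_id <- strsplit(cats, "|", fixed = TRUE)

# Split every category at the commas and stack the pieces row by row
fields <- do.call(rbind, strsplit(unlist(per_id), ",", fixed = TRUE))

# Repeat each id once per category so the rows line up
long <- data.frame(id = rep(ids, lengths(per_id)), fields,
                   stringsAsFactors = FALSE)
names(long)[-1] <- c("V2", "V3", "V4")
long
```

Each id now appears once per category, with the category number in V3 and the frequency in V4.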

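Then V3 is converted from character to factor so that the order of the columns is preserved when dcast() is called to reshape from long back to wide format. As the OP has requested that a 1 be shown if at least one combination is present, the custom function definition function(x) pmax(1, length(x)) has to be used here instead of simply length. In an update join, the sum of the frequencies is appended as column freq. Finally, "cats" is replaced by "txt" in the id column, and the column names (except the id column) are prefixed with "1_".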
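In the sample data each combination occurs at most once per id. As a side note, here is a small sketch of how pmax() and pmin() treat larger counts, in case a combination could be repeated within one id (an assumption that goes beyond the data shown):

```r
counts <- c(0L, 1L, 2L)

# pmax() returns the larger of 1 and the count, so 0 becomes 1 and 2 stays 2
pmax(1, counts)  # 1 1 2

# pmin() caps the count at 1, which gives a 0/1 indicator
pmin(1, counts)  # 0 1 1
```

If repeated combinations are possible, pmin() may therefore be the safer choice for a strict 0/1 result.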

Data

df2 <- data.table::fread("id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1", data.table = FALSE)

Tags: r, gsub, stringr, dplyr
