Fixing a messy data frame with tidyr in R

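I have a data set of household observations; each household contains individuals, and the number of people per household varies. Households are identified by id, and household members by the order in which they were interviewed. So if household 1 has 4 members, the variable id is the same for all of them, while the variable order runs from 1 to 4. My problem is that for some variables only the first household member answered on behalf of the others, so my data set mixes long and wide formats.

What I need to do is assign the values answered by the first household member to the household members they refer to. To explain the structure of my data, here is a toy example: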

  df <- data.frame( id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5), 
              age = c(54,20,23,17, 60,57,28,33,19), 
              educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
              educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA), 
              educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
              educDebt1t = c("student loan", NA,NA,NA, 
                         "student loan", NA, NA, NA, NA),
              educDebt2t = c("student fund", NA, NA, NA, 
                         "bank credit", NA, NA, NA, NA),
              educdebt3t = c(NA, NA, NA, NA, 
                         "bank credit", NA, NA, NA, NA),
              educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
              educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
              educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),

              bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
              bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA), 
              bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
              bankDebt1t = c("car loan", NA,NA,NA, 
                             "consumer loan", NA, NA, NA, NA),
              bankDebt2t = c("car loan", NA, NA, NA, 
                             "car loan", NA, NA, NA, NA),
              bankdebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
              bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
              bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
              bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA))

I show only some of the columns to avoid cluttering the page.

id order age educDebt1 educDebt2 educDebt3   educDebt1t   educDebt2t  educdebt3t
 1     1  54         1         3        NA student loan student fund          NA
 1     2  20        NA        NA        NA           NA           NA          NA
 1     3  23        NA        NA        NA           NA           NA          NA
 1     4  17        NA        NA        NA           NA           NA          NA
 2     1  60         3         5         4 student loan  bank credit bank credit
 2     2  57        NA        NA        NA           NA           NA          NA
 2     3  28        NA        NA        NA           NA           NA          NA
 2     4  33        NA        NA        NA           NA           NA          NA
 2     5  19        NA        NA        NA           NA           NA          NA

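In the toy example above I have a household-level variable, id, and individual-level variables: order is the individual's position within the household and age is their age. The remaining variables describe debts. For each type of debt, a household can report up to three debts. Here there are two types of debt, education debt (educDebt) and bank debt (bankDebt); only one type is shown above.

So in each household, only the member with order == 1 answers for the other members. In educDebt1 to educDebt3, the value identifies the household member who holds the debt. Looking at the first row, household member 1 of household 1 has an education debt, and so does household member 3. Then educDebt1t to educDebt3t tell which kind of debt each of those members has. In household 2, three people have debts: household members 3, 5 and 4.

Then there is the other type of debt, bank debt, which follows the same logic.

What I want to achieve is to have each household member lined up with their own debts, like this: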

id order age      educDebt     educDebt_r      bankDebt     bankDebt_r
 1     1  54   student loan        yes         car loan          yes
 1     2  20        NA              NA               NA           NA
 1     3  23   student fund         no               NA           NA
 1     4  17        NA              NA         car loan           no
 2     1  60        NA              NA               NA           NA
 2     2  57        NA              NA         car loan           no
 2     3  28   student loan         no    consumer loan          yes
 2     4  33    bank credit        yes               NA           NA
 2     5  19    bank credit         no               NA           NA

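To achieve this, I actually split the data into separate tables: one with the first three variables, and one for each type of debt. For the debt tables, I kept only the rows of the member who answered and reshaped the data to long format so that each row is a household member; then I merged the tables by household and household member ID. But there are many types of debt, and my approach is very inefficient. Is there a way to achieve the same result with the tidyr package?

My approach is as follows:

First, I created three data frames, extracting a different set of column indices for each row. I did it with for loops.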

newdf1 <- data.frame()
ind <- c(1,seq(4,19, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf1 <- rbind(newdf1, fila, stringsAsFactors = FALSE )
}

newdf2 <- data.frame()
ind <- c(1,seq(5,20, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf2 <- rbind(newdf2, fila, stringsAsFactors = FALSE )
}

newdf3 <- data.frame()
ind <- c(1,seq(6,21, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf3 <- rbind(newdf3, fila, stringsAsFactors = FALSE )
}

Then I bind them:

NewDfs <- rbind(newdf1,setNames(newdf2, names(newdf1)), 
                            setNames(newdf3, names(newdf1)))

names(NewDfs ) <- c("id", "order", "educDebt", "educDebt_r",
               "order", "bankDebt", "bankDebt_r")

From this data frame, I extracted the education debts into one data frame and the bank debts into another, kept only the complete cases, and merged them together by id and order.

educ <- NewDfs [,c(1:4)]
bank <- NewDfs [,c(1,5:7)]
educ <- educ[complete.cases(educ), ]
bank <- bank[complete.cases(bank), ]
# Assumed step: the post describes this merge by id and order but does not show the code for it.
educ_bank <- merge(educ, bank, by = c("id", "order"), all = TRUE)

I also created a data frame with the first three columns of the original data set.

df_household <- df[,1:3]

And merged it with the educ_bank data frame.

dfMerged <- merge(df_household, educ_bank, by = c("id", "order"), all.x = TRUE)

 id order age     educDebt educDebt_r      bankDebt bankDebt_r
  1     1  54 student loan        yes      car loan        yes
  1     2  20         <NA>       <NA>          <NA>       <NA>
  1     3  23 student fund         no          <NA>       <NA>
  1     4  17         <NA>       <NA>      car loan         no
  2     1  60         <NA>       <NA>          <NA>       <NA>
  2     2  57         <NA>       <NA>      car loan         no
  2     3  28 student loan         no consumer loan        yes
  2     4  33  bank credit        yes          <NA>       <NA>
  2     5  19  bank credit         no          <NA>       <NA>

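Clearly this is not the most direct approach, and I wonder whether there is a simpler way to achieve the same with tidyr.

I don't have a solution that is entirely tidyr (and dplyr), but perhaps someone more familiar with them can help. (There is room to bring in more of the tidyverse, purrr in particular, to replace some of the base R code, but I didn't think it necessary.) I'll walk through the steps first; the complete code is under TL;DR at the bottom.

Data

First, I think some of the column names are wrong (lowercase "debt"), so I fixed them; this is not strictly essential, but it makes some things much easier. I also disabled factors, because some operations (on debt, below) need strings. If having factors is important, I suggest you convert back to factor after this process.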

df <- data.frame(
  id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5), 
  age = c(54,20,23,17, 60,57,28,33,19), 
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA), 
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
  educDebt1t = c("student loan", NA,NA,NA, "student loan", NA, NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
  educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),
  bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
  bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA), 
  bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t = c("car loan", NA,NA,NA, "consumer loan", NA, NA, NA, NA),
  bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
  bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
  bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
  bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA),
  stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)
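As a rough illustration of converting back to factors once the reshaping is finished, here is a minimal base R sketch. The data frame res is a hypothetical stand-in for the final result, not something produced by the code in this answer.

```r
# Hypothetical stand-in for the final, reshaped result.
res <- data.frame(
  id = c(1, 1, 2),
  educDebt = c("student loan", NA, "bank credit"),
  educDebt_r = c("yes", NA, "no"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
chr <- vapply(res, is.character, logical(1))
res[chr] <- lapply(res[chr], factor)

str(res)
```

Numeric columns such as id are left untouched, and NA values stay NA in the resulting factors.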

Walk-through

Eventually we will merge age back in, and since every respondent is identified by id and order, we set these three columns aside:

maintbl <- select(df, id, order, age)

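The first thing to realize (for me) is that you need to convert from wide to tall, but separately for each group of three columns. I'll start with the first group of three: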

grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
  gather(debt, order, -id) %>%
  filter(! is.na(order)) %>%
  arrange(id, order)
#   id      debt order
# 1  1 educDebt1     1
# 2  1 educDebt2     3
# 3  2 educDebt1     3
# 4  2 educDebt3     4
# 5  2 educDebt2     5

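(Side note: the reason I use grp will become apparent later.)
(Side note 2: I use the regular expression [0-9]+ to match one or more digits, in case this expands to more than 9 debts or to "arbitrary" numbering. Feel free to omit the +.)

That looks good. We now need to combine this with the *t variants of these three columns: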
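In current tidyr, gather() is superseded by pivot_longer(). As a sketch only, the same first step might look like this; the cut-down data frame edu and the result name members are made up for the illustration.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the example data: id plus the three educDebt member columns.
edu <- data.frame(
  id = c(rep(1, 4), rep(2, 5)),
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA)
)

members <- edu %>%
  # values_drop_na = TRUE stands in for the separate filter(!is.na(order)) step
  pivot_longer(-id, names_to = "debt", values_to = "order",
               values_drop_na = TRUE) %>%
  arrange(id, order)

members
```

This should give the same five id/debt/order rows as the gather() version, returned as a tibble rather than a plain data frame.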

select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
  gather(debt, type, -id) %>%
  filter(! is.na(type)) %>%
  mutate(debt = gsub("t$", "", debt))
#   id      debt         type
# 1  1 educDebt1 student loan
# 2  2 educDebt1 student loan
# 3  1 educDebt2 student fund
# 4  2 educDebt2  bank credit
# 5  2 educDebt3  bank credit

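I changed debt to remove the trailing t because I will use it as a merge column later. I do the same for the third group of three columns (for "educDebt"), the t_r columns.

These three tables need to be merged, so here I put them in a list and Reduce them: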
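For completeness, the t_r step on its own could be sketched as follows. The cut-down data frame edu_r and the result name r_tbl exist only for this illustration; the pipeline itself mirrors the *t step with the suffix changed.

```r
library(dplyr)
library(tidyr)

grp <- "educDebt"

# Cut-down copy of the example data: id plus the three t_r columns.
edu_r <- data.frame(
  id = c(rep(1, 4), rep(2, 5)),
  educDebt1t_r = c("yes", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt3t_r = c(NA, NA, NA, NA, "yes", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

r_tbl <- select(edu_r, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
  gather(debt, r, -id) %>%
  filter(!is.na(r)) %>%
  # strip the suffix so that debt lines up with the other two tables
  mutate(debt = gsub("t_r$", "", debt))

r_tbl
```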

Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
       list(
         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
           gather(debt, order, -id) %>%
           filter(! is.na(order)) %>%
           arrange(id, order),
         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
           gather(debt, type, -id) %>%
           filter(! is.na(type)) %>%
           mutate(debt = gsub("t$", "", debt)),
         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
           gather(debt, r, -id) %>%
           filter(! is.na(r)) %>%
           mutate(debt = gsub("t_r$", "", debt))
       ))
#   id      debt order         type   r
# 1  1 educDebt1     1 student loan yes
# 2  1 educDebt2     3 student fund  no
# 3  2 educDebt1     3 student loan  no
# 4  2 educDebt3     4  bank credit yes
# 5  2 educDebt2     5  bank credit  no

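I need to rename the last two columns, and now that the type and r columns are merged in, I can drop debt. (I would normally suggest dplyr::rename_, but it is deprecated, so I do it manually. If you have many more columns than shown here, you may need to adjust the column numbers, etc.)

Finally, we need to do this for each of "educDebt" and "bankDebt", join the results by id and order (using another Reduce), and finally merge age back in.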
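A small sketch of the rename-and-drop step, written here against column names rather than positions so that extra columns would not shift anything. The data frame ret is typed in by hand to match the merged table printed earlier; the name-based lookup is a variation for illustration, not the code used in this answer.

```r
library(dplyr)

grp <- "educDebt"

# Hand-typed copy of the merged table for the educDebt group.
ret <- data.frame(
  id = c(1, 1, 2, 2, 2),
  debt = c("educDebt1", "educDebt2", "educDebt1", "educDebt3", "educDebt2"),
  order = c(1, 3, 3, 4, 5),
  type = c("student loan", "student fund", "student loan", "bank credit", "bank credit"),
  r = c("yes", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Rename by matching on the current names instead of using positions 4:5.
names(ret)[names(ret) == "type"] <- grp
names(ret)[names(ret) == "r"] <- paste0(grp, "_r")

# debt was only needed as a join key, so drop it.
ret <- select(ret, -debt)

names(ret)
```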

TL;DR

# The outer join is a full_join: a left_join would drop members who have
# a bank debt but no education debt (id 1 / order 4 and id 2 / order 2).
Reduce(function(x,y) full_join(x, y, by = c("id", "order")),
       lapply(c("educDebt", "bankDebt"), function(grp) {
         ret <- Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
                       list(
                         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
                           gather(debt, order, -id) %>%
                           filter(! is.na(order)) %>%
                           arrange(id, order),
                         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
                           gather(debt, type, -id) %>%
                           filter(! is.na(type)) %>%
                           mutate(debt = gsub("t$", "", debt)),
                         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
                           gather(debt, r, -id) %>%
                           filter(! is.na(r)) %>%
                           mutate(debt = gsub("t_r$", "", debt))
                       ))
         names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
         select(ret, -debt)
       })
       ) %>%
  left_join(maintbl, ., by = c("id", "order"))
#   id order age     educDebt educDebt_r      bankDebt bankDebt_r
# 1  1     1  54 student loan        yes      car loan        yes
# 2  1     2  20         <NA>       <NA>          <NA>       <NA>
# 3  1     3  23 student fund         no          <NA>       <NA>
# 4  1     4  17         <NA>       <NA>      car loan         no
# 5  2     1  60         <NA>       <NA>          <NA>       <NA>
# 6  2     2  57         <NA>       <NA>      car loan         no
# 7  2     3  28 student loan         no consumer loan        yes
# 8  2     4  33  bank credit        yes          <NA>       <NA>
# 9  2     5  19  bank credit         no          <NA>       <NA>
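For readers on a recent tidyr, the whole reshaping can probably be expressed with pivot_longer() and pivot_wider() instead of the nested Reduce calls. The following is only a sketch under some assumptions: it relies on the corrected column names, it assumes each member has at most one debt per debt type, and the intermediate names kind, slot and group, as well as the output column names (educDebt_type, educDebt_r, and so on), are my own choices rather than the names asked for in the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c(rep(1, 4), rep(2, 5)), order = c(1:4, 1:5),
  age = c(54, 20, 23, 17, 60, 57, 28, 33, 19),
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
  educDebt1t = c("student loan", NA, NA, NA, "student loan", NA, NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt1t_r = c("yes", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt3t_r = c(NA, NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
  bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t = c("car loan", NA, NA, NA, "consumer loan", NA, NA, NA, NA),
  bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
  bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t_r = c("yes", NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  bankDebt3t_r = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

long <- df %>%
  select(-order, -age) %>%
  mutate(across(-id, as.character)) %>%          # one common type for all values
  pivot_longer(-id, values_drop_na = TRUE) %>%   # columns: id, name, value
  mutate(
    kind = case_when(grepl("t_r$", name) ~ "r",
                     grepl("[0-9]t$", name) ~ "type",
                     TRUE ~ "order"),
    slot = sub("t(_r)?$", "", name),             # educDebt1t_r -> educDebt1
    group = sub("[0-9]+$", "", slot)             # educDebt1 -> educDebt
  ) %>%
  select(-name)

debts <- long %>%
  pivot_wider(names_from = kind, values_from = value) %>%  # one row per reported debt
  mutate(order = as.integer(order)) %>%
  select(-slot) %>%
  pivot_wider(names_from = group, values_from = c(type, r),
              names_glue = "{group}_{.value}")             # one row per member

result <- df %>%
  select(id, order, age) %>%
  left_join(debts, by = c("id", "order"))

result
```

Because every debt type goes through the same pipeline, adding more debt types should not require any extra code as long as the columns follow the same naming pattern.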