Fixing a messy dataframe with tidyr in R
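I have a dataset of household observations; within each household there are individuals, and the number of people differs from household to household. Households are identified by id, and household members are identified by the order in which they were interviewed. So if household 1 has 4 members, the variable id is the same for all of them, but the variable order goes from 1 to 4. The problem I have is that for some variables, only the first household member answered on behalf of the others, so my dataset is a mix of long and wide formats.
What I need to do is assign the values answered by the first household member to the corresponding household members. To explain my data structure further, here is a toy example: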
df <- data.frame( id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5),
age = c(54,20,23,17, 60,57,28,33,19),
educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
educDebt1t = c("student loan", NA,NA,NA,
"student loan", NA, NA, NA, NA),
educDebt2t = c("student fund", NA, NA, NA,
"bank credit", NA, NA, NA, NA),
educdebt3t = c(NA, NA, NA, NA,
"bank credit", NA, NA, NA, NA),
educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),
bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t = c("car loan", NA,NA,NA,
"consumer loan", NA, NA, NA, NA),
bankDebt2t = c("car loan", NA, NA, NA,
"car loan", NA, NA, NA, NA),
bankdebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA))
I am only showing some of the columns so as not to clutter the page.
id order age educDebt1 educDebt2 educDebt3   educDebt1t   educDebt2t  educdebt3t
 1     1  54         1         3        NA student loan student fund          NA
 1     2  20        NA        NA        NA           NA           NA          NA
 1     3  23        NA        NA        NA           NA           NA          NA
 1     4  17        NA        NA        NA           NA           NA          NA
 2     1  60         3         5         4 student loan  bank credit bank credit
 2     2  57        NA        NA        NA           NA           NA          NA
 2     3  28        NA        NA        NA           NA           NA          NA
 2     4  33        NA        NA        NA           NA           NA          NA
 2     5  19        NA        NA        NA           NA           NA          NA
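In the toy example above I have a household-level variable, id, and individual-level variables: order, which is the order of the individual within the household, and age, which is their age. The other variables correspond to debts. For each type of debt, a household can report up to three debts. In this case there are two types of debt, education debt (educDebt) and bank debt (bankDebt); only one type is shown above.
Within each household, only the member with order == 1 answers on behalf of the other members. In educDebt1 to educDebt3, the value identifies the household member who has the debt. So the first row says that in household 1, household member 1 has an education debt, and so does household member 3. Then educDebt1t to educDebt3t say which kind of debt each of those members has. In household 2, three people have debts: household members 3, 5 and 4.
Then there is the other kind of debt, bank debt, which follows the same logic.
What I want to accomplish is to have every household member and their debts in tidy form, like this: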
id order age     educDebt educDebt_r      bankDebt bankDebt_r
 1     1  54 student loan        yes      car loan        yes
 1     2  20           NA         NA            NA         NA
 1     3  23 student fund         no            NA         NA
 1     4  17           NA         NA      car loan         no
 2     1  60           NA         NA            NA         NA
 2     2  57           NA         NA      car loan         no
 2     3  28 student loan         no consumer loan        yes
 2     4  33  bank credit        yes            NA         NA
 2     5  19  bank credit         no            NA         NA
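To achieve this, I actually split the data into separate tables: one containing the first three variables, and one for each type of debt. For the debt tables, I kept only the rows of the responding member and reshaped the data to long format, so that each row is a household member. Then I merged the tables by household and household member id. But there are many debt types, and my approach is very inefficient. Is there a way to achieve the same result with the tidyr package?
My approach was as follows:
First, I created three data frames, extracting a different set of column indices for every row. I did this with for loops.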
newdf1 <- data.frame()
ind <- c(1,seq(4,19, 3))
for(j in 1:nrow(df)){
  fila <- c()
  for(i in 1:length(ind)){
    dato <- as.character(df[j,ind[i]])
    fila <- c(fila, dato)
  }
  newdf1 <- rbind(newdf1, fila, stringsAsFactors = FALSE )
}
newdf2 <- data.frame()
ind <- c(1,seq(5,20, 3))
for(j in 1:nrow(df)){
  fila <- c()
  for(i in 1:length(ind)){
    dato <- as.character(df[j,ind[i]])
    fila <- c(fila, dato)
  }
  newdf2 <- rbind(newdf2, fila, stringsAsFactors = FALSE )
}
newdf3 <- data.frame()
ind <- c(1,seq(6,21, 3))
for(j in 1:nrow(df)){
  fila <- c()
  for(i in 1:length(ind)){
    dato <- as.character(df[j,ind[i]])
    fila <- c(fila, dato)
  }
  newdf3 <- rbind(newdf3, fila, stringsAsFactors = FALSE )
}
Then I bound them together:
NewDfs <- rbind(newdf1,setNames(newdf2, names(newdf1)),
setNames(newdf3, names(newdf1)))
names(NewDfs ) <- c("id", "order", "educDebt", "educDebt_r",
"order", "bankDebt", "bankDebt_r")
From this dataframe, I extracted the education debts into one dataframe and the bank debts into another, kept only the complete cases, and merged them together by id and order.
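educ <- NewDfs[, c(1:4)]
bank <- NewDfs[, c(1, 5:7)]
educ <- educ[complete.cases(educ), ]
bank <- bank[complete.cases(bank), ]
# merge the two debt tables by id and order, keeping members found in either one
educ_bank <- merge(educ, bank, by = c("id", "order"), all = TRUE)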
I also created a data frame with the first three columns of the original dataset.
df_household <- df[,1:3]
And I merged it with the educ_bank data frame.
dfMerged <- merge(df_household, educ_bank, by = c("id", "order"), all.x = TRUE)
id order age     educDebt educDebt_r      bankDebt bankDebt_r
 1     1  54 student loan        yes      car loan        yes
 1     2  20         <NA>       <NA>          <NA>       <NA>
 1     3  23 student fund         no          <NA>       <NA>
 1     4  17         <NA>       <NA>      car loan         no
 2     1  60         <NA>       <NA>          <NA>       <NA>
 2     2  57         <NA>       <NA>      car loan         no
 2     3  28 student loan         no consumer loan        yes
 2     4  33  bank credit        yes          <NA>       <NA>
 2     5  19  bank credit         no          <NA>       <NA>
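Obviously this does not look like the most direct approach, and I would like to know whether there is a simpler way to achieve the same result with tidyr.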
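I don't have a purely tidyr (and dplyr) solution, but perhaps someone more familiar with those packages can improve on it. (There is also room to bring in more of the tidyverse, specifically purrr, to replace some of the base R code, but I didn't think it necessary.) The complete code is in the TL;DR at the bottom.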
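Data
First, I think some of your columns were misnamed (lowercase "debt"), so I fixed that; it is not strictly necessary, but it makes some things much easier. I also disabled factors, since some operations (on debt, below) require strings. If having factors is important, I suggest you re-factor after this process.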
df <- data.frame(
id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5),
age = c(54,20,23,17, 60,57,28,33,19),
educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
educDebt1t = c("student loan", NA,NA,NA, "student loan", NA, NA, NA, NA),
educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),
bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t = c("car loan", NA,NA,NA, "consumer loan", NA, NA, NA, NA),
bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA),
stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)
Step-through
Ultimately we will be merging age back in, and since every respondent is identified by id and order, we set those three columns aside:
maintbl <- select(df, id, order, age)
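The first thing to realize (for me) was that you need to convert from wide to tall, but separately for each group of three columns. I'll start with the first group of three: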
grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
  gather(debt, order, -id) %>%
  filter(! is.na(order)) %>%
  arrange(id, order)
#   id      debt order
# 1  1 educDebt1     1
# 2  1 educDebt2     3
# 3  2 educDebt1     3
# 4  2 educDebt3     4
# 5  2 educDebt2     5
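(Side note: the reason I'm using grp will become apparent later.)
(Side note 2: I'm using the regex [0-9]+ to match one or more digits, in case this gets extended to more than 9 debts or to "arbitrary" numbering. Feel free to omit the +.)
That looks good. We now need to combine this with the *t variants of these three columns: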
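As a quick check of what those patterns select, here is a minimal sketch using base grep() on a hand-written vector of names. The name educDebt10 is hypothetical and only there to exercise the +. Note that grep() is case-sensitive by default, whereas dplyr's matches() ignores case unless told otherwise.

```r
# Column names as in the toy data, plus a hypothetical two-digit one
cols <- c("educDebt1", "educDebt1t", "educDebt1t_r", "educDebt10", "bankDebt1")
grp <- "educDebt"

# "$" anchors the match at the end of the name, so the "t" and "t_r"
# variants are excluded; "+" lets a multi-digit suffix such as 10 match too
person_cols <- grep(paste0(grp, "[0-9]+$"), cols, value = TRUE)
type_cols   <- grep(paste0(grp, "[0-9]+t$"), cols, value = TRUE)

print(person_cols)
print(type_cols)
```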
select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
  gather(debt, type, -id) %>%
  filter(! is.na(type)) %>%
  mutate(debt = gsub("t$", "", debt))
#   id      debt         type
# 1  1 educDebt1 student loan
# 2  2 educDebt1 student loan
# 3  1 educDebt2 student fund
# 4  2 educDebt2  bank credit
# 5  2 educDebt3  bank credit
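I changed debt to drop the trailing t, because I will use it as a merge column later. I do the same thing for the third group of three "educDebt" columns, the t_r columns.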
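To illustrate why the trailing t is stripped, and why the lowercase column names were worth fixing, here is a small sketch in base R. The vector debt is written out by hand and deliberately keeps one of the original lowercase names.

```r
# Names as they look after gather(); the last one keeps the lowercase typo
debt <- c("educDebt1t", "educDebt2t", "educdebt3t")

# "t$" only removes a "t" at the very end, so the "t" in "Debt" is untouched
key <- gsub("t$", "", debt)
print(key)

# Joins compare strings exactly, so the typo would never match "educDebt3"
print(key == c("educDebt1", "educDebt2", "educDebt3"))
```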
These three tables need to be merged, so here I put them in a list and Reduce them:
Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
  list(
    select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
      gather(debt, order, -id) %>%
      filter(! is.na(order)) %>%
      arrange(id, order),
    select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
      gather(debt, type, -id) %>%
      filter(! is.na(type)) %>%
      mutate(debt = gsub("t$", "", debt)),
    select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
      gather(debt, r, -id) %>%
      filter(! is.na(r)) %>%
      mutate(debt = gsub("t_r$", "", debt))
  ))
#   id      debt order         type   r
# 1  1 educDebt1     1 student loan yes
# 2  1 educDebt2     3 student fund  no
# 3  2 educDebt1     3 student loan  no
# 4  2 educDebt3     4  bank credit yes
# 5  2 educDebt2     5  bank credit  no
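For readers who have not used Reduce before, the following sketch shows the folding pattern on its own. It uses made-up tables a, b and c3 and base merge() in place of left_join(), so it runs without any packages.

```r
# Three small tables sharing the key column "k" (made-up data)
a  <- data.frame(k = c(1, 2), x = c("a1", "a2"))
b  <- data.frame(k = c(1, 2), y = c("b1", "b2"))
c3 <- data.frame(k = c(1, 2), z = c("c1", "c2"))

# Reduce(f, list(a, b, c3)) evaluates f(f(a, b), c3),
# so any number of tables can be joined with one two-argument function
merged <- Reduce(function(x, y) merge(x, y, by = "k"), list(a, b, c3))
print(merged)
```

The same call keeps working if more tables are added to the list, which is why it suits a varying number of column families.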
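I need to rename the last two columns, and since the type and r columns are now merged in, I can drop debt. (I would normally suggest dplyr::rename_, but since it is being deprecated I do it by hand. If you have many more columns than shown here, you may need to adjust the column numbers, etc.)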
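If adjusting column numbers by hand is a concern, one possible alternative is to look the columns up by name. This is only a sketch: ret here is a two-row stand-in for the joined table shown above, not the real intermediate result.

```r
grp <- "educDebt"

# Stand-in for the joined table (two rows only)
ret <- data.frame(id = c(1, 1), debt = c("educDebt1", "educDebt2"),
                  order = c(1, 3), type = c("student loan", "student fund"),
                  r = c("yes", "no"), stringsAsFactors = FALSE)

# Find the columns by name so the rename does not depend on their position
names(ret)[match(c("type", "r"), names(ret))] <- c(grp, paste0(grp, "_r"))
ret$debt <- NULL  # the merge key is no longer needed

print(names(ret))
```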
Finally, we need to do this for each of "educDebt" and "bankDebt", join the results by id and order (using another Reduce), and lastly merge age back in.
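One thing worth checking in that join step is the join type. A left join keeps only the keys present in the left-hand table, so a member who has a bank debt but no education debt would be lost if the education table comes first. The sketch below shows the difference with base merge() on made-up tables; all.x = TRUE is roughly what left_join() does, and all = TRUE roughly what full_join() does.

```r
# Made-up per-type tables: member 4 has only a bank debt
educ <- data.frame(id = c(1, 1), order = c(1, 3),
                   educDebt = c("student loan", "student fund"))
bank <- data.frame(id = c(1, 1), order = c(1, 4),
                   bankDebt = c("car loan", "car loan"))

# Left join: keys that are missing from educ are dropped
left <- merge(educ, bank, by = c("id", "order"), all.x = TRUE)
# Full join: keys from both tables are kept
full <- merge(educ, bank, by = c("id", "order"), all = TRUE)

print(left$order)
print(full$order)
```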
TL;DR
# full_join (rather than left_join) keeps members who have only one type of debt,
# e.g. id 1 / order 4, who has a bank debt but no education debt
Reduce(function(x,y) full_join(x, y, by = c("id", "order")),
  lapply(c("educDebt", "bankDebt"), function(grp) {
    # one table per column family (person, type, r), joined on id and debt slot
    ret <- Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
      list(
        select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
          gather(debt, order, -id) %>%
          filter(! is.na(order)) %>%
          arrange(id, order),
        select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
          gather(debt, type, -id) %>%
          filter(! is.na(type)) %>%
          mutate(debt = gsub("t$", "", debt)),
        select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
          gather(debt, r, -id) %>%
          filter(! is.na(r)) %>%
          mutate(debt = gsub("t_r$", "", debt))
      ))
    # columns 4 and 5 are type and r; name them after the debt type
    names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
    select(ret, -debt)
  })
) %>%
  left_join(maintbl, ., by = c("id", "order"))
#   id order age     educDebt educDebt_r      bankDebt bankDebt_r
# 1  1     1  54 student loan        yes      car loan        yes
# 2  1     2  20         <NA>       <NA>          <NA>       <NA>
# 3  1     3  23 student fund         no          <NA>       <NA>
# 4  1     4  17         <NA>       <NA>      car loan         no
# 5  2     1  60         <NA>       <NA>          <NA>       <NA>
# 6  2     2  57         <NA>       <NA>      car loan         no
# 7  2     3  28 student loan         no consumer loan        yes
# 8  2     4  33  bank credit        yes          <NA>       <NA>
# 9  2     5  19  bank credit         no          <NA>       <NA>
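For comparison, the same reshaping can probably be expressed with the newer pivot_longer() and pivot_wider() verbs. The following is only a sketch and not the approach used above. It assumes a tidyr version recent enough to provide those functions and the names_glue argument, and it uses a cut-down copy of the toy data (household 1 only, with the corrected column names). The suffixes _who and _type and the intermediate names resp, long, wide and result are made up for the illustration. It also assumes that no member has two debts of the same type.

```r
library(dplyr)
library(tidyr)

# Cut-down toy data: household 1 only, two debt slots per type
df <- data.frame(
  id = rep(1, 4), order = 1:4, age = c(54, 20, 23, 17),
  educDebt1 = c(1, NA, NA, NA), educDebt2 = c(3, NA, NA, NA),
  educDebt1t = c("student loan", NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA),
  educDebt1t_r = c("yes", NA, NA, NA), educDebt2t_r = c("no", NA, NA, NA),
  bankDebt1 = c(1, NA, NA, NA), bankDebt2 = c(4, NA, NA, NA),
  bankDebt1t = c("car loan", NA, NA, NA), bankDebt2t = c("car loan", NA, NA, NA),
  bankDebt1t_r = c("yes", NA, NA, NA), bankDebt2t_r = c("no", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Only the respondent row (order == 1) carries the debt information
resp <- df[df$order == 1, setdiff(names(df), c("order", "age"))]

# Give every column family an explicit suffix so one pattern can split the names
names(resp) <- sub("([0-9]+)$", "\\1_who", names(resp))    # educDebt1    -> educDebt1_who
names(resp) <- sub("([0-9]+)t$", "\\1_type", names(resp))  # educDebt1t   -> educDebt1_type
names(resp) <- sub("([0-9]+)t_r$", "\\1_r", names(resp))   # educDebt1t_r -> educDebt1_r

# One row per household, debt type and slot; who/type/r become columns
long <- pivot_longer(resp, -id,
                     names_to = c("grp", "num", ".value"),
                     names_pattern = "^(.*Debt)([0-9]+)_(who|type|r)$")

# One row per indebted member, one pair of columns per debt type
wide <- long %>%
  mutate(who = as.integer(who)) %>%
  filter(!is.na(who)) %>%
  select(-num) %>%
  pivot_wider(names_from = grp, values_from = c(type, r),
              names_glue = "{grp}_{.value}")

# Attach the debts to the members they belong to
result <- df %>%
  select(id, order, age) %>%
  left_join(wide, by = c("id", "order" = "who")) %>%
  rename(educDebt = educDebt_type, bankDebt = bankDebt_type)

print(result)
```

The design difference from the Reduce version is that the debt types are handled by the regular expression rather than by an lapply over type names, so adding a third type would not require changing the code as long as its columns follow the same naming scheme.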