Gather multiple columns with nested, repeated measures
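I have a data set in which people (pid) of different types (type2 = c("dad", "mom", "kid"); type = c("a", "b", "c") for convenience) are nested within households (hid), with repeated measurements over time (time).
- Some variables, like v1_, were asked of everyone, but the values are spread across three columns. For example, v1_a holds the values for all dads (type == "a").
- Variables like v2_ were asked only of dads and moms (a and b), with the values spread across two columns.
- Variables like v3 were also asked only of dads and moms, but the values are held in a single column.
- Variables like v4 were asked of everyone, and the values are all in a single column.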
Have:
   hid pid type type2 time v1_a v1_b v1_c v2_a v2_b v3 v4
 1   1   1    a   dad    1    6   NA   NA    2   NA  4  3
 2   1   2    b   mom    1   NA    2   NA   NA    5  6  6
 3   1   3    c   kid    1   NA   NA    1   NA   NA NA  5
 4   2   4    a   dad    1    3   NA   NA    6   NA  2  6
 5   2   5    b   mom    1   NA    5   NA   NA    2  4  3
 6   2   6    c   kid    1   NA   NA    3   NA   NA NA  5
 7   1   1    a   dad    2    3   NA   NA    2   NA  4  3
 8   1   2    b   mom    2   NA    3   NA   NA    5  6  6
 9   1   3    c   kid    2   NA   NA    2   NA   NA NA  5
10   2   4    a   dad    2    2   NA   NA    6   NA  2  6
11   2   5    b   mom    2   NA    3   NA   NA    2  4  3
12   2   6    c   kid    2   NA   NA    2   NA   NA NA  5
This is the end result I want:
   hid pid type type2 time v1 v2 v3 v4
 1   1   1    a   dad    1  6  2  4  3
 2   1   2    b   mom    1  2  5  6  6
 3   1   3    c   kid    1  1 NA NA  5
 4   2   4    a   dad    1  3  6  2  6
 5   2   5    b   mom    1  5  2  4  3
 6   2   6    c   kid    1  3 NA NA  5
 7   1   1    a   dad    2  3  2  4  3
 8   1   2    b   mom    2  3  5  6  6
 9   1   3    c   kid    2  2 NA NA  5
10   2   4    a   dad    2  2  6  2  6
11   2   5    b   mom    2  3  2  4  3
12   2   6    c   kid    2  2 NA NA  5
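I'm looking for a tidyverse approach that scales to a larger real-world use case with a mix of variables like the ones shown here. The variable naming is consistent. Where do I go after gather()?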
library(tidyverse)
df_have <- data.frame(
  hid   = c(1, 1, 1, 2, 2, 2,
            1, 1, 1, 2, 2, 2),
  pid   = c(1, 2, 3, 4, 5, 6,
            1, 2, 3, 4, 5, 6),
  type  = c("a", "b", "c", "a", "b", "c",
            "a", "b", "c", "a", "b", "c"),
  type2 = c("dad", "mom", "kid", "dad", "mom", "kid",
            "dad", "mom", "kid", "dad", "mom", "kid"),
  time  = c(1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2),
  v1_a  = c(6, NA, NA, 3, NA, NA,
            3, NA, NA, 2, NA, NA),
  v1_b  = c(NA, 2, NA, NA, 5, NA,
            NA, 3, NA, NA, 3, NA),
  v1_c  = c(NA, NA, 1, NA, NA, 3,
            NA, NA, 2, NA, NA, 2),
  v2_a  = c(2, NA, NA, 6, NA, NA,
            2, NA, NA, 6, NA, NA),
  v2_b  = c(NA, 5, NA, NA, 2, NA,
            NA, 5, NA, NA, 2, NA),
  v3    = c(4, 6, NA, 2, 4, NA,
            4, 6, NA, 2, 4, NA),
  v4    = c(3, 6, 5, 6, 3, 5,
            3, 6, 5, 6, 3, 5)
)
df_want <- data.frame(
  hid   = c(1, 1, 1, 2, 2, 2,
            1, 1, 1, 2, 2, 2),
  pid   = c(1, 2, 3, 4, 5, 6,
            1, 2, 3, 4, 5, 6),
  type  = c("a", "b", "c", "a", "b", "c",
            "a", "b", "c", "a", "b", "c"),
  type2 = c("dad", "mom", "kid", "dad", "mom", "kid",
            "dad", "mom", "kid", "dad", "mom", "kid"),
  time  = c(1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2),
  v1    = c(6, 2, 1, 3, 5, 3,
            3, 3, 2, 2, 3, 2),
  v2    = c(2, 5, NA, 6, 2, NA,
            2, 5, NA, 6, 2, NA),
  v3    = c(4, 6, NA, 2, 4, NA,
            4, 6, NA, 2, 4, NA),
  v4    = c(3, 6, 5, 6, 3, 5,
            3, 6, 5, 6, 3, 5)
)
df_have %>%
  gather(key, value, -hid, -pid, -type, -type2, -time)
The following gets me there, but the filter(!is.na(value)) step seems clunky. Any better ideas?
df_test <-
  df_have %>%
  gather(key, value, -hid, -pid, -type, -time, -type2) %>%
  mutate(key = str_replace(key, "_.*", "")) %>%
  filter(!is.na(value)) %>%
  spread(key, value) %>%
  arrange(time, hid, type, pid)
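For context on why the NA rows have to go before spread(): once the suffix is stripped, v1_a, v1_b and v1_c all become the key v1, so each person ends up with three v1 rows and spread() cannot build a single cell from them. A minimal sketch on a made-up three-row frame (the name long and its contents are only for illustration, not part of the original data):

```r
library(tidyr)

# One person (pid 1) after gather() + suffix stripping:
# three rows share the key "v1", only one of them carries a value
long <- data.frame(
  pid   = c(1, 1, 1),
  key   = c("v1", "v1", "v1"),
  value = c(6, NA, NA)
)

# With the NA rows still present, spread() raises an error
# because the (pid, key) combination is not unique
res <- tryCatch(spread(long, key, value), error = function(e) e)
inherits(res, "error")

# Dropping the NA rows first leaves one row per (pid, key), so spread() works
ok <- spread(long[!is.na(long$value), ], key, value)
ok
```

So the NA filter is doing real work here rather than being cosmetic; the question is only where it is cleanest to put it.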
Update from @www:
df_test <-
  df_have %>%
  gather(key, value, -hid, -pid, -type, -time, -type2, na.rm=TRUE) %>%
  mutate(key = str_replace(key, "_.*", "")) %>%
  spread(key, value) %>%
  arrange(time, hid, type, pid)
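In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the pipeline above might look like the sketch below. It assumes every measure column starts with "v" and no id column does. df_have is rebuilt compactly so the block runs on its own, and base sub() stands in for str_replace().

```r
library(dplyr)
library(tidyr)

# Same data as df_have in the question, written compactly
df_have <- data.frame(
  hid   = rep(c(1, 1, 1, 2, 2, 2), 2),
  pid   = rep(c(1, 2, 3, 4, 5, 6), 2),
  type  = rep(c("a", "b", "c"), 4),
  type2 = rep(c("dad", "mom", "kid"), 4),
  time  = rep(c(1, 2), each = 6),
  v1_a  = c(6, NA, NA, 3, NA, NA, 3, NA, NA, 2, NA, NA),
  v1_b  = c(NA, 2, NA, NA, 5, NA, NA, 3, NA, NA, 3, NA),
  v1_c  = c(NA, NA, 1, NA, NA, 3, NA, NA, 2, NA, NA, 2),
  v2_a  = c(2, NA, NA, 6, NA, NA, 2, NA, NA, 6, NA, NA),
  v2_b  = c(NA, 5, NA, NA, 2, NA, NA, 5, NA, NA, 2, NA),
  v3    = c(4, 6, NA, 2, 4, NA, 4, 6, NA, 2, 4, NA),
  v4    = c(3, 6, 5, 6, 3, 5, 3, 6, 5, 6, 3, 5)
)

df_test <- df_have %>%
  # Stack all measure columns, dropping the NA cells as we go
  pivot_longer(starts_with("v"), names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  # Strip the "_a" / "_b" / "_c" suffix so v1_a, v1_b, v1_c all become v1
  mutate(key = sub("_.*", "", key)) %>%
  # One column per stripped name; missing combinations are filled with NA
  pivot_wider(names_from = key, values_from = value) %>%
  arrange(time, hid, type, pid)

df_test
```

The result is a tibble rather than a plain data.frame; wrap it in as.data.frame() if that matters downstream.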
Here is another idea using coalesce from dplyr and map from purrr.
library(tidyverse)
# Set target column names
cols <- paste0("v", 1:4)
# Coalesce the numbers based on column names
nums <- map(cols, ~coalesce(!!!as.list(df_have %>% select(starts_with(.x)))))
# Create a data frame
nums_df <- nums %>%
  setNames(cols) %>%
  as_tibble()
# Create the final output by bind_cols
df_test <- df_have %>%
  select(-starts_with("v")) %>%
  bind_cols(nums_df)
df_test
#    hid pid type type2 time v1 v2 v3 v4
#  1   1   1    a   dad    1  6  2  4  3
#  2   1   2    b   mom    1  2  5  6  6
#  3   1   3    c   kid    1  1 NA NA  5
#  4   2   4    a   dad    1  3  6  2  6
#  5   2   5    b   mom    1  5  2  4  3
#  6   2   6    c   kid    1  3 NA NA  5
#  7   1   1    a   dad    2  3  2  4  3
#  8   1   2    b   mom    2  3  5  6  6
#  9   1   3    c   kid    2  2 NA NA  5
# 10   2   4    a   dad    2  2  6  2  6
# 11   2   5    b   mom    2  3  2  4  3
# 12   2   6    c   kid    2  2 NA NA  5
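To see what coalesce is doing for a single target column, here is a minimal sketch on three short vectors shaped like the first household's v1_a, v1_b and v1_c at time 1 (the vectors are typed in by hand for illustration). It takes, position by position, the first non-missing value, which works here because each person has a value in at most one of the split columns.

```r
library(dplyr)

v1_a <- c(6, NA, NA)   # dads
v1_b <- c(NA, 2, NA)   # moms
v1_c <- c(NA, NA, 1)   # kids

# Position by position, take the first non-missing value
v1 <- coalesce(v1_a, v1_b, v1_c)
v1
# [1] 6 2 1

# Same call with the pieces held in a list, spliced in with !!!
parts <- list(v1_a, v1_b, v1_c)
v1_spliced <- coalesce(!!!parts)
identical(v1, v1_spliced)
# [1] TRUE
```

The map call in the answer does the same thing once per prefix in cols, with the list of pieces coming from select(starts_with(.x)). For v3 and v4 the list has a single element, so coalesce simply returns that column unchanged.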