How do I collapse staggered data to one row in R by participant number and timepoint?

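Apologies if this has been asked before; I couldn't find it. I have a dataset in which every survey a participant completes sits on its own row, so there are roughly 10 rows per participant per timepoint. I need one row per participant per timepoint. Here is some test data: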

x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

    time PartNum    event S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1     <NA>   NA   NA       <NA>   NA   NA       <NA>    0  60
2 week_1       1 Survey_1    3    4 2020-03-02   NA   NA       <NA>   NA  NA
3 week_1       1 Survey 2   NA   NA       <NA>    5    3 2020-03-02   NA  NA
4 week_1       2     <NA>   NA   NA       <NA>   NA   NA       <NA>    1  58
5 week_1       2 Survey_1    1    2 2020-03-04   NA   NA       <NA>   NA  NA
6 week_1       2 Survey 2   NA   NA       <NA>    3    2 2020-03-04   NA  NA

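How do I get Survey 1, Survey 2, and the demographics onto a single row for each participant and timepoint? (Note: the test data shows only one timepoint to save space.)

Desired result: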

desired_x <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1),
                        S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"),
                        S2Q1 = c(5,3), S2Q2 = c(3,2),
                        S2date = c("2020-03-02","2020-03-04"),
                        race = c(0,1), age = c(60,58))

    time PartNum S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1    3    4 2020-03-02    5    3 2020-03-02    0  60
2 week_1       2    1    2 2020-03-04    3    2 2020-03-04    1  58

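I've read many answers on this site, but this is my first question. Thank you for your patience and help this time, and for the help you have unknowingly given me in the past.

I think the best way to get what you want is to first write a custom function that returns only the non-NA values, and then use dplyr functions to summarise by time and PartNum. Here is an example using your data: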

##Loading dplyr package##
library(dplyr)

##Example Data## 
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2), event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"), S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA), S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA), S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2), S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"), race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

##Function to return only non-NA values##
fxn<-function(vec){
  out<-vec[!is.na(vec)]
  return(out)
}

##Summarizing the data using the new function##
#We'll want to get rid of the event column, hence the x[,-3]##
DF<-as.data.frame(x[,-3] %>% group_by(time, PartNum) %>% summarise_all(fxn))

##See the results##
DF

##Compare to your desired output##
y <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1), S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"), S2Q1 = c(5,3), S2Q2 = c(3,2), S2date = c("2020-03-02","2020-03-04"), race = c(0,1), age = c(60,58))

y
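
Note that this approach assumes every column has exactly one non-NA value within each time/PartNum group; with zero or several, fxn no longer returns a single value per group. A minimal base R sketch for checking that assumption on the test data before collapsing (the names value_cols, counts and one_value_each are only illustrative, not part of the original answer):

```r
# Test data from the question
x <- data.frame(time = rep("week_1", 6), PartNum = c(1, 1, 1, 2, 2, 2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA, 3, NA, NA, 1, NA), S1Q2 = c(NA, 4, NA, NA, 2, NA),
                S1date = c(NA, "2020-03-02", NA, NA, "2020-03-04", NA),
                S2Q1 = c(NA, NA, 5, NA, NA, 3), S2Q2 = c(NA, NA, 3, NA, NA, 2),
                S2date = c(NA, NA, "2020-03-02", NA, NA, "2020-03-04"),
                race = c(0, NA, NA, 1, NA, NA), age = c(60, NA, NA, 58, NA, NA))

# Every column except the grouping keys and the event column (which gets dropped)
value_cols <- setdiff(names(x), c("time", "PartNum", "event"))

# Count the non-NA values per column within each time/PartNum group
counts <- aggregate(x[value_cols], by = x[c("time", "PartNum")],
                    FUN = function(v) sum(!is.na(v)))

# TRUE when every group has exactly one non-NA value in every column
one_value_each <- all(counts[value_cols] == 1)
one_value_each
```

If this comes back FALSE on the real data, looking at counts should show which participant, timepoint and column break the assumption.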

Good luck! Take care, -Sean

Edit: a simpler version that does not rely on a custom function

Use na.omit to keep only the valid observations (per time/PartNum):

x %>% select(-event) %>% 
  group_by(time, PartNum) %>% 
  summarise_all(na.omit)
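
One thing to keep in mind: na.omit returns a zero-length vector when a column is entirely NA within a group, so the result is no longer one value per group for that column. A possible variation, sketched under the assumption that you would rather keep an NA in that case, uses a small helper. The name first_non_na and the toy data x2 are made up for illustration and are not part of dplyr or the question:

```r
library(dplyr)

# Hypothetical helper: first non-missing value, or a typed NA if there is none
first_non_na <- function(v) {
  keep <- v[!is.na(v)]
  if (length(keep) == 0) v[NA_integer_] else keep[1]
}

# Toy data (assumed for illustration): participant 2 never answered S1Q1
x2 <- data.frame(time = "week_1", PartNum = c(1, 1, 2, 2),
                 S1Q1 = c(NA, 3, NA, NA),
                 age = c(60, NA, 58, NA))

collapsed <- x2 %>%
  group_by(time, PartNum) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

collapsed
```

Here participant 2 stays in the result with S1Q1 set to NA. Taking only the first non-missing value also hides conflicting duplicates, so this is a trade-off rather than a strict improvement.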

Previous version:

The following will solve your problem using dplyr:

x_clean <- x %>%                       # (0)
  select(-event) %>%                   # (1)
  group_by(time, PartNum) %>%          # (2)
  mutate(across(.cols = everything(),  # (3)
                .fns = getmode)) %>% 
  distinct()                           # (4)

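Each step can be read as: 0) take the dataset x, then

  1. remove the variable event from the dataset, then (read %>% as "then")
  2. group by time and PartNum, then
  3. mutate all (grouped) variables and take the mode of each one (per time and PartNum). This replaces the NAs with the most common observation in each group. If you stopped here, you would get duplicate rows for each group, so finally
  4. keep only the distinct rows of the resulting dataset.
  5. The resulting dataset is assigned to x_clean.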
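
To make step 3 more concrete, here is a small base R sketch of what getmode does to a single column within one group. It repeats the getmode helper from the full code so that it runs on its own; the vector s1q1_group and the other variable names are just illustrative:

```r
# Same helper as in the full code: most frequent value, ignoring NA by default
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# S1Q1 for participant 1 in week_1: one real answer and two NA rows
s1q1_group <- c(NA, 3, NA)

# Inside a grouped mutate() the single returned value is recycled to every row
filled <- rep(getmode(s1q1_group), length(s1q1_group))
filled                                   # 3 3 3

# Works for character data as well
mode_chr <- getmode(c("a", "b", "b", NA))
mode_chr                                 # "b"

# On a tie, which.max() picks the value that appears first
mode_tie <- getmode(c(2, 5))
mode_tie                                 # 2
```

Because all rows in a group end up identical after this step, distinct() in step 4 reduces each group to a single row. The tie behaviour is worth knowing: if a group holds two different non-NA answers, the first one silently wins.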

Full code for copy and paste:

## your data.frame
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))


# helper function that works for numeric and character data
# will retrieve the most common value. 
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

## solution 
library(tidyverse)

x_clean <- x %>%                       # (0)
  select(-event) %>%                   # (1)  
  group_by(time, PartNum) %>%          # (2)
  mutate(across(.cols = everything(),  # (3)
                .fns = getmode)) %>% 
  distinct()                           # (4)
x_clean
#> # A tibble: 2 x 10
#> # Groups:   time, PartNum [2]
#>   time   PartNum  S1Q1  S1Q2 S1date      S2Q1  S2Q2 S2date      race   age
#>   <chr>    <dbl> <dbl> <dbl> <chr>      <dbl> <dbl> <chr>      <dbl> <dbl>
#> 1 week_1       1     3     4 2020-03-02     5     3 2020-03-02     0    60
#> 2 week_1       2     1     2 2020-03-04     3     2 2020-03-04     1    58