Repeating a value within each ID when there are multiple value options in R

Question

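I have a dataset in R that contains multiple height observations within different IDs. Some IDs have several different height measurements, while others have only one. For most observations/rows within each ID, the height value is missing (coded as NA). I want to create a new variable that takes the first available height measurement for each ID and repeats it across all rows/observations of that ID (different IDs have different total numbers of rows). I have tried using fill and mutate, among other commands, but I am struggling to get it to work.

Currently my data looks like this: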

data = data.frame(id = c(1,1,1,2,2,3,3,3,3), 
                 height = c(150, NA, NA, NA, 148, NA, 152, 151, NA))

#   id height
# 1  1    150
# 2  1     NA
# 3  1     NA
# 4  2     NA
# 5  2    148
# 6  3     NA
# 7  3    152
# 8  3    151
# 9  3     NA

Ideally, I would like to add a variable (height_filled) so that the data looks like this:

data = data.frame(id = c(1,1,1,2,2,3,3,3,3),
                  height = c(150, NA, NA, NA, 148, NA, 152, 151, NA),
                  height_filled = c(150, 150, 150, 148, 148, 152, 152, 152, 152))

#   id height height_filled
# 1  1    150           150
# 2  1     NA           150
# 3  1     NA           150
# 4  2     NA           148
# 5  2    148           148
# 6  3     NA           152
# 7  3    152           152
# 8  3    151           152
# 9  3     NA           152

Any help is greatly appreciated!

Answer 1

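Either we can group by 'id', arrange by 'id' and by the NAs in 'height' (so that non-missing values come first within each id), and use cummax. Note that this reorders the rows, and it matches the first available value only when that value is also the largest in its id.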

library(dplyr)
data %>%
   group_by(id) %>%
   arrange(id, is.na(height)) %>% 
   mutate(height_filled = cummax(replace(height, is.na(height), 0))) %>%
   ungroup

-output

# A tibble: 9 x 3
#     id height height_filled
#  <dbl>  <dbl>         <dbl>
#1     1    150           150
#2     1     NA           150
#3     1     NA           150
#4     2    148           148
#5     2     NA           148
#6     3    152           152
#7     3    151           152
#8     3     NA           152
#9     3     NA           152

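Or use max on 'height' grouped by 'id'. This keeps the original row order, but it returns the largest height per id rather than the first; the two coincide for this data.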

data %>%
   group_by(id) %>%
   mutate(height_filled = max(height, na.rm = TRUE)) %>%
   ungroup

-output

# A tibble: 9 x 3
#     id height height_filled
#  <dbl>  <dbl>         <dbl>
#1     1    150           150
#2     1     NA           150
#3     1     NA           150
#4     2     NA           148
#5     2    148           148
#6     3     NA           152
#7     3    152           152
#8     3    151           152
#9     3     NA           152
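
To see where max and the first available value part ways, here is a small sketch on made-up data (data2 and the columns height_max and height_first are hypothetical, not from the question), in which the first measurement for id 1 is not the largest:

```r
library(dplyr)

# Hypothetical data: for id 1 the first non-missing height (148) is not the largest
data2 <- data.frame(id = c(1, 1, 1, 2, 2),
                    height = c(NA, 148, 152, 150, NA))

result <- data2 %>%
  group_by(id) %>%
  mutate(
    height_max   = max(height, na.rm = TRUE),  # largest value per id
    height_first = height[!is.na(height)][1]   # first non-missing value per id
  ) %>%
  ungroup()

result$height_max    # 152 152 152 150 150
result$height_first  # 148 148 148 150 150
```

If the requirement really is "first available", the second column is the one to keep.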

Answer 2

I would try the following. After grouping by id, use na.omit on height to drop the missing values, then use first to select the first height that remains.

library(dplyr)

data %>%
  group_by(id) %>%
  mutate(height_filled = first(na.omit(height)))

Output

     id height height_filled
  <dbl>  <dbl>         <dbl>
1     1    150           150
2     1     NA           150
3     1     NA           150
4     2     NA           148
5     2    148           148
6     3     NA           152
7     3    152           152
8     3    151           152
9     3     NA           152
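
If your dplyr is version 1.1.0 or later (an assumption worth checking with packageVersion("dplyr")), first() should accept an na_rm argument, so the na.omit step can be folded into the call. A minimal sketch on the question's data, where the name filled is just for illustration:

```r
library(dplyr)

data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                   height = c(150, NA, NA, NA, 148, NA, 152, 151, NA))

filled <- data %>%
  group_by(id) %>%
  # na_rm = TRUE skips missing values before taking the first element
  mutate(height_filled = first(height, na_rm = TRUE)) %>%
  ungroup()

filled$height_filled  # 150 150 150 148 148 152 152 152 152
```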

Answer 3

A data.table option using first + na.omit

library(data.table)
setDT(data)[, height_filled := first(na.omit(height)), id]

gives

   id height height_filled
1:  1    150           150
2:  1     NA           150
3:  1     NA           150
4:  2     NA           148
5:  2    148           148
6:  3     NA           152
7:  3    152           152
8:  3    151           152
9:  3     NA           152
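
Keep in mind that setDT converts data to a data.table by reference. If you would rather leave the original data.frame untouched, one possible sketch is to work on a copy made with as.data.table (the name dt is just for illustration):

```r
library(data.table)

data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                   height = c(150, NA, NA, NA, 148, NA, 152, 151, NA))

# as.data.table() returns a copy, so `data` itself stays a plain data.frame
dt <- as.data.table(data)
dt[, height_filled := first(na.omit(height)), by = id]

dt$height_filled     # 150 150 150 148 148 152 152 152 152
is.data.table(data)  # FALSE
```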

A base R option using ave

transform(
  data,
  height_filled = ave(height, id, FUN = function(x) head(na.omit(x), 1))
)

gives

  id height height_filled
1  1    150           150
2  1     NA           150
3  1     NA           150
4  2     NA           148
5  2    148           148
6  3     NA           152
7  3    152           152
8  3    151           152
9  3     NA           152
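
One thing to watch with the ave version: for an id whose heights are all NA, head(na.omit(x), 1) has length zero, which I would not expect ave to be able to recycle across the group. A sketch of a variant that returns NA for such groups, using made-up data (data3 is hypothetical, not from the question):

```r
# Hypothetical data: id 2 has no height measurement at all
data3 <- data.frame(id = c(1, 1, 2, 2, 3),
                    height = c(NA, 150, NA, NA, 149))

data3 <- transform(
  data3,
  # x[!is.na(x)][1] is the first non-missing value, or NA if there is none
  height_filled = ave(height, id, FUN = function(x) x[!is.na(x)][1])
)

data3$height_filled  # 150 150 NA NA 149
```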
