Number of consecutive NA
My data looks like this:
subject x1 x2 x3 x4 x5 x6 x7
a 0.1 NA 0.2 0.1 0.1 NA 0.9
b NA NA -0.01 NA 0.3 0.8 0.01
c NA NA NA NA NA 0.9 0.4
d NA NA 0.01 NA NA NA 0.05
How can I add a new variable, "the number of MAX consecutive NA", to this data.frame?
subject x1 x2 x3 x4 x5 x6 x7 NA_consecutive
a 0.1 NA 0.2 0.1 0.1 NA 0.9 1
b NA NA -0.01 NA 0.3 0.8 0.01 2 (max NA, not 1!!)
c NA NA NA NA NA 0.9 0.4 5
d NA NA 0.01 NA NA NA 0.05 3 (max NA, not 2!!)
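I want to count the consecutive NAs by subject (i.e., per row).
Naively, I tried using duplicated,
but it flags any repeated value, including ordinary values, not just NA.
If I convert this dataset to "long" format with df %>% gather(variable, value, -subject), I get: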
subject variable value
1 a x1 0.1
2 a x2 NA
3 a x3 0.2
4 a x4 0.1
5 a x5 0.1
6 a x6 NA
7 a x7 0.9
8 b x1 NA
9 b x2 NA
10 b x3 -0.01
..
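Is this form easier to work with?
I don't care about the shape, as long as I get the new information (the MAX consecutive NA).
If possible, avoid a "for loop" (though not strictly), because the data is very large.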
Here is a suggested solution using data.table. I will take it down if the OP only wants a tidyverse solution:
# Count consecutive NAs per subject: melt to long format, run rle() on
# is.na(value), then take the longest run of TRUE (i.e. NA) values
consecNA <- melt(dat, id.vars="subject")[, {
r <- rle(is.na(value))
max(r$lengths[r$values])
}, by=.(subject)]
#perform an update join (i.e. a lookup)
dat[consecNA, NA_consecutive := V1, on=.(subject)]
dat
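To make the rle() step less opaque, here is a small walk-through on a single row (row d of the example data). It is only an illustration of what rle() returns, not part of the solution itself; the vector name x is chosen for this sketch.

```r
# Row d from the question: NA NA 0.01 NA NA NA 0.05
x <- c(NA, NA, 0.01, NA, NA, NA, 0.05)

# rle() compresses the logical vector into runs of identical values
r <- rle(is.na(x))
r$lengths                  # 2 1 3 1
r$values                   # TRUE FALSE TRUE FALSE

# Keep only the runs where the value is TRUE (the NA runs), then take the longest
na_runs <- r$lengths[r$values]
na_runs                    # 2 3
longest <- max(na_runs)
longest                    # 3
```

Indexing r$lengths by r$values is what discards the runs of ordinary values, which is why duplicated() alone could not answer the question.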
Another possible approach:
dat[, NA_cons := apply(.SD, 1, function(x) {
r <- rle(is.na(x))
max(r$lengths[r$values])
}), by=.(subject)]
Or, equivalently, in base R:
dat$NA_cons <- apply(dat[, paste0("x", 1:7)], 1, function(x) {
r <- rle(is.na(x))
max(r$lengths[r$values])
})
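One edge case is worth noting: every row in the example contains at least one NA, but for a row with no NA at all, r$lengths[r$values] is empty and max() of an empty vector returns -Inf with a warning. A minimal sketch of a guard, assuming 0 is the desired answer for such rows (the helper name max_na_run is hypothetical, not from the answers above):

```r
# Hypothetical helper: longest run of NA in a vector, 0 if there is none
max_na_run <- function(x) {
  r <- rle(is.na(x))
  na_lengths <- r$lengths[r$values]
  # Guard against rows without any NA, where max() would return -Inf
  if (length(na_lengths) == 0L) 0L else max(na_lengths)
}

max_na_run(c(0.1, 0.2, 0.3))                      # 0
max_na_run(c(NA, NA, 0.01, NA, NA, NA, 0.05))     # 3
```

The helper can be passed to apply() in place of the anonymous function, e.g. apply(dat[, paste0("x", 1:7)], 1, max_na_run).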
Data:
library(data.table)
dat <- fread("subject x1 x2 x3 x4 x5 x6 x7
a 0.1 NA 0.2 0.1 0.1 NA 0.9
b NA NA -0.01 NA 0.3 0.8 0.01
c NA NA NA NA NA 0.9 0.4
d NA NA 0.01 NA NA NA 0.05")
cols <- paste0("x", 1:7)
dat[, (cols) := lapply(.SD, as.numeric), .SDcols=cols]
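# For each row (dropping the subject column), find the longest run of NA
df$NA_consecutive <- apply(df[-1], 1, function(x) {
  r <- rle(is.na(x))        # compute the run-length encoding once
  max(r$lengths[r$values])  # longest run where is.na() is TRUE
})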
df
# # A tibble: 4 x 9
# subject x1 x2 x3 x4 x5 x6 x7 NA_consecutive
# <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 a 0.1 NA 0.2 0.1 0.1 NA 0.9 1
# 2 b NA NA -0.01 NA 0.3 0.8 0.01 2
# 3 c NA NA NA NA NA 0.9 0.4 5
# 4 d NA NA 0.01 NA NA NA 0.05 3
Data:
df <- data.frame(
subject = c("a", "b", "c", "d"),
x1 = c(.1, rep(NA, 3)),
x2 = rep(NA, 4),
x3 = c(.2, -.01, NA, .01),
x4 = c(.1, rep(NA, 3)),
x5 = c(.1, .3, NA, NA),
x6 = c(NA, .8, .9, NA),
x7 = c(.9, .01, .4, .05)
)
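Since the question mentions that the real data is very large, here is a sketch of a different technique from the rle() answers: a running counter that loops over the columns (seven here) instead of over the rows, so each step is vectorized across all subjects. This is an alternative under the assumption that the data has far more rows than columns; the function name max_na_run_by_row is hypothetical.

```r
df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(.1, rep(NA, 3)),
  x2 = rep(NA, 4),
  x3 = c(.2, -.01, NA, .01),
  x4 = c(.1, rep(NA, 3)),
  x5 = c(.1, .3, NA, NA),
  x6 = c(NA, .8, .9, NA),
  x7 = c(.9, .01, .4, .05)
)

# Hypothetical helper: longest NA run per row, computed column by column
max_na_run_by_row <- function(m) {
  is_na <- unname(is.na(as.matrix(m)))
  current <- integer(nrow(is_na))  # length of the NA run ending at column j
  best <- integer(nrow(is_na))     # longest run seen so far in each row
  for (j in seq_len(ncol(is_na))) {
    # Extend the run where column j is NA; multiplying by FALSE resets it to 0
    current <- (current + 1L) * is_na[, j]
    best <- pmax(best, current)
  }
  best
}

df$NA_consecutive <- max_na_run_by_row(df[-1])
df$NA_consecutive  # 1 2 5 3
```

This still uses a for loop, but only over the columns, and rows without any NA simply get 0.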
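Here is a tidyverse option:
library(tidyverse)

df %>%
  gather(k, v, -subject) %>%
  arrange(subject, k) %>%
  group_by(subject) %>%
  # start a new run id whenever v switches between NA and non-NA
  mutate(grp = cumsum(c(0, abs(diff(!is.na(v))) == 1))) %>%
  # n = length of the run each cell belongs to
  add_count(subject, grp) %>%
  # longest run among the NA cells of each subject
  mutate(NA_consecutive = max(n[is.na(v)])) %>%
  select(-grp, -n) %>%
  spread(k, v)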
## A tibble: 4 x 9
## Groups: subject [4]
# subject NA_consecutive x1 x2 x3 x4 x5 x6 x7
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 1 0.100 NA 0.200 0.100 0.100 NA 0.900
#2 b 2 NA NA -0.0100 NA 0.300 0.800 0.0100
#3 c 5 NA NA NA NA NA 0.900 0.400
#4 d 3 NA NA 0.0100 NA NA NA 0.0500
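In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a sketch only, assuming recent versions of dplyr and tidyr are installed, the same idea can be written with pivot_longer() and the rle() trick from the other answers, joining the per-subject result back onto the wide data. The object name na_runs and the max(0L, ...) guard for subjects without any NA are additions of this sketch.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(.1, rep(NA, 3)),
  x2 = rep(NA, 4),
  x3 = c(.2, -.01, NA, .01),
  x4 = c(.1, rep(NA, 3)),
  x5 = c(.1, .3, NA, NA),
  x6 = c(NA, .8, .9, NA),
  x7 = c(.9, .01, .4, .05)
)

# One row per subject with the longest NA run; pivot_longer() keeps the
# original column order within each subject, so no arrange() is needed
na_runs <- df %>%
  pivot_longer(-subject, names_to = "k", values_to = "v") %>%
  group_by(subject) %>%
  summarise(NA_consecutive = {
    r <- rle(is.na(v))
    max(0L, r$lengths[r$values])  # 0 if the subject has no NA at all
  })

# Attach the new column to the original wide data
result <- left_join(df, na_runs, by = "subject")
result$NA_consecutive  # 1 2 5 3
```

Summarising and joining avoids reshaping the data back to wide format, which may matter when the table is large.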