Can I combine a dplyr mutate_at & mutate_if statement?
I have the following example data:
country country-year year a b
1 France France2000 2000 NA NA
2 France France2001 2001 1000 1000
3 France France2002 2002 NA NA
4 France France2003 2003 1600 2200
5 France France2004 2004 NA NA
6 UK UK2000 2000 1000 1000
7 UK UK2001 2001 NA NA
8 UK UK2002 2002 1000 1000
9 UK UK2003 2003 NA NA
10 UK UK2004 2004 NA NA
11 Germany UK2000 2000 NA NA
12 Germany UK2001 2001 NA NA
13 Germany UK2002 2002 NA NA
14 Germany UK2003 2003 NA NA
15 Germany UK2004 2004 NA NA
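I would like to interpolate the data (but not extrapolate) and remove the rows in which columns a and b are both NA. In other words, I want to remove all rows that cannot be interpolated; in the example: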
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
11 Germany UK2000 NA NA
12 Germany UK2001 NA NA
13 Germany UK2002 NA NA
14 Germany UK2003 NA NA
15 Germany UK2004 NA NA
There are two options that almost do what I want:
library(tidyverse)
library(zoo)
df %>%
  group_by(country) %>%
  mutate_at(vars(a:b), ~na.fill(.x, c(NA, "extend", NA))) %>%
  filter(!is.na(a) | !is.na(b))
and
df %>%
  group_by(country) %>%
  mutate_if(is.numeric, ~if (all(is.na(.x))) NA else na.fill(.x, "extend"))
Is it possible to combine the two, with something like this:
df <- df %>%
  group_by(country) %>%
  mutate_at(vars(a:b), ~if (all(is.na(.x))) NA else na.fill(.x, c(NA, "extend", NA))) %>%
  filter(!is.na(a) | !is.na(b))
Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 1000 1000
8 UK UK2002 1000 1000
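I know this does not directly answer the question of how to combine mutate_if and mutate_at, but it solves your general problem:
I first drop all countries for which a and b are entirely missing, then determine for each country the first and last year with a non-missing value. After filtering to that range, I use na.fill.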
library(dplyr)
library(readr)
library(zoo)
country_data %>%
  # country.year is a factor, so convert it to character before parsing the year
  mutate(Year = parse_number(as.character(country.year))) %>%
  group_by(country) %>%
  mutate(not_all_na = any(!(is.na(a) & is.na(b)))) %>%
  filter(not_all_na) %>%
  mutate(Year_min_not_na = min(Year[!(is.na(a) & is.na(b))]),
         Year_max_not_na = max(Year[!(is.na(a) & is.na(b))])) %>%
  filter(Year >= Year_min_not_na, Year <= Year_max_not_na) %>%
  mutate_at(vars(a:b), ~na.fill(.x, "extend"))
# A tibble: 6 x 8
# Groups: country [2]
# country country.year a b Year not_all_na Year_min_not_na Year_max_not_na
# <fct> <fct> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
# 1 France France2001 1000 1000 2001 TRUE 2001 2003
# 2 France France2002 1300 1600 2002 TRUE 2001 2003
# 3 France France2003 1600 2200 2003 TRUE 2001 2003
# 4 UK UK2000 1000 1000 2000 TRUE 2000 2002
# 5 UK UK2001 1000 1000 2001 TRUE 2000 2002
# 6 UK UK2002 1000 1000 2002 TRUE 2000 2002
Data
country_data <-
structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L),
.Label = c("France", "Germany", "UK"), class = "factor"),
country.year = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 6L, 7L, 8L, 9L, 10L),
.Label = c("France2000", "France2001", "France2002", "France2003",
"France2004", "UK2000", "UK2001", "UK2002", "UK2003", "UK2004"),
class = "factor"),
a = c(NA, 1000L, NA, 1600L, NA, 1000L, NA, 1000L, NA, NA, NA, NA, NA, NA, NA),
b = c(NA, 1000L, NA, 2200L, NA, 1000L, NA, 1000L, NA, NA, NA, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -15L))
Here is my take:
library(data.table)
library(tidyverse)
library(zoo)
df <- fread("
n country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
11 Germany UK2000 NA NA
12 Germany UK2001 NA NA
13 Germany UK2002 NA NA
14 Germany UK2003 NA NA
15 Germany UK2004 NA NA
") %>% select(-n)
# Clean data: extract the four-digit year from the end of country-year
df <- df %>%
  mutate(year = str_extract(`country-year`, "[0-9]{4}$")) %>%
  select(country, year, a, b)
# Remove all rows NA in a and b if there is no earlier
# or later row with value for a and b
# I hope this was what you meant with extrapolate :)
df <- df %>%
group_by(country) %>%
filter(year >= min(year[!is.na(a) | !is.na(b)]),
year <= max(year[!is.na(a) | !is.na(b)])) %>%
ungroup()
# Interpolate
df %>%
  mutate_at(vars(a:b), ~na.fill(., "extend"))
Result:
# A tibble: 6 x 4
country year a b
<chr> <chr> <dbl> <dbl>
1 France 2001 1000. 1000.
2 France 2002 1300. 1600.
3 France 2003 1600. 2200.
4 UK 2000 1000. 1000.
5 UK 2001 1000. 1000.
6 UK 2002 1000. 1000.
Here are two more approaches, using filter and slice. The first one should be closest to what the OP asked for:
library(dplyr)
library(zoo)
df %>%
  group_by(country) %>%
  mutate_if(is.numeric, na.approx, na.rm = FALSE) %>%
  filter(!is.na(a) | !is.na(b))
or with slice:
df %>%
  group_by(country) %>%
  filter(any(!is.na(a) | !is.na(b))) %>%
  slice(min(which(!is.na(a) | !is.na(b))):max(which(!is.na(a) | !is.na(b)))) %>%
  mutate_if(is.numeric, na.approx)
Result:
# A tibble: 6 x 4
# Groups: country [2]
country country.year a b
<fct> <fct> <dbl> <dbl>
1 France France2001 1000 1000
2 France France2002 1300 1600
3 France France2003 1600 2200
4 UK UK2000 1000 1000
5 UK UK2001 1000 1000
6 UK UK2002 1000 1000
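The first approach relies on na.approx(..., na.rm = FALSE) filling only the gaps that have a known value on both sides and leaving leading and trailing NA in place, so that filter can drop those rows afterwards. A minimal sketch on a single vector (column a for France from the example data; the variable names are only for illustration):

```r
library(zoo)

# Column a for France, years 2000 to 2004
a <- c(NA, 1000, NA, 1600, NA)

# na.rm = FALSE keeps the vector length: the interior gap is interpolated,
# the leading and trailing NA stay NA (no extrapolation)
filled <- na.approx(a, na.rm = FALSE)
# filled is NA, 1000, 1300, 1600, NA

# With the default na.rm = TRUE the leading and trailing NA are trimmed,
# so the result is shorter than the input
trimmed <- na.approx(a)
# trimmed is 1000, 1300, 1600
```

This is also why the slice variant can call na.approx without na.rm = FALSE: after slicing, each group starts and ends with a known value, so nothing is trimmed.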
Data:
df <- structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L), .Label = c("France", "Germany",
"UK"), class = "factor"), country.year = structure(c(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 6L, 7L, 8L, 9L, 10L), .Label = c("France2000",
"France2001", "France2002", "France2003", "France2004", "UK2000",
"UK2001", "UK2002", "UK2003", "UK2004"), class = "factor"), a = c(NA,
1000L, NA, 1600L, NA, 1000L, NA, 1000L, NA, NA, NA, NA, NA, NA,
NA), b = c(NA, 1000L, NA, 2200L, NA, 1000L, NA, 1000L, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("country", "country.year", "a",
"b"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"))
Unfortunately, @kath's solution only works on the given example; it fails if only one of the columns contains data, for example:
country country-year year a b
France France2000 2000 NA NA
France France2001 2001 1000 1000
France France2002 2002 NA NA
France France2003 2003 1600 2200
France France2004 2004 NA NA
UK UK2000 2000 1000 1000
UK UK2001 2001 NA NA
UK UK2002 2002 1000 1000
UK UK2003 2003 NA NA
UK UK2004 2004 NA NA
Germany UK2000 2000 NA NA
Germany UK2001 2001 NA 500
Germany UK2002 2002 NA NA
Germany UK2003 2003 NA 1100
Germany UK2004 2004 NA NA
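Also unfortunately, the answer to the OP's question is no: you cannot mix mutate_at and mutate_if (there is no function that lets you specify both .predicate and .vars).
However, you can use a predicate inside the function that you pass to mutate_at. So here is my solution, using mutate_at with a function that contains the predicate: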
df %>%
  group_by(country) %>%
  # Interpolate only if at least two non-NA values are present
  mutate_at(vars(a, b), ~if (sum(!is.na(.x)) < 2) NA_real_ else approx(year, .x, year)$y) %>%
  # Keep only rows with an original or interpolated value in column a or b
  filter_at(vars(a, b), any_vars(!is.na(.)))
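For readers on dplyr 1.0.0 or later: the scoped verbs (mutate_at, mutate_if, filter_at) are superseded by across(), and a tidyselect selection can combine column names with a where() predicate using &. The following is only a sketch under that assumption, not part of the original answer; the data frame toy and the helper interp are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: Germany only has values in column b
toy <- data.frame(
  country = rep(c("UK", "Germany"), each = 5),
  year    = rep(2000:2004, times = 2),
  a       = c(1000, NA, 1000, NA, NA,  NA, NA, NA, NA, NA),
  b       = c(1000, NA, 1000, NA, NA,  NA, 500, NA, 1100, NA)
)

# Hypothetical helper: interpolate y over x only when at least two values
# are known; approx() defaults to rule = 1, so it does not extrapolate
interp <- function(y, x) {
  if (sum(!is.na(y)) < 2) rep(NA_real_, length(y)) else approx(x, y, xout = x)$y
}

res <- toy %>%
  group_by(country) %>%
  # columns a and b, but only if numeric: the "at" and "if" parts in one selection
  mutate(across(c(a, b) & where(is.numeric), ~ interp(.x, year))) %>%
  ungroup() %>%
  # keep rows that have a value in at least one of the two columns
  filter(!is.na(a) | !is.na(b))
```

With this toy input, res should keep UK 2000 to 2002 and Germany 2001 to 2003, with Germany's column a left as NA.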
With dplyr 0.8.3, the condition can go inside the function passed to mutate_at:
library(dplyr)
iris[1:3, ] %>%
  mutate_at("Petal.Width",
            list(~ ifelse(Sepal.Width == 3.5, . + 10, . + 100)))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 10.2 setosa
#> 2 4.9 3.0 1.4 100.2 setosa
#> 3 4.7 3.2 1.3 100.2 setosa
With a new column toto:
library(dplyr)
iris[1:3, ] %>%
  mutate_at("Petal.Width",
            list(toto = ~ ifelse(Sepal.Width == 3.5, . + 10, . + 100)))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species toto
#> 1 5.1 3.5 1.4 0.2 setosa 10.2
#> 2 4.9 3.0 1.4 0.2 setosa 100.2
#> 3 4.7 3.2 1.3 0.2 setosa 100.2
Created on 2019-07-30 by the reprex package (v0.2.1)
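On dplyr 1.0.0 or later the same two patterns can be written with across(). This is a sketch assuming that version; .names = "toto" is simply one way to reproduce the new column name, and the object names in_place and new_col are made up for illustration.

```r
library(dplyr)

# Overwrite Petal.Width in place
in_place <- iris[1:3, ] %>%
  mutate(across(Petal.Width, ~ ifelse(Sepal.Width == 3.5, .x + 10, .x + 100)))

# Write the result to a new column called toto and keep Petal.Width unchanged
new_col <- iris[1:3, ] %>%
  mutate(across(Petal.Width,
                ~ ifelse(Sepal.Width == 3.5, .x + 10, .x + 100),
                .names = "toto"))
```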