Partitioned dataframe / reshape and stacking
I have a data frame that is essentially partitioned like this:
Geo <- c("AGE", "region1", "region2", "region3")
y1 <- c("total", 1:3)
y2 <- c(NA, 4:6)
y3 <- c(NA, 7:9)
df <- data.frame(Geo, y1, y2, y3)
Geo <- c("AGE", "region1", "region2", "region3")
y1 <- c("60 years", 9:11)
y2 <- c(NA,12:14)
y3 <- c(NA,15:17)
df2 <- data.frame(Geo,y1,y2,y3)
# shape
df <- rbind(df,df2)
So my data frame looks like this:
Geo y1 y2 y3
1 AGE total NA NA
2 region1 1 4 7
3 region2 2 5 8
4 region3 3 6 9
5 AGE 60 years NA NA
6 region1 9 12 15
7 region2 10 13 16
8 region3 11 14 17
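As you can see, my data frame is essentially split into two blocks, and each "AGE" row acts as the separator that starts a new block. I want to unstack these blocks and put them into a workable long format like this:
My goal: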
Geo year value Age
1 region1 y1 1 total
2 region1 y2 4 total
3 region1 y3 7 total
4 region2 y1 2 total
5 region2 y2 5 total
6 region2 y3 8 total
7 region3 y1 3 total
8 region3 y2 6 total
9 region3 y3 9 total
10 region1 y1 9 60 years
11 region1 y2 12 60 years
12 region1 y3 15 60 years
13 region2 y1 10 60 years
14 region2 y2 13 60 years
15 region2 y3 16 60 years
16 region3 y1 11 60 years
17 region3 y2 14 60 years
18 region3 y3 17 60 years
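Can anyone suggest a fast and efficient way to do this? My original data frame contains thousands of rows.
So your data is a bit of a nightmare! It can be done relatively easily with some basic dplyr munging and tidyr tools, as follows: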
Geo<-c("AGE","region1","region2","region3")
y1 <-c("total",1:3)
y2 <-c(NA,4:6)
y3 <-c(NA,7:9)
df<-data.frame(Geo,y1,y2,y3)
Geo<-c("AGE","region1","region2","region3")
y1 <-c("60 years",9:11)
y2 <-c(NA,12:14)
y3 <-c(NA,15:17)
df2<-data.frame(Geo,y1,y2,y3)
# shape
df <- rbind(df,df2)
## Add age as a variable - this assumes the same number of regions for all ages
## Find all age rows and pull unique age values
library(dplyr)
library(tidyr)
library(magrittr)
library(purrr)
ages <- df %>%
  filter(Geo %in% "AGE") %>%
  pull(y1)
no_regions <- df %>%
  filter(grepl("region", Geo)) %>%
  pull(Geo) %>%
  unique() %>%
  length()
# Add age variable, drop Age blocks, gather variables, and arrange data
df_tidy <- df %>%
  mutate(age = ages %>%
           as.character %>%
           map(rep, no_regions + 1) %>%
           unlist) %>%
  filter(!(Geo %in% "AGE")) %>%
  gather(key = "variable", value = "value", y1, y2, y3) %>%
  arrange(desc(age), Geo)
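Note: this solution only works if every age group has the same number of regions. If that is not the case, something more involved is needed (for example, adding a variable for each age group and then looping to add the age variable). If that is your situation, let me know and I will edit the answer.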
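For blocks of unequal size, one possible approach is to number the blocks with cumsum() over the "AGE" marker rows and look up each row's label by block number. This is only a sketch and not the answerer's own edit: the data frame `df_uneven` (with an extra region in the second block) and the helper names `is_age` and `block` are made up for illustration, and it assumes the first row of the data is an "AGE" row.

```r
# Hypothetical example data: the second block has one more region than the first
df_uneven <- data.frame(
  Geo = c("AGE", "region1", "region2", "AGE", "region1", "region2", "region3"),
  y1  = c("total", 1, 2, "60 years", 9, 10, 11),
  y2  = c(NA, 4, 5, NA, 12, 13, 14),
  y3  = c(NA, 7, 8, NA, 15, 16, 17),
  stringsAsFactors = FALSE
)

# TRUE on the marker rows that start each block
is_age <- df_uneven$Geo == "AGE"

# cumsum() over the markers gives a running block number: 1 1 1 2 2 2 2
block <- cumsum(is_age)

# Look up each row's age label from the marker row of its block
df_uneven$age <- df_uneven$y1[is_age][block]

# Drop the marker rows; what remains can be reshaped as before
df_uneven <- df_uneven[!is_age, ]
```

Because the block number is derived from the marker rows themselves, nothing here depends on how many regions each age group has.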
Improvement
Building on Jaap's excellent base R answer, I have generalised my tidyverse solution. It now works regardless of the number of regions; zoo::na.locf is a great function!
library(dplyr)
library(tidyr)
library(magrittr)
library(zoo)
df_tidy <- df %>%
  mutate(age = ifelse(Geo %in% "AGE", as.character(.$y1), NA) %>%
           na.locf) %>%
  filter(!(Geo %in% "AGE")) %>%
  gather(key = "variable", value = "value", -Geo, -age) %>%
  arrange(desc(age), Geo)
This gives the result in the desired long format.
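The same idea can also be written without zoo, using current tidyr verbs. What follows is a sketch rather than part of the original answer: it assumes tidyr 1.0.0 or later (for pivot_longer()), swaps in tidyr::fill() for na.locf and pivot_longer() for gather(), and rebuilds the example data so the block runs on its own. The name `df_long` is hypothetical.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  Geo = rep(c("AGE", "region1", "region2", "region3"), 2),
  y1  = c("total", 1:3, "60 years", 9:11),
  y2  = c(NA, 4:6, NA, 12:14),
  y3  = c(NA, 7:9, NA, 15:17),
  stringsAsFactors = FALSE
)

df_long <- df %>%
  # Copy the label from each "AGE" row, leave NA elsewhere
  mutate(age = ifelse(Geo == "AGE", y1, NA_character_)) %>%
  # Carry the last non-missing label downwards
  fill(age, .direction = "down") %>%
  # Drop the marker rows
  filter(Geo != "AGE") %>%
  # y1 is character because of the labels; make it integer like y2 and y3
  mutate(y1 = as.integer(y1)) %>%
  pivot_longer(cols = c(y1, y2, y3), names_to = "year", values_to = "value")
```

Converting y1 before pivoting matters here, since pivot_longer() refuses to combine a character column with integer columns into one value column.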
A base R solution (with a little zoo):
# create a new 'age' column with only values in the rows
# that have an 'age'-value in `y1`
df$age[df$Geo == "AGE"] <- as.character(df$y1[df$Geo == "AGE"])
# fill the missing values with 'na.locf' from the 'zoo'-package
df$age <- zoo::na.locf(df$age)
# filter out the rows with "AGE" in 'Geo'
df <- df[df$Geo != "AGE",]
# now convert 'y1' to integers
df$y1 <- as.integer(as.character(df$y1))
# reshape into long format and set the rownames to just a numeric index
df2 <- reshape(df, direction = "long", idvar = c("Geo","age"),
               varying = c("y1","y2","y3"), timevar = 'year',
               v.names = "value", times = c("y1","y2","y3"))
rownames(df2) <- NULL
which gives:
> df2
Geo age year value
1 region1 total y1 1
2 region2 total y1 2
3 region3 total y1 3
4 region1 60 years y1 9
5 region2 60 years y1 10
6 region3 60 years y1 11
7 region1 total y2 4
8 region2 total y2 5
9 region3 total y2 6
10 region1 60 years y2 12
11 region2 60 years y2 13
12 region3 60 years y2 14
13 region1 total y3 7
14 region2 total y3 8
15 region3 total y3 9
16 region1 60 years y3 15
17 region2 60 years y3 16
18 region3 60 years y3 17
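The rows above are grouped by year first, whereas the layout requested in the question is grouped by age block, then region, then year. If that exact order matters, a reordering step along the following lines could be added. This is a sketch only: `df2` is rebuilt here as a stand-in with the same columns and row order as the output above, `age_order` is a hypothetical helper name, and the age column keeps its lowercase name.

```r
# Stand-in for the df2 produced above (same columns, same row order)
df2 <- data.frame(
  Geo   = rep(c("region1", "region2", "region3"), times = 6),
  age   = rep(rep(c("total", "60 years"), each = 3), times = 3),
  year  = rep(c("y1", "y2", "y3"), each = 6),
  value = c(1:3, 9:11, 4:6, 12:14, 7:9, 15:17),
  stringsAsFactors = FALSE
)

# Rank each age label by its first appearance so "total" stays ahead of "60 years"
age_order <- match(df2$age, unique(df2$age))

# Sort by age block, then region, then year, and put the columns in the requested order
df2 <- df2[order(age_order, df2$Geo, df2$year), c("Geo", "year", "value", "age")]
rownames(df2) <- NULL
```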