Partitioned dataframe / reshape and stacking

Question

I have a data frame that is essentially partitioned like this:

Geo <- c("AGE", "region1", "region2", "region3")
y1 <- c("total", 1:3)
y2 <- c(NA, 4:6)
y3 <- c(NA, 7:9)
df <- data.frame(Geo, y1, y2, y3)

Geo <- c("AGE", "region1", "region2", "region3")
y1 <- c("60 years", 9:11)
y2 <- c(NA,12:14)
y3 <- c(NA,15:17)
df2 <- data.frame(Geo,y1,y2,y3)

# shape 
df <- rbind(df,df2)

So my data frame looks like this:

      Geo       y1 y2 y3
1     AGE    total NA NA
2 region1        1  4  7
3 region2        2  5  8
4 region3        3  6  9
5     AGE 60 years NA NA
6 region1        9 12 15
7 region2       10 13 16
8 region3       11 14 17

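As you can see, my data frame is essentially split into two blocks, with each "AGE" row marking the start of a block. I want to unstack these blocks and put them into a workable format like this:

My goal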

       Geo year value      Age
1  region1   y1     1    total
2  region1   y2     4    total
3  region1   y3     7    total
4  region2   y1     2    total
5  region2   y2     5    total
6  region2   y3     8    total
7  region3   y1     3    total
8  region3   y2     6    total
9  region3   y3     9    total
10 region1   y1     9 60 years
11 region1   y2    12 60 years
12 region1   y3    15 60 years
13 region2   y1    10 60 years
14 region2   y2    13 60 years
15 region2   y3    16 60 years
16 region3   y1    11 60 years
17 region3   y2    14 60 years
18 region3   y3    17 60 years

Can anyone suggest a fast and efficient way to do this? My original data frame contains thousands of rows.

Answer 1

So your data is a bit of a nightmare! It can be done relatively easily with some basic dplyr munging and tidyr tools, as follows:

Geo<-c("AGE","region1","region2","region3")
y1 <-c("total",1:3)
y2 <-c(NA,4:6)
y3 <-c(NA,7:9)
df<-data.frame(Geo,y1,y2,y3)

Geo<-c("AGE","region1","region2","region3")
y1 <-c("60 years",9:11)
y2 <-c(NA,12:14)
y3 <-c(NA,15:17)
df2<-data.frame(Geo,y1,y2,y3)

# shape 
df <- rbind(df,df2)

## Add age as a variable - this assumes the same number of regions for all ages
## Find all age rows and pull unique age values
library(dplyr)
library(tidyr)
library(magrittr)
library(purrr)

ages <- df %>% 
  filter(Geo %in% "AGE") %>% 
  pull(y1)

no_regions <- df %>% 
  filter(grepl("region", Geo)) %>% 
  pull(Geo) %>% 
  unique() %>% 
  length()

# Add age variable, drop Age blocks, gather variables, and arrange data
df_tidy <- df %>% 
  mutate(age = ages %>% 
           as.character %>% 
           map(rep, no_regions + 1) %>% 
           unlist) %>% 
  filter(!(Geo %in% "AGE")) %>% 
  gather(key = "variable", value = "value", y1, y2, y3) %>% 
  arrange(desc(age), Geo)

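Note: this solution only works if every age group has the same number of regions. If that is not the case, something more complex is needed (such as adding an index variable for each age block and then looping over the blocks to add the age variable). If that applies to you, let me know and I will edit the answer.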
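As a rough illustration of the "index variable per age block" idea, here is a minimal base R sketch. The data frame df_uneven is hypothetical (region3 is dropped from the second block to make the blocks unequal), and the sketch assumes the first row is always an "AGE" row.

```r
# Hypothetical example data: the second block has only two regions
df_uneven <- data.frame(
  Geo = c("AGE", "region1", "region2", "region3", "AGE", "region1", "region2"),
  y1  = c("total", 1:3, "60 years", 9:10),
  y2  = c(NA, 4:6, NA, 12:13),
  y3  = c(NA, 7:9, NA, 15:16),
  stringsAsFactors = FALSE
)

is_age <- df_uneven$Geo == "AGE"

# cumsum() increases by one at every "AGE" row, so it numbers the blocks
block <- cumsum(is_age)

# Look up each row's age label from the "AGE" row of its own block
# (assumption: row 1 is an "AGE" row, so block is never 0)
df_uneven$age <- df_uneven$y1[is_age][block]

# Drop the "AGE" divider rows
df_uneven <- df_uneven[!is_age, ]
```

Because the block index is vectorised, no explicit loop over the blocks is needed in this sketch.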

Improvement

Building on Jaap's excellent base R answer, I have generalised my tidyverse solution. It now works regardless of the number of regions. zoo::na.locf is a great function!
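To show what na.locf ("last observation carried forward") does in isolation, here is a minimal sketch on toy vectors. The vectors x and y are made up for illustration and are not part of the original data.

```r
library(zoo)

# Each NA is replaced by the most recent non-NA value before it
x <- c("total", NA, NA, "60 years", NA)
filled <- na.locf(x)

# A leading NA has nothing to carry forward. With the default na.rm = TRUE
# it is dropped and the result is shorter; na.rm = FALSE keeps the length.
y <- c(NA, "total", NA)
kept <- na.locf(y, na.rm = FALSE)
```

In the data from the question the first row is always an "AGE" row, so the leading NA case should not arise. Applied to the age column of the data frame, the generalised solution reads: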

library(dplyr)
library(tidyr)
library(magrittr)
library(zoo)


df_tidy <- df %>% 
  mutate(age = ifelse(Geo %in% "AGE", as.character(.$y1), NA) %>% 
           na.locf) %>% 
  filter(!(Geo %in% "AGE")) %>% 
  gather(key = "variable", value = "value", -Geo, -age) %>% 
  arrange(desc(age), Geo)

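This gives the data in long format: one row for each combination of region, year variable (y1 to y3) and age group.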

Answer 2

A base R solution (with a little help from zoo):

# create a new 'age' column with only values in the rows
# that have an 'age'-value in `y1`
df$age[df$Geo == "AGE"] <- as.character(df$y1[df$Geo == "AGE"])

# fill the missing values with 'na.locf' from the 'zoo'-package
df$age <- zoo::na.locf(df$age)

# filter out the rows with "AGE" in 'Geo'
df <- df[df$Geo != "AGE",]

# now convert 'y1' to integers
df$y1 <- as.integer(as.character(df$y1))

# reshape into long format and set the rownames to just a numeric index
df2 <- reshape(df, direction = "long", idvar = c("Geo","age"),
               varying = c("y1","y2","y3"), timevar = 'year',
               v.names = "value", times = c("y1","y2","y3"))
rownames(df2) <- NULL

which gives:

> df2
       Geo      age year value
1  region1    total   y1     1
2  region2    total   y1     2
3  region3    total   y1     3
4  region1 60 years   y1     9
5  region2 60 years   y1    10
6  region3 60 years   y1    11
7  region1    total   y2     4
8  region2    total   y2     5
9  region3    total   y2     6
10 region1 60 years   y2    12
11 region2 60 years   y2    13
12 region3 60 years   y2    14
13 region1    total   y3     7
14 region2    total   y3     8
15 region3    total   y3     9
16 region1 60 years   y3    15
17 region2 60 years   y3    16
18 region3 60 years   y3    17
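For readers on a recent tidyr, the same fill-then-reshape idea can probably be expressed without zoo, using tidyr::fill() and pivot_longer() (tidyr documents gather() as superseded by pivot_longer()). This is only a sketch under that assumption, not part of either answer; the names block1, block2, raw and long are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
block1 <- data.frame(Geo = c("AGE", "region1", "region2", "region3"),
                     y1 = c("total", 1:3), y2 = c(NA, 4:6), y3 = c(NA, 7:9),
                     stringsAsFactors = FALSE)
block2 <- data.frame(Geo = c("AGE", "region1", "region2", "region3"),
                     y1 = c("60 years", 9:11), y2 = c(NA, 12:14), y3 = c(NA, 15:17),
                     stringsAsFactors = FALSE)
raw <- rbind(block1, block2)

long <- raw %>%
  # copy the age label from the "AGE" rows, leave NA everywhere else
  mutate(age = ifelse(Geo == "AGE", y1, NA_character_)) %>%
  # carry each label down to the rows of its block
  fill(age) %>%
  # drop the divider rows, then make y1 numeric like y2 and y3
  filter(Geo != "AGE") %>%
  mutate(y1 = as.integer(y1)) %>%
  # stack y1..y3 into a year/value pair of columns
  pivot_longer(c(y1, y2, y3), names_to = "year", values_to = "value")
```

pivot_longer() stacks the columns row by row, so the rows should come out grouped by age block and region, which is the ordering asked for in the question.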
