Transform wide data into long by mutating some variables and keeping others the same as before
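I have a dataset of 6000 villages with variables such as population (which I hold constant at its baseline-year value). There are also three variables, project_1, project_2 and project_3, which give details of when each project was implemented in the village. This is what the data looks like.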
| village | population | project_1 | project_2 | project_3 |
|---------|------------|-----------|-----------|-----------|
| A | 100 | 2002 | | |
| B | 200 | | 2003 | 2002 |
| C | 150 | 2004 | | |
| D | 175 | | | 2005 |
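I want to convert this data into long format (see below). So basically each project variable becomes a dummy variable that takes the value 1 in the year the project is implemented and stays equal to 1 thereafter.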
| village | population | year | project_1 | project_2 | project_3 |
|---------|------------|------|-----------|-----------|-----------|
| A | 100 | 2001 | 0 | 0 | 0 |
| A | 100 | 2002 | 1 | 0 | 0 |
| A | 100 | 2003 | 1 | 0 | 0 |
| A | 100 | 2004 | 1 | 0 | 0 |
| A | 100 | 2005 | 1 | 0 | 0 |
| B | 200 | 2001 | 0 | 0 | 0 |
| B | 200 | 2002 | 0 | 0 | 1 |
| B | 200 | 2003 | 0 | 1 | 1 |
| B | 200 | 2004 | 0 | 1 | 1 |
| B | 200 | 2005 | 0 | 1 | 1 |
| C | 150 | 2001 | 0 | 0 | 0 |
| C | 150 | 2002 | 0 | 0 | 0 |
| C | 150 | 2003 | 0 | 0 | 0 |
| C | 150 | 2004 | 1 | 0 | 0 |
| C | 150 | 2005 | 1 | 0 | 0 |
| D | 175 | 2001 | 0 | 0 | 0 |
| D | 175 | 2002 | 0 | 0 | 0 |
| D | 175 | 2003 | 0 | 0 | 0 |
| D | 175 | 2004 | 0 | 0 | 0 |
| D | 175 | 2005 | 0 | 0 | 1 |
I have tried this code, but so far it does not work.
temp_long <- reshape(data = temp,
idvar= "village",
varying = 3:5,
sep= "",
timevar= "year",
times = c(2001,2002,2003,2004,2005),
new.row.names= 1:100000,
direction = "long")
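Assuming your data is:
library(readr)
# read_table2() is deprecated in readr >= 2.0; read_table() is its replacement there
df <- read_table2("village population project_1 project_2 project_3
A 100 2002 NA NA
B 200 NA 2003 2002
C 150 2004 NA NA
D 175 NA NA 2005")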
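Using dplyr (with replace_na() from tidyr):
library(dplyr)
library(tidyr)
df %>%
  # one row per village-year combination
  merge(expand.grid(year=2001:2005, village=.$village), by="village") %>%
  # 1 once the start year has been reached, 0 before it or if the project never started (NA)
  mutate(across(starts_with("project_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
  select(village, population, year, starts_with("pro"))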
yields
village population year project_1 project_2 project_3
1 A 100 2001 0 0 0
2 A 100 2002 1 0 0
3 A 100 2003 1 0 0
4 A 100 2004 1 0 0
5 A 100 2005 1 0 0
6 B 200 2001 0 0 0
7 B 200 2002 0 0 1
8 B 200 2003 0 1 1
9 B 200 2004 0 1 1
10 B 200 2005 0 1 1
11 C 150 2001 0 0 0
12 C 150 2002 0 0 0
13 C 150 2003 0 0 0
14 C 150 2004 1 0 0
15 C 150 2005 1 0 0
16 D 175 2001 0 0 0
17 D 175 2002 0 0 0
18 D 175 2003 0 0 0
19 D 175 2004 0 0 0
20 D 175 2005 0 0 1
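The step that does the work above is the comparison `.x <= year` followed by `replace_na()`. The following is a minimal base R sketch of why the NA handling is needed; the vector `start_year` and the single `year` value are made up for illustration, and the NA replacement is written in base R as a stand-in for `replace_na()`.

```r
# Hypothetical start years for three villages; NA = project never implemented
start_year <- c(2002, NA, 2004)
year <- 2003

# Comparing with NA gives NA, not FALSE
raw <- start_year <= year
print(raw)

# Base R stand-in for replace_na(raw, 0):
# a project that was never implemented counts as "not yet implemented"
filled <- raw
filled[is.na(filled)] <- FALSE

# Convert TRUE/FALSE to the 1/0 dummy
dummy <- as.numeric(filled)
print(dummy)
```

Without the replacement step, villages that never received a project would carry NA instead of 0 in every year.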
With your dput data
df2 <- structure(list(key = c("057091", "057296", "057802", "057806", "058105", "058309"), TOT_POP = c(795, 378, 669, 3760, 55, 933 ), road_comp_date_upg_year_final = c(2009, 2004, 2006, 2006, 2008, 2012), road_award_date_upg_year_final = c(2008, 2003, 2005, 2005, 2007, 2010), road_comp_date_stip_upg_year_final = c(2009, 2003, 2006, 2006, 2008, 2011)), row.names = c(NA, 6L), class = "data.frame")
and the adjusted code
df2 %>%
merge(expand.grid(year=2001:2015, key=.$key), by="key") %>%
mutate(across(starts_with("road_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
select(key, TOT_POP, year, starts_with("road"))
creates
key TOT_POP year road_comp_date_upg_year_final road_award_date_upg_year_final road_comp_date_stip_upg_year_final
1 057091 795 2001 0 0 0
2 057091 795 2002 0 0 0
3 057091 795 2003 0 0 0
4 057091 795 2004 0 0 0
5 057091 795 2005 0 0 0
6 057091 795 2006 0 0 0
7 057091 795 2007 0 0 0
8 057091 795 2008 0 1 0
9 057091 795 2009 1 1 1
10 057091 795 2010 1 1 1
11 057091 795 2011 1 1 1
12 057091 795 2012 1 1 1
13 057091 795 2013 1 1 1
14 057091 795 2014 1 1 1
15 057091 795 2015 1 1 1
16 057296 378 2001 0 0 0
17 057296 378 2002 0 0 0
18 057296 378 2003 0 1 1
19 057296 378 2004 1 1 1
20 057296 378 2005 1 1 1
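Since the same steps were applied twice with different column names, they could be wrapped in a small helper. This is only a sketch in base R, not part of the approach above: the function name `to_panel` and its arguments `id`, `prefix` and `years` are hypothetical, and it assumes the id column has one row per village and that the data has no existing column called `year`.

```r
# Hypothetical helper: expand wide "start year" columns into a 0/1 village-year panel
to_panel <- function(data, id, prefix, years) {
  # One row per id-year combination
  grid <- setNames(
    expand.grid(data[[id]], years, stringsAsFactors = FALSE),
    c(id, "year")
  )
  long <- merge(data, grid, by = id)

  # Columns holding start years, identified by their name prefix
  cols <- grep(paste0("^", prefix), names(long), value = TRUE)
  for (col in cols) {
    # 1 from the start year onward, 0 before it or when the start year is NA
    long[[col]] <- as.integer(!is.na(long[[col]]) & long[[col]] <= long$year)
  }

  # Keep constant variables (e.g. population) next to the id, then year, then dummies
  other <- setdiff(names(data), c(id, cols))
  long <- long[order(long[[id]], long$year), c(id, other, "year", cols)]
  rownames(long) <- NULL
  long
}

df2 <- data.frame(
  key = c("057091", "057296", "057802", "057806", "058105", "058309"),
  TOT_POP = c(795, 378, 669, 3760, 55, 933),
  road_comp_date_upg_year_final = c(2009, 2004, 2006, 2006, 2008, 2012),
  road_award_date_upg_year_final = c(2008, 2003, 2005, 2005, 2007, 2010),
  road_comp_date_stip_upg_year_final = c(2009, 2003, 2006, 2006, 2008, 2011),
  stringsAsFactors = FALSE
)

panel <- to_panel(df2, id = "key", prefix = "road_", years = 2001:2015)
nrow(panel)  # 6 keys x 15 years = 90 rows
```

Checking `nrow(panel)` against the number of villages times the number of years is a quick sanity test before running it on all 6000 villages.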