How to handle combined and split variables in a dataset
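I have data (currently in a CSV file) containing one variable with events (it may be empty or hold up to 30 event codes separated by spaces), followed by the event date for each listed event in separate variables ED1, ED2, ED3...
To get useful information out of this data, I need to be able to find the date of each event. My approach is to split the event variable into new rows, but I am struggling with how to get the dates correctly. (I am using R because I will use it to analyse the data later, but I am considering whether I could switch to SQL to manage the data.)
For simplicity, the sample data has at most 5 events: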
library(tidyverse)  # provides %>%, unite() and separate_rows()

# Sample data
df <- data.frame(ID = 1:5,
                 E = c("FTT JAD AHN TKZ", "", "JAD FTT", "AJN", "TKZ AHD"),
                 ED1 = as.Date(c("2016-04-01", "", "2014-12-31", "2019-05-15", "2005-05-04")),
                 ED2 = as.Date(c("2009-06-18", "", "2007-11-12", "", "2004-04-09")),
                 ED3 = as.Date(c("2004-09-19", "", "", "", "")),
                 ED4 = as.Date(c("2012-07-15", "", "", "", "")),
                 ED5 = as.Date(NA))
# New variable with all dates
df %>%
  unite(ED, ED1:ED5, sep = " ", na.rm = TRUE) -> df
# Separate rows
df %>%
  separate_rows(E, ED, sep = " ") -> df
This works for the sample dataset, but when I try to apply it to my real data I get an error:
Error: Incompatible lengths: 4, 2.
If I am right, this means that E and ED split into different numbers of rows, so I assume the dataset is missing some data. I tried to verify this with:
df %>%
unite(ED,ED1:ED5,sep=" ", na.rm=T) %>%
mutate(E = strsplit(E," "), ED = strsplit(ED," ")) %>%
filter(length(E) != length(ED))
[1] ID E ED
<0 rows>
However, if I try separate_rows() on E or ED individually, I get different numbers of rows. This is where I am stuck.
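Supplementary question:
In another data frame, I want to add a boolean for each ID indicating whether or not that ID attended a specific event type between two dates, based on this data frame. Each ID can appear multiple times in the events data frame, and each ID can attend the same type of event multiple times.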
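The problem is apparently that there are fewer spaces separating the E values than there are in ED. To fix this, you can split the E column and pad the values with empty strings.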
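A note on why the check in the question came back with 0 rows: inside mutate()/filter(), length() is not applied per row. On a list column it returns the number of list elements (one per row), which is the same for E and ED, so the comparison is FALSE everywhere. lengths() counts the pieces inside each element instead. Here is a minimal base R sketch of that difference; the events data frame and its values are made up for illustration, chosen so that row 1 reproduces the "4, 2" mismatch from the error message:

```r
# Hypothetical toy data: row 1 has 4 event codes but only 2 dates
events <- data.frame(
  ID = 1:3,
  E  = c("FTT JAD AHN TKZ", "JAD FTT", "AJN"),
  ED = c("2016-04-01 2009-06-18", "2014-12-31 2007-11-12", "2019-05-15"),
  stringsAsFactors = FALSE
)

e_parts  <- strsplit(events$E,  " ", fixed = TRUE)
ed_parts <- strsplit(events$ED, " ", fixed = TRUE)

# length() counts list elements (one per row), so both sides are equal
# and this comparison can never flag a row
always_false <- length(e_parts) != length(ed_parts)

# lengths() counts the pieces within each row, which is what matters here
n_codes <- lengths(e_parts)
n_dates <- lengths(ed_parts)
bad <- n_codes != n_dates

mismatch <- data.frame(ID = events$ID[bad],
                       n_codes = n_codes[bad],
                       n_dates = n_dates[bad])
mismatch
```

Running the same lengths() comparison on the real data should list the IDs whose number of event codes and number of dates disagree, which are the rows that make separate_rows() fail.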
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
# Sample data
df <- data.frame(ID = 1:5,
E = c("FTT JAD AHN TKZ","", "JAD FTT", "AJN", "TKZ AHD"),
ED1 = as.Date(c("2016-04-01","","2014-12-31","2019-05-15","2005-05-04")),
ED2 = as.Date(c("2009-06-18","","2007-11-12","","2004-04-09")),
ED3 = as.Date(c("2004-09-19","","","","")),
ED4 = as.Date(c("2012-07-15","","","","")),
ED5 = as.Date(NA),stringsAsFactors=F)
# testing
df %>%
mutate(E = strsplit(E," ")) %>%
# change 5 to 30 if you want to use this code on your data
filter(lengths(E) != 5)
#> ID E ED1 ED2 ED3 ED4 ED5
#> 1 1 FTT, JAD, AHN, TKZ 2016-04-01 2009-06-18 2004-09-19 2012-07-15 <NA>
#> 2 2 <NA> <NA> <NA> <NA> <NA>
#> 3 3 JAD, FTT 2014-12-31 2007-11-12 <NA> <NA> <NA>
#> 4 4 AJN 2019-05-15 <NA> <NA> <NA> <NA>
#> 5 5 TKZ, AHD 2005-05-04 2004-04-09 <NA> <NA> <NA>
df %>%
mutate( E=lapply(strsplit(E," "), function(x) c(x, rep("", 5-length(x))) )) -> df.split
### First method keeps numerical format for date and NAs
df.split %>%
nest(ED=starts_with("ED")) %>%
mutate(ED=lapply(ED, function(x) unlist(x[1,], use.names=FALSE))) %>%
unnest(c(E, ED))
#> # A tibble: 25 x 3
#> ID E ED
#> <int> <chr> <dbl>
#> 1 1 "FTT" 16892
#> 2 1 "JAD" 14413
#> 3 1 "AHN" 12680
#> 4 1 "TKZ" 15536
#> 5 1 "" NA
#> 6 2 "" NA
#> 7 2 "" NA
#> 8 2 "" NA
#> 9 2 "" NA
#> 10 2 "" NA
#> # … with 15 more rows
### Second method Everything is a string
df.split %>%
unite(ED,ED1:ED5,sep=" ", na.rm=T)%>%
mutate( ED = strsplit(ED," ")) %>%
unnest(c(E, ED))
#> # A tibble: 25 x 3
#> ID E ED
#> <int> <chr> <chr>
#> 1 1 "FTT" 16892
#> 2 1 "JAD" 14413
#> 3 1 "AHN" 12680
#> 4 1 "TKZ" 15536
#> 5 1 "" NA
#> 6 2 "" NA
#> 7 2 "" NA
#> 8 2 "" NA
#> 9 2 "" NA
#> 10 2 "" NA
#> # … with 15 more rows
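The first method returns ED as plain numbers because unlist() drops the Date class. These numbers are days since 1970-01-01, so they can be turned back into dates afterwards. A small sketch using the values printed above (the variable names ed_num and ed_date are only for illustration):

```r
# Day counts copied from the first method's output (days since 1970-01-01)
ed_num <- c(16892, 14413, 12680, 15536, NA)

# Restore the Date class; NA stays NA
ed_date <- as.Date(ed_num, origin = "1970-01-01")
ed_date
```

In a pipeline, the same conversion could presumably be applied with mutate(ED = as.Date(ED, origin = "1970-01-01")) after unnest().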
Going fully tidy:
df%>%
separate(E, into=paste0("E", 1:5), fill="right", sep=" ") %>%
unite(E, E1:E5,sep=" ") %>%
unite(ED, ED1:ED5,sep=" ") %>%
mutate(E=strsplit(E," "), ED=strsplit(ED," ")) %>%
unnest(c(E,ED)) %>% mutate(ED=as.Date(ED)) %>% filter(!is.na(ED))
#> ID E ED
#> <int> <chr> <date>
#> 1 1 FTT 2016-04-01
#> 2 1 JAD 2009-06-18
#> 3 1 AHN 2004-09-19
#> 4 1 TKZ 2012-07-15
#> 5 3 JAD 2014-12-31
#> 6 3 FTT 2007-11-12
#> 7 4 AJN 2019-05-15
#> 8 5 TKZ 2005-05-04
#> 9 5 AHD 2004-04-09
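For the supplementary question, once the events are in this long format (one row per ID and event), a per-ID boolean can be derived with a plain membership test. The following is only a sketch in base R: the people data frame, the attended_between() helper, the column name FTT_2005_2010 and the chosen event type and date range are all hypothetical and stand in for whatever the second data frame actually looks like.

```r
# Long-format events, copied from the result above (one row per ID and event)
events <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 5L),
  E  = c("FTT", "JAD", "AHN", "TKZ", "JAD", "FTT", "AJN", "TKZ", "AHD"),
  ED = as.Date(c("2016-04-01", "2009-06-18", "2004-09-19", "2012-07-15",
                 "2014-12-31", "2007-11-12", "2019-05-15", "2005-05-04",
                 "2004-04-09")),
  stringsAsFactors = FALSE
)

# Hypothetical second data frame with one row per ID
people <- data.frame(ID = 1:5)

# Hypothetical helper: TRUE if the ID has at least one event of the given
# type with a date in the closed interval [from, to]
attended_between <- function(ids, events, type, from, to) {
  hit <- events$E == type &
    events$ED >= as.Date(from) &
    events$ED <= as.Date(to)
  ids %in% events$ID[hit]
}

people$FTT_2005_2010 <- attended_between(people$ID, events, "FTT",
                                         "2005-01-01", "2010-12-31")
people
```

Because the test uses %in%, it does not matter how many times an ID appears in the events data or how often it attended the same event type; a single matching row is enough to give TRUE.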