How to handle combined and split variables in a dataset

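I have data (currently in a CSV) with one variable that holds events (it may be empty or contain up to 30 space-separated event codes), followed by the date of each listed event in separate variables ED1, ED2, ED3, ...

To get useful information out of this data, I need to be able to find the date of each event. My approach is to split the event variable into new rows, but I'm struggling to get the dates right. (I'm using R because I'll use it later to analyze the data, but I'm considering whether I could switch to SQL to manage the data.)

For simplicity, the sample data has at most 5 events: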

library(tidyverse)

# Sample data
df <- data.frame(ID = 1:5,
                 E = c("FTT JAD AHN TKZ","", "JAD FTT", "AJN", "TKZ AHD"),
                 ED1 = as.Date(c("2016-04-01","","2014-12-31","2019-05-15","2005-05-04")),
                 ED2 = as.Date(c("2009-06-18","","2007-11-12","","2004-04-09")),
                 ED3 = as.Date(c("2004-09-19","","","","")),
                 ED4 = as.Date(c("2012-07-15","","","","")),
                 ED5 = as.Date(NA))
# New variable with all dates
df %>%
  unite(ED, ED1:ED5, sep=" ", na.rm=T) -> df

# Separate rows
df %>%
  separate_rows(E,ED,sep=" ") -> df

This works on the sample dataset, but when I apply it to my own data I get an error:

Error: Incompatible lengths: 4, 2.

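If I understand correctly, this means E and ED split into different numbers of rows, so I suspect the dataset is missing data. I tried to verify this with: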

df %>% 
  unite(ED,ED1:ED5,sep=" ", na.rm=T) %>%
  mutate(E = strsplit(E," "), ED = strsplit(ED," ")) %>% 
  filter(length(E) != length(ED))

[1] ID E  ED
<0 rows>

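However, if I run separate_rows() on E and ED separately, I get different numbers of rows. This is where I'm stuck.

Follow-up question: in another data frame, I want to add a boolean for each ID that says whether or not the ID attended a specific event type between two dates, based on this data frame. Each ID can appear multiple times in the event data frame, and each ID can attend the same type of event multiple times.

The problem is evidently that E contains fewer space-separated values than ED. To fix this, you can split the E column and pad the values with empty strings.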

library(tidyverse)

# Sample data
df <- data.frame(ID = 1:5,
                 E = c("FTT JAD AHN TKZ","", "JAD FTT", "AJN", "TKZ AHD"),
                 ED1 = as.Date(c("2016-04-01","","2014-12-31","2019-05-15","2005-05-04")),
                 ED2 = as.Date(c("2009-06-18","","2007-11-12","","2004-04-09")),
                 ED3 = as.Date(c("2004-09-19","","","","")),
                 ED4 = as.Date(c("2012-07-15","","","","")),
                 ED5 = as.Date(NA),stringsAsFactors=F)
# testing 
df %>% 
  mutate(E = strsplit(E," ")) %>% 
  # change 5 to 30 if you want to use this code on your data
  filter(lengths(E) != 5)
#>   ID                  E        ED1        ED2        ED3        ED4  ED5
#> 1  1 FTT, JAD, AHN, TKZ 2016-04-01 2009-06-18 2004-09-19 2012-07-15 <NA>
#> 2  2                          <NA>       <NA>       <NA>       <NA> <NA>
#> 3  3           JAD, FTT 2014-12-31 2007-11-12       <NA>       <NA> <NA>
#> 4  4                AJN 2019-05-15       <NA>       <NA>       <NA> <NA>
#> 5  5           TKZ, AHD 2005-05-04 2004-04-09       <NA>       <NA> <NA>
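
The check in the question returns zero rows because `length()` on a list column gives the number of rows, which is the same for E and ED, so the comparison is never true. A minimal sketch of a per-row check with `lengths()` follows; the data frame `chk` and the columns `n_E` and `n_ED` are made up for illustration, and ID 2 deliberately has three codes but only two dates.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows: ID 2 has three event codes but only two dates.
chk <- data.frame(
  ID  = 1:2,
  E   = c("FTT JAD", "JAD FTT AHN"),
  ED1 = as.Date(c("2016-04-01", "2014-12-31")),
  ED2 = as.Date(c("2009-06-18", "2007-11-12")),
  stringsAsFactors = FALSE
)

mismatch <- chk %>%
  unite(ED, ED1:ED2, sep = " ", na.rm = TRUE) %>%
  # lengths() counts the elements of each list entry, i.e. per row
  mutate(n_E  = lengths(strsplit(E, " ")),
         n_ED = lengths(strsplit(ED, " "))) %>%
  filter(n_E != n_ED)

mismatch  # keeps only ID 2
```

Run against the real data, a check like this should list the IDs whose event count and date count disagree. A doubled space inside E would also inflate the count, since `strsplit()` then returns an empty string between the two separators.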

df %>% 
   mutate( E=lapply(strsplit(E," "), function(x) c(x, rep("", 5-length(x))) )) -> df.split 

### First method keeps numerical format for date and NAs
df.split %>% 
    nest(ED=starts_with("ED"))  %>% 
    mutate(ED=lapply(ED, function(x) unlist(x[1,], use.names=FALSE))) %>%
    unnest(c(E, ED))
#> # A tibble: 25 x 3
#>       ID E        ED
#>    <int> <chr> <dbl>
#>  1     1 "FTT" 16892
#>  2     1 "JAD" 14413
#>  3     1 "AHN" 12680
#>  4     1 "TKZ" 15536
#>  5     1 ""       NA
#>  6     2 ""       NA
#>  7     2 ""       NA
#>  8     2 ""       NA
#>  9     2 ""       NA
#> 10     2 ""       NA
#> # … with 15 more rows

### Second method Everything is a string
df.split %>%
  unite(ED,ED1:ED5,sep=" ", na.rm=T)%>%
  mutate( ED = strsplit(ED," ")) %>%
  unnest(c(E, ED))
#> # A tibble: 25 x 3
#>       ID E     ED   
#>    <int> <chr> <chr>
#>  1     1 "FTT" 16892
#>  2     1 "JAD" 14413
#>  3     1 "AHN" 12680
#>  4     1 "TKZ" 15536
#>  5     1 ""    NA   
#>  6     2 ""    NA   
#>  7     2 ""    NA   
#>  8     2 ""    NA   
#>  9     2 ""    NA   
#> 10     2 ""    NA   
#> # … with 15 more rows
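
Both methods above return the dates as numbers or strings. As an alternative sketch, not one of the methods in this answer, `pivot_longer()` with the `.value` sentinel can pair each event position with its date column and keep the Date class. The two-event data frame `small` and the helper column `pos` are assumptions for illustration; with the real data the `1:2` would become `1:30`.

```r
library(dplyr)
library(tidyr)

# Small made-up data frame with at most two events per ID.
small <- data.frame(
  ID  = 1:3,
  E   = c("FTT JAD", "", "AJN"),
  ED1 = as.Date(c("2016-04-01", NA, "2019-05-15")),
  ED2 = as.Date(c("2009-06-18", NA, NA)),
  stringsAsFactors = FALSE
)

long <- small %>%
  # E -> E1, E2 (missing positions become NA)
  separate(E, into = paste0("E", 1:2), sep = " ", fill = "right") %>%
  # E1/ED1 and E2/ED2 share a position, so they land in the same row
  pivot_longer(-ID,
               names_to = c(".value", "pos"),
               names_pattern = "^(ED?)(\\d+)$") %>%
  filter(!is.na(ED))

long
```

Because the pairing is done by column position rather than by splitting strings, rows where E and ED have different counts do not raise an error; they show up as a code without a date (dropped by the final filter) or a date with an `NA` code.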

Tidy all the way:

df%>% 
    separate(E, into=paste0("E", 1:5), fill="right", sep=" ") %>%
    unite(E, E1:E5,sep=" ") %>%
    unite(ED, ED1:ED5,sep=" ") %>%
    mutate(E=strsplit(E," "), ED=strsplit(ED," ")) %>%
    unnest(c(E,ED)) %>% mutate(ED=as.Date(ED)) %>% filter(!is.na(ED))
#>      ID E     ED        
#>   <int> <chr> <date>     
#> 1     1 FTT   2016-04-01
#> 2     1 JAD   2009-06-18
#> 3     1 AHN   2004-09-19
#> 4     1 TKZ   2012-07-15
#> 5     3 JAD   2014-12-31
#> 6     3 FTT   2007-11-12
#> 7     4 AJN   2019-05-15
#> 8     5 TKZ   2005-05-04
#> 9     5 AHD   2004-04-09
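
For the follow-up question, once the events are in long format like the result above, a flag per ID can be built by filtering on event type and date window and then testing membership. This is only a sketch under assumptions: the data frames `events` and `people`, the column `attended_JAD`, the event type "JAD", and the window bounds are all made up for illustration, and the bounds are treated as inclusive.

```r
library(dplyr)

# Long-format events, shaped like the result above (values are made up).
events <- data.frame(
  ID = c(1L, 1L, 3L, 4L),
  E  = c("FTT", "JAD", "JAD", "AJN"),
  ED = as.Date(c("2016-04-01", "2009-06-18", "2014-12-31", "2019-05-15")),
  stringsAsFactors = FALSE
)

# Hypothetical second data frame with one row per ID.
people <- data.frame(ID = 1:5)

# IDs with at least one "JAD" event inside the window (bounds inclusive).
hits <- events %>%
  filter(E == "JAD",
         ED >= as.Date("2009-01-01"),
         ED <= as.Date("2010-12-31")) %>%
  distinct(ID)

people <- people %>%
  mutate(attended_JAD = ID %in% hits$ID)

people
```

Using `distinct(ID)` before the membership test means an ID that attended the same event type several times in the window still yields a single `TRUE`.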