Create tidy dataset from an untidy one with two variables per column and implicit missings

Question

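I have an untidy dataset in which each of two columns combines two variables, some of them missing (a small subsample is shown in the data frame below). I am struggling to create the desired tidy dataset shown further down.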

untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", 
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%", 
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")

Desired data frame

N_patients  N_ears  pct_patients  pct_ears
173         NA      58.61           NA
 60         NA      13.30           NA
 54         96      11.11           NA
168        328      52.38           NA
906       1685      14.79        10.45

Thanks!

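There always seems to be an edge case: neither answer handles what happens in row 5 of my actual data. It looks like purely a regex problem. Any suggestions on how to fix it?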

untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", 
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%", 
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

That is, in row 5, [35.33%] is parsed as pct_patients rather than pct_ears:

   N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1       173     58.61%        173     NA        58.61       NA
2        60     13.30%         60     NA        13.30       NA
3   54 [96]     11.11%         54     96        11.11       NA
4 168 [328]     52.38%        168    328        52.38       NA
5  75 [150]   [35.33%]         75    150        35.33       NA

Answer 1

Fortunately, this is very easy with the tidyr package from the tidyverse.

library(tidyverse)

test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"), 
                       `% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")), 
                  .Names = c("N [ears]", "% Otorrhea"), 
                  row.names = c(NA, 5L), class = "data.frame")

test %>% 
    separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
    separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
    mutate(across(everything(), parse_number))  # replaces the deprecated mutate_each(funs(parse_number))
#>   N_patients N_ears pct_patients pct_ears
#> 1        173     NA        58.61       NA
#> 2         60     NA        13.30       NA
#> 3         54     96        11.11       NA
#> 4        168    328        52.38       NA
#> 5        906   1685        14.79    10.45

Answer 2

Here is an alternative using the extract() function and a regular expression:

library(tidyr)
test %>% 
    extract(`N [ears]`, into = c("N_patients", "N_ears"), 
            regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>% 
    extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"), 
            regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")

#  N_patients N_ears pct_patients pct_ears
#1        173   <NA>        58.61     <NA>
#2         60   <NA>        13.30     <NA>
#3         54     96        11.11     <NA>
#4        168    328        52.38     <NA>
#5        906   1685        14.79    10.45

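Here we use a non-capturing group (?:...) followed by ? so that the bracketed ears value is optional: when it is present its inner group is captured, and when it is absent the column is left missing.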
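The question's follow-up shows a row where only the bracketed value is present ("[35.33%]"). One possible way to cover it with the same idea is to make the leading value optional as well. The sketch below checks such a pattern in base R so that it runs without tidyr; `capture_pair` is a hypothetical helper and `pct_regex` is an assumed extension of the pattern above, not something taken from the answer. It has not been checked against the asker's full dataset.

```r
# Hypothetical helper (not part of the answer above): returns the two capture
# groups of `pattern` for each element of `x`; unmatched groups become NA.
capture_pair <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  out <- t(vapply(m, function(g) {
    if (length(g) == 3) g[2:3] else c(NA_character_, NA_character_)
  }, character(2)))
  out[!nzchar(out)] <- NA  # an unmatched optional group comes back as ""
  out
}

# Assumed extension of the percentage pattern: the leading value and the
# bracketed value are each wrapped in their own optional non-capturing group.
pct_regex <- "^(?:([.0-9]+)%)?\\s?(?:\\[([.0-9]+)%\\])?$"

pct <- capture_pair(c("58.61%", "14.79% [10.45%]", "[35.33%]"), pct_regex)
colnames(pct) <- c("pct_patients", "pct_ears")
pct
```

If the pattern behaves the same way inside extract(), the bracket-only row should end up in pct_ears rather than pct_patients. The values are still character at this point and would need converting to numeric.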

Answer 3

The best answer for my actual dataset was given in the comments by https://whosebug.com/users/4497050/alistaire

It is shown below, wrapped in a simple function.

    library(tidyverse)

    make_tidy <- function(untidy) {
        # separate_() is the older standard-evaluation form; it is deprecated in current tidyr
        untidy %>% 
            separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>% 
            separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
    }

    tidy_2 <- make_tidy(untidy_2)

This parses untidy_2 correctly:

> tidy_2
# A tibble: 5 × 4
  N_patients N_ears pct_patients pct_ears
*      <int>  <int>        <dbl>    <dbl>
1        173     NA        58.61       NA
2         60     NA        13.30       NA
3         54     96        11.11       NA
4        168    328        52.38       NA
5        906   1685        14.79    10.45
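To see why a separator that matches every run of characters other than digits and dots can cope with a bracket-only value such as "[35.33%]" (the row from the question's follow-up table), the splitting step can be imitated in base R. This is only an illustrative sketch: it uses `strsplit()` rather than whatever `separate_()` does internally, and `to_pair` is a hypothetical helper. Unlike `separate_()`, `strsplit()` drops a trailing empty piece, so the sketch pads each result to two elements.

```r
pct <- c("58.61%", "14.79% [10.45%]", "[35.33%]")

# Split on every run of characters that are neither digits nor dots.
# A match at the very start of a string leaves a leading "" piece.
pieces <- strsplit(pct, "[^\\d.]+", perl = TRUE)

# Hypothetical helper: pad to two pieces and convert; "" becomes NA.
to_pair <- function(p) as.numeric(c(p, NA, NA)[1:2])

tidy_pct <- t(vapply(pieces, to_pair, numeric(2)))
colnames(tidy_pct) <- c("pct_patients", "pct_ears")
tidy_pct
```

Because the opening bracket is itself a separator, the bracket-only value yields an empty first piece, so the number falls into the second column instead of the first.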

regex

r

tidyr