Create a tidy dataset from an untidy one with two variables per column and implicit missing values
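I have an untidy dataset that combines two variables in each of its two columns, with some values missing (a small subsample is in the data frame `untidy` below, which the answers refer to as `test`). I am struggling to create the desired tidy dataset shown below.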
untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")
Desired data frame:
N_patients N_ears pct_patients pct_ears
       173     NA        58.61       NA
        60     NA        13.30       NA
        54     96        11.11       NA
       168    328        52.38       NA
       906   1685        14.79    10.45
Thanks!
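There always seems to be an edge case: neither of the two answers accounts for something in row 5. It looks like it is just a regex problem. Any suggestions on how to fix it?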
untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
That is, in row 5, [35.33%] is parsed as pct_patients instead of pct_ears:
   N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1       173     58.61%        173     NA        58.61       NA
2        60     13.30%         60     NA        13.30       NA
3   54 [96]     11.11%         54     96        11.11       NA
4 168 [328]     52.38%        168    328        52.38       NA
5  75 [150]   [35.33%]         75    150        35.33       NA
Fortunately, this is very easy with the tidyr package from the tidyverse.
library(tidyverse)
test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"),
                       `% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")),
                  .Names = c("N [ears]", "% Otorrhea"),
                  row.names = c(NA, 5L), class = "data.frame")
test %>%
  # split each column at " [" and pad single-value rows with NA on the right
  separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
  separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
  # drop the leftover "%" and "]" and convert every column to numeric
  # (replaces the deprecated mutate_each(funs(parse_number)))
  mutate(across(everything(), parse_number))
#>   N_patients N_ears pct_patients pct_ears
#> 1        173     NA        58.61       NA
#> 2         60     NA        13.30       NA
#> 3         54     96        11.11       NA
#> 4        168    328        52.38       NA
#> 5        906   1685        14.79    10.45
Here is an alternative that uses the extract() function and a regular expression:
library(tidyr)
test %>%
  # first group: patient count; optional second group: ear count in brackets
  extract(`N [ears]`, into = c("N_patients", "N_ears"),
          regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>%
  # same idea for the percentages, with "%" kept outside the capture groups
  extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"),
          regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")
#   N_patients N_ears pct_patients pct_ears
# 1        173   <NA>        58.61     <NA>
# 2         60   <NA>        13.30     <NA>
# 3         54     96        11.11     <NA>
# 4        168    328        52.38     <NA>
# 5        906   1685        14.79    10.45
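Here the non-capturing group (?:...) wraps the bracketed part, and the ? after it makes that part optional, so the ears value is captured only when it is present.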
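To see what the optional group does without tidyr, the same percentage pattern can be tried in base R. This is only a sketch: `pct_pattern` and `parse_pair` are hypothetical names introduced here, and `regexec(..., perl = TRUE)` is an assumed stand-in for whatever regex engine extract() uses internally.

```r
# The percentage regex from the answer above, written as an R string.
pct_pattern <- "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$"

# Hypothetical helper (not part of tidyr): return the two capture groups
# of `pattern` as a numeric matrix with one row per input string.
parse_pair <- function(x, pattern) {
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each element is c(full match, group 1, group 2), or empty when nothing matched.
  # Indexing [2:3] yields NA for missing pieces, and as.numeric("") is NA as well.
  pairs <- vapply(matches, function(m) as.numeric(m[2:3]), numeric(2))
  t(pairs)
}

parse_pair(c("58.61%", "14.79% [10.45%]", "[35.33%]"), pct_pattern)
```

The first string should give 58.61 and NA, and the second 14.79 and 10.45. The third has no leading percentage, so the anchored pattern does not match at all and both values come back NA; a row of that shape is not handled by this pattern.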
The answer that worked best on my actual dataset came from a comment by alistaire (https://whosebug.com/users/4497050/alistaire). It is shown below, wrapped in a simple function.
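library(tidyverse)
make_tidy <- function(untidy){
  untidy %>%
    # the default sep splits on runs of non-alphanumeric characters;
    # extra = 'drop' discards the empty piece left after "]"
    separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>%
    # split on anything that is not a digit or a dot, so "%", " [" and "]" all act as separators
    separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
}
tidy_2 <- make_tidy(untidy_2)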
This parses untidy_2 correctly:
> tidy_2
# A tibble: 5 × 4
  N_patients N_ears pct_patients pct_ears
*      <int>  <int>        <dbl>    <dbl>
1        173     NA        58.61       NA
2         60     NA        13.30       NA
3         54     96        11.11       NA
4        168    328        52.38       NA
5        906   1685        14.79    10.45
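Why splitting on `[^\d.]+` copes with a value such as `[35.33%]` can be checked in base R. This is a minimal sketch, assuming strsplit() is a fair stand-in for the splitting step inside separate_(); `split_pct` is a hypothetical helper, and `[^0-9.]+` is used as an equivalent pattern that R's default regex engine accepts.

```r
# Hypothetical helper: split on runs of characters that are neither digits
# nor dots, then read the first two pieces by position.
split_pct <- function(x) {
  pieces <- strsplit(x, "[^0-9.]+")
  # p[1:2] pads short results with NA; an empty first piece becomes NA too,
  # because as.numeric("") is NA.
  first_two <- vapply(pieces, function(p) as.numeric(p[1:2]), numeric(2))
  data.frame(pct_patients = first_two[1, ], pct_ears = first_two[2, ])
}

split_pct(c("58.61%", "14.79% [10.45%]", "[35.33%]"))
```

A leading `[` produces an empty first piece, so `[35.33%]` should land in pct_ears with pct_patients left as NA, while `14.79% [10.45%]` fills both columns and `58.61%` fills only the first.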