Reshape multiple columns to 2 timevars with different times?
I have the following data frame:
date clinic MALE_0_1 MALE_1_2 MALE_2_3 ... MALE_94_95 MALE_95+ FEMALE_0_1 FEMALE_1_2 ... FEMALE_95+
2017-01-01 A 30 25 40 ... 70 90 28 22 ... 40
2017-01-01 B 21 15 30 ... 45 27 31 40 ... 55
2017-02-01 C 29 35 45 ... 34 25 33 38 ... 45
How can I create one like this:
date clinic GENDER AGE NUMBER_PATIENTS
2017-01-01 A MALE 0 30
2017-01-01 A FEMALE 0 28
2017-01-01 A MALE 1 25
2017-01-01 A FEMALE 1 22
....
2017-01-01 A MALE 95+ 90
2017-01-01 A FEMALE 95+ 40
2017-01-01 B MALE 0 21
2017-01-01 B FEMALE 0 31
....
2017-02-01 C MALE 0 29
2017-02-01 C FEMALE 0 33
MALE_0_1 corresponds to AGE = 0, MALE_1_2 corresponds to AGE = 1, and so on.
In the code below, how should I specify times so that it covers both FEMALE and MALE for GENDER and 0:95 for AGE?
df <- reshape(df,
              direction = "long",
              varying = list(names(df)[3:194]),
              v.names = "NUMBER_OF_PATIENTS",
              idvar = c("date", "clinic"),
              timevar = c("GENDER", "AGE"),
              times = ???)
Try this approach, which gets close to what you want:
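library(tidyverse)
# Convert every column except date to character, pivot to long format,
# split the old column names on "_", then turn the values back into numbers.
newdf <- df %>%
  mutate(across(-date, ~ as.character(.))) %>%
  pivot_longer(-c(date, clinic)) %>%
  # Names with only two pieces (the 95+ columns) get NA in V2;
  # fill = 'right' does this without a warning.
  separate(name, c('Gender', 'V1', 'V2'), sep = '_', fill = 'right') %>%
  mutate(value = as.numeric(value))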
Output:
# A tibble: 24 x 6
date clinic Gender V1 V2 value
<date> <chr> <chr> <chr> <chr> <dbl>
1 2017-01-01 A MALE 0 1 30
2 2017-01-01 A MALE 1 2 25
3 2017-01-01 A MALE 2 3 40
4 2017-01-01 A MALE 94 95 70
5 2017-01-01 A MALE 95. NA 90
6 2017-01-01 A FEMALE 0 1 28
7 2017-01-01 A FEMALE 1 2 22
8 2017-01-01 A FEMALE 95. NA 40
9 2017-01-01 B MALE 0 1 21
10 2017-01-01 B MALE 1 2 15
# ... with 14 more rows
You can specify the pattern to extract directly in pivot_longer.
# Backslashes must be doubled inside an R string literal.
tidyr::pivot_longer(df, cols = -c(date, clinic),
                    names_to = c('GENDER', 'AGE'),
                    names_pattern = '(.*?)_(\\d+\\+?)',
                    values_to = 'NUMBER_PATIENTS')
# date clinic GENDER AGE NUMBER_PATIENTS
# <chr> <chr> <chr> <chr> <int>
# 1 2017-01-01 A MALE 0 30
# 2 2017-01-01 A MALE 1 25
# 3 2017-01-01 A MALE 2 40
# 4 2017-01-01 A MALE 94 70
# 5 2017-01-01 A MALE 95+ 90
# 6 2017-01-01 A FEMALE 0 28
# 7 2017-01-01 A FEMALE 1 22
# 8 2017-01-01 A FEMALE 95+ 40
# 9 2017-01-01 B MALE 0 21
#10 2017-01-01 B MALE 1 15
# … with 14 more rows
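The rows above follow the original column order (all MALE ages, then all FEMALE ages), whereas the layout requested in the question interleaves MALE and FEMALE within each age. If that ordering matters, one possible way to get it is sketched below in base R. The small `long` data frame is an illustrative stand-in for the pivoted result, and `age_num` and `gender_rank` are helper names made up for this sketch.

```r
# Illustrative stand-in for the long result (one clinic, three age bands).
long <- data.frame(
  date = "2017-01-01",
  clinic = "A",
  GENDER = c("MALE", "MALE", "MALE", "FEMALE", "FEMALE", "FEMALE"),
  AGE = c("0", "1", "95+", "0", "1", "95+"),
  NUMBER_PATIENTS = c(30L, 25L, 90L, 28L, 22L, 40L),
  stringsAsFactors = FALSE
)

# AGE is character because of "95+", so strip the "+" to get a numeric sort key.
age_num <- as.integer(sub("\\+$", "", long$AGE))

# Put MALE before FEMALE within each age, as in the requested layout.
gender_rank <- match(long$GENDER, c("MALE", "FEMALE"))

sorted <- long[order(long$date, long$clinic, age_num, gender_rank), ]
rownames(sorted) <- NULL
sorted
```

Keeping AGE as character preserves the "95+" label; the numeric key is used only for ordering.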
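Here (.*?)_(\d+\+?) is a regular expression that extracts two capture groups from each column name. The first group is everything before the first underscore, and the second group is the number that follows it, with an optional + sign. The upper bound of each age band (the trailing _1 in MALE_0_1) is not captured, so it is dropped.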
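To see what the two groups capture, the pattern can be tried on a few column names outside of pivot_longer. This is a minimal base R sketch; the `cols` vector is a hand-picked subset of the names in the question, and `perl = TRUE` is used here so that `\d` and the lazy `.*?` are interpreted as in Perl-style regular expressions.

```r
# A few of the column names from the question.
cols <- c("MALE_0_1", "MALE_94_95", "MALE_95+", "FEMALE_0_1", "FEMALE_95+")

# Same pattern as above; backslashes are doubled inside the R string.
pattern <- "(.*?)_(\\d+\\+?)"

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(cols, regexec(pattern, cols, perl = TRUE))
parts <- do.call(rbind, m)

gender <- parts[, 2]  # first capture group
age <- parts[, 3]     # second capture group

data.frame(column = cols, GENDER = gender, AGE = age)
```

For these five names the first group comes out as MALE or FEMALE, and the second as 0, 94, 95+, 0 and 95+.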
Data
df <- structure(list(date = c("2017-01-01", "2017-01-01", "2017-02-01"
), clinic = c("A", "B", "C"), MALE_0_1 = c(30L, 21L, 29L), MALE_1_2 = c(25L,
15L, 35L), MALE_2_3 = c(40L, 30L, 45L), MALE_94_95 = c(70L, 45L,
34L), `MALE_95+` = c(90L, 27L, 25L), FEMALE_0_1 = c(28L, 31L,
33L), FEMALE_1_2 = c(22L, 40L, 38L), `FEMALE_95+` = c(40L, 55L,
45L)), class = "data.frame", row.names = c(NA, -3L))
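For completeness, the base reshape() call from the question can probably be made to work as well. As far as I can tell, reshape() writes a single timevar column, so this sketch stores the original column names in one temporary column (called `name` here, an arbitrary choice) and then splits it into GENDER and AGE with sub(). The data frame is a cut-down version of the sample data, and `value_cols` is a helper name introduced for the example.

```r
# Cut-down sample data; check.names = FALSE keeps the "+" in the column names.
df <- data.frame(
  date = c("2017-01-01", "2017-01-01", "2017-02-01"),
  clinic = c("A", "B", "C"),
  MALE_0_1 = c(30L, 21L, 29L),
  `MALE_95+` = c(90L, 27L, 25L),
  FEMALE_0_1 = c(28L, 31L, 33L),
  `FEMALE_95+` = c(40L, 55L, 45L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column that is not an identifier is a value column.
value_cols <- setdiff(names(df), c("date", "clinic"))

# Use the column names themselves as the "times" and keep them in one column.
long <- reshape(df,
                direction = "long",
                varying = list(value_cols),
                v.names = "NUMBER_PATIENTS",
                idvar = c("date", "clinic"),
                timevar = "name",
                times = value_cols)

# Split the stored column name into its two parts.
long$GENDER <- sub("_.*$", "", long$name)
long$AGE <- sub("^[^_]+_([0-9]+\\+?).*$", "\\1", long$name)

long <- long[, c("date", "clinic", "GENDER", "AGE", "NUMBER_PATIENTS")]
rownames(long) <- NULL
long
```

With the full data, `value_cols` would cover all 192 age columns without having to hard-code `3:194`.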