Reshape data to long format based on the first letter of the variable names
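I am trying to reshape my data to long format based on the first letter of the variable names. I have data from mothers and fathers, indicated by the first letter of each variable, as in this dataset: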
toydat <- data.frame(id = 1:10,
                     mincome = rep(sample(1:5), 2),
                     medu = rep(sample(1:5), 2),
                     methnicity = rep(sample(1:5), 2),
                     fincome = rep(sample(1:5), 2),
                     fedu = rep(sample(1:5), 2),
                     fethnicity = rep(sample(1:5), 2)
)
The final data should look like this:
gender income edu ethnicity
mother 3 4 3
mother 2 2 4
mother 5 3 2
mother 3 4 2
mother 4 3 3
mother 2 2 1
mother 3 3 4
mother 4 4 4
mother 3 3 5
mother 2 2 1
father 5 5 2
father 3 3 3
father 4 2 2
father 2 2 4
father 3 1 5
father 4 4 1
father 4 5 2
father 3 2 3
father 3 3 2
father 1 2 1
Any help would be greatly appreciated!
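Edit
Thanks to @akrun, my original question is solved. I am now wondering what to do if the gender indicator m or f is at the end of the name. How should the regular expression for names_sep be written in that case?
With the following code, the gender variable is created, but the variables are not split.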
toydat %>%
  select(-id) %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "gender"),
               names_sep = "(<=[a-z])(?=[mf]$)") %>%
  mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))
# A tibble: 10 x 7
gender mincome medu methnicity fincome fedu fethnicity
<chr> <int> <int> <int> <int> <int> <int>
1 father 1 3 4 5 5 5
2 father 5 4 3 3 1 4
3 father 3 2 2 1 4 2
4 father 2 1 1 4 2 1
5 father 4 5 5 2 3 3
6 father 1 3 4 5 5 5
7 father 5 4 3 3 1 4
8 father 3 2 2 1 4 2
9 father 2 1 1 4 2 1
10 father 4 5 5 2 3 3
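We remove the 'id' column and then reshape all columns to long format with pivot_longer. The names_sep argument is a regex lookaround that splits each name between the 'm' or 'f' at the start (^) of the string and the letter that follows it. Finally, we recode the 'gender' column with case_when, changing 'm' to 'mother' and 'f' to 'father'.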
library(dplyr)
library(tidyr)
toydat %>%
  select(-id) %>%
  pivot_longer(cols = everything(),
               names_to = c("gender", ".value"),
               names_sep = "(?<=^[mf])(?=[a-z])") %>%
  mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))
-output
# A tibble: 20 x 4
# gender income edu ethnicity
# <chr> <int> <int> <int>
# 1 mother 3 5 3
# 2 father 4 5 5
# 3 mother 4 3 5
# 4 father 3 1 1
# 5 mother 2 1 2
# 6 father 2 3 3
# 7 mother 1 2 1
# 8 father 5 2 4
# 9 mother 5 4 4
#10 father 1 4 2
#11 mother 3 5 3
#12 father 4 5 5
#13 mother 4 3 5
#14 father 3 1 1
#15 mother 2 1 2
#16 father 2 3 3
#17 mother 1 2 1
#18 father 5 2 4
#19 mother 5 4 4
#20 father 1 4 2
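To see where a lookaround pattern like this splits a name, one option is to insert a visible marker at the zero-width match. This is only a sketch: it uses base R's sub() with perl = TRUE for illustration, which is not necessarily the engine tidyr uses internally, and the suffix-style names are hypothetical renamed columns. It also suggests why a suffix pattern leaves the original prefix-style names unsplit.

```r
# Patterns: indicator at the start of the name vs. at the end of the name.
prefix_sep <- "(?<=^[mf])(?=[a-z])"
suffix_sep <- "(?<=[a-z])(?=[mf]$)"

prefix_names <- c("mincome", "medu", "fethnicity")
suffix_names <- c("incomem", "edum", "ethnicityf")  # hypothetical renamed columns

# Put a "|" at the position where each pattern would split the name.
marked_prefix <- sub(prefix_sep, "|", prefix_names, perl = TRUE)
marked_suffix <- sub(suffix_sep, "|", suffix_names, perl = TRUE)

# The suffix pattern finds nothing to split in the prefix-style names,
# so those names come back unchanged.
unmatched <- sub(suffix_sep, "|", prefix_names, perl = TRUE)

print(marked_prefix)
print(marked_suffix)
print(unmatched)
```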
The output values differ from the expected ones because the OP used sample without set.seed when constructing the input example.
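For a reproducible version of the example data, the seed can be fixed before sampling. A minimal sketch, assuming a hypothetical helper make_toydat and an arbitrary seed of 42 (neither is from the original post):

```r
# Hypothetical helper: builds the toy data with a fixed seed so that
# repeated calls return identical values. The seed 42 is arbitrary.
make_toydat <- function(seed = 42) {
  set.seed(seed)
  data.frame(id = 1:10,
             mincome = rep(sample(1:5), 2),
             medu = rep(sample(1:5), 2),
             methnicity = rep(sample(1:5), 2),
             fincome = rep(sample(1:5), 2),
             fedu = rep(sample(1:5), 2),
             fethnicity = rep(sample(1:5), 2))
}

toydat <- make_toydat()
str(toydat)
```

Calling make_toydat() again with the same seed gives the same data frame, so any posted output can be compared against the code that produced it.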
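For the edited part, we swap the order of the elements in names_to and change the names_sep regex lookaround so that it splits before the trailing 'm' or 'f'.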
# move the leading 'm' or 'f' to the end of each column name
# (backreferences need doubled backslashes inside an R string)
names(toydat)[-1] <- sub("^(.)(.*)", "\\2\\1", names(toydat)[-1])
toydat %>%
  select(-id) %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "gender"),
               names_sep = "(?<=[a-z])(?=[mf]$)") %>%
  mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))
-output
# A tibble: 20 x 4
# gender income edu ethnicity
# <chr> <int> <int> <int>
# 1 mother 1 2 1
# 2 father 5 5 1
# 3 mother 5 4 3
# 4 father 4 4 2
# 5 mother 3 3 4
# 6 father 2 2 4
# 7 mother 4 5 2
# 8 father 3 1 3
# 9 mother 2 1 5
#10 father 1 3 5
#11 mother 1 2 1
#12 father 5 5 1
#13 mother 5 4 3
#14 father 4 4 2
#15 mother 3 3 4
#16 father 2 2 4
#17 mother 4 5 2
#18 father 3 1 3
#19 mother 2 1 5
#20 father 1 3 5
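As an alternative to names_sep, pivot_longer also accepts names_pattern, where capture groups pick out the pieces of each name. This is a different technique from the lookaround split above, shown only as a sketch: the small data frame wide and its values are made up for illustration, and id is kept here rather than dropped.

```r
library(tidyr)

# Hypothetical wide data with the gender indicator at the end of each name.
wide <- data.frame(id = 1:3,
                   incomem = c(3, 2, 5), edum = c(4, 2, 3),
                   incomef = c(5, 3, 4), eduf = c(5, 3, 2))

# Group 1 is the variable name (".value"), group 2 the trailing m/f.
long <- pivot_longer(wide,
                     cols = -id,
                     names_to = c(".value", "gender"),
                     names_pattern = "^(.*)([mf])$")

# Recode the single-letter indicator to a readable label.
long$gender <- ifelse(long$gender == "m", "mother", "father")
print(long)
```

Capture groups can be easier to read than paired lookarounds when the indicator sits at a fixed position in the name.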