Programmatically rename data frame columns using abbreviations derived from labels
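I have a labelled data frame with arbitrary column names, and I would like to rename the columns in a non-arbitrary way using the labels.
Here is a simplified version of the data frame: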
library(labelled)
library(tidyverse)
df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)
For the columns to be renamed, each label has two or three components, separated by a colon followed by a single space:
var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )
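So each column's label may have two components (e.g. B103) or three components (e.g. B102). If a label has none of the three components (e.g. id), the column does not need to be renamed.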
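To make that structure concrete, here is a minimal base R sketch that splits each label on the colon-plus-space separator and counts the components. The `labels` vector repeats the labels from the question so the snippet runs on its own; `parts` and `n_components` are names introduced here for illustration only.

```r
# Labels repeated from the question so the snippet is self-contained
labels <- c(
  id   = "ID",
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

# Split on the literal separator ": " (fixed = TRUE avoids regex interpretation)
parts <- strsplit(labels, ": ", fixed = TRUE)

# Number of components found in each label
n_components <- lengths(parts)
n_components
```

A label such as "ID" yields a single piece that matches none of the known components, which is one way to recognise columns that should be left alone.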
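I would like to abbreviate the components of the labels as follows:
- Component 1
  - "Estimates:" -> e
  - "Margins of error:" -> m
- Component 2
  - "Less than $10,000:" or "Less than $10,000" -> i0to9
  - "$10,000 to $19,999:" or "$10,000 to $19,999" -> i10to19
- Component 3
  - "Less than 20.0 percent" -> p0to19
  - "20.0 to 24.9 percent" -> p20to24
Then I would like to rename each variable by concatenating the abbreviated components, separated by underscores. Doing this manually in an ad hoc way is obviously simple: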
df %>%
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103,
         m_i0to9_p0to19 = B104)
But how can I do this programmatically, using tidyverse principles and packages?
Here is one way:
list(
  id = "ID",
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
) %>%
  stringr::str_replace_all(
    c('Estimates: ' = 'e_', "Margins of error: " = 'm', "Less than \\$10,000: " = 'i0to9',
      "\\$10,000 to \\$19,999[:]?[ ]?" = 'i10to19', "Less than 20.0 percent" = 'p0to19',
      "20.0 to 24.9 percent" = 'p20to24')
  ) %>%
  setNames(names(df), .) %>%
  rename(df, .)
Output:
# A tibble: 1 x 5
  ID    e_i0to9p0to19 e_i0to9p20to24 e_i10to19 mi0to9p0to19
  <chr>         <dbl>          <dbl>     <dbl>        <dbl>
1 a                 1              2         3          0.1
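Note that the names above lose the underscores between components, and that `id` becomes `ID`. One possible variation, sketched here under the assumption that every label matches the patterns exactly, keeps the separator out of the individual patterns, turns each remaining ": " into "_" as the last replacement, and renames only the columns whose label actually changed. The names `abbr`, `new`, `changed`, `lookup` and `renamed` are introduced for this illustration and are not part of the answer.

```r
library(dplyr)
library(stringr)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

# Labels repeated from the answer above so the snippet is self-contained
labels <- c(
  id   = "ID",
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

# Replacements are applied in order; the separator is handled last
abbr <- c(
  "Estimates"              = "e",
  "Margins of error"       = "m",
  "Less than \\$10,000"    = "i0to9",
  "\\$10,000 to \\$19,999" = "i10to19",
  "Less than 20.0 percent" = "p0to19",
  "20.0 to 24.9 percent"   = "p20to24",
  ": "                     = "_"
)

new <- str_replace_all(labels, abbr)

# Keep only the columns whose label was actually abbreviated (this skips "ID")
changed <- new != labels
lookup  <- setNames(names(labels)[changed], new[changed])

# all_of() with a named vector renames: names are new, values are old
renamed <- rename(df, all_of(lookup))
renamed
```

Because the lookup vector is built only from labels that changed, a column such as `id` keeps its original name.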
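We can use var_label(df) from the labelled package directly, since it is already being used to modify the attributes of the data frame.
You get a list, which you can then iterate over with map. I am not very happy with repeating map(., ..., but at this point I do not know how to apply the DRY (don't repeat yourself) idea here: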
library(tidyverse)
library(labelled)
colnames(df) <- var_label(df) %>%
  map(., ~str_replace(., "Estimates:", "e")[[1]]) %>%
  map(., ~str_replace(., "Margins of error:", "m")[[1]]) %>%
  map(., ~str_replace(., "Less than \\$10,000\\:", "i0to9")[[1]]) %>%
  map(., ~str_replace(., "\\$10,000 to \\$19,999", "i10to19")[[1]]) %>%
  map(., ~str_replace(., "Less than 20.0 percent", "p0to19")[[1]]) %>%
  map(., ~str_replace(., "20.0 to 24.9 percent", "p20to24")[[1]]) %>%
  map(., ~str_replace_all(., " ", "_")[[1]]) %>%
  map(., ~str_replace_all(., ":", "")[[1]])
  e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
           <dbl>           <dbl>     <dbl>          <dbl>
1              1               2         3            0.1
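On the DRY concern: one possible way to avoid the repeated map() calls, offered as a sketch rather than as part of the answer, is to keep the patterns and replacements in two parallel vectors and fold them over the labels with `purrr::reduce2()`. The vectors `patterns` and `replacements` are names introduced here, the labels are the ones used in the other answers' listings, and `str_replace_all()` is used for every step, which assumes each pattern occurs at most once per label.

```r
library(purrr)
library(stringr)

# Labels repeated here so the snippet is self-contained
labels <- c(
  id   = "ID",
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

# Parallel vectors: patterns[i] is replaced by replacements[i], in this order
patterns <- c(
  "Estimates:", "Margins of error:",
  "Less than \\$10,000:", "\\$10,000 to \\$19,999",
  "Less than 20.0 percent", "20.0 to 24.9 percent",
  " ", ":"
)
replacements <- c("e", "m", "i0to9", "i10to19", "p0to19", "p20to24", "_", "")

# reduce2() calls str_replace_all(acc, pattern, replacement) once per pair,
# feeding the result of each call into the next one
new_names <- reduce2(patterns, replacements, str_replace_all, .init = labels)
new_names
```

Adding another abbreviation then means adding one entry to each vector instead of another map() line.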
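This is a somewhat lengthier solution whose goal is to stay highly flexible to any change in the structure or values of the mapping. If your problem is a one-off, I recommend the other good answers already given here. I review the benefits of this solution at the end.
First, define your mappings in a table. This lets you change them easily in the future, or add new ones when necessary: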
library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)
components = tribble(
  ~ id, ~ name,                   ~ new_name,
  1,    "Estimates",              "e",
  1,    "Margins of error",       "m",
  2,    "Less than $10,000",      "i0to9",
  2,    "$10,000 to $19,999",     "i10to19",
  3,    "Less than 20.0 percent", "p0to19",
  3,    "20.0 to 24.9 percent",   "p20to24"
)
From this we can generate a regular expression:
component_regex = components %>%
  split(.$id) %>%
  # Escape dollar signs so that they are matched literally
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Allow for the possibility of a trailing colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")
Here is the regular expression:
component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
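To see what this pattern captures, the following sketch applies `str_match()` to a single label. The regex is typed out by hand as a copy of the printed value, so it is an assumption that it is identical to the generated `component_regex`; `label` and `m` are names introduced for illustration.

```r
library(stringr)

# Hand-written copy of the generated pattern, one optional group per component
component_regex <- paste0(
  "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?",
  "(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?",
  "(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
)

# A label with only two components
label <- "Estimates: $10,000 to $19,999"

# Column 1 is the full match; columns 2 to 4 are the three capture groups
m <- str_match(label, component_regex)
m
```

The first group still carries its trailing colon, and the third group is NA because this label has no third component; both cases are dealt with in the steps that follow.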
Now we extract the components from each label to create a data frame:
data_labels = labels %>%
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>%
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))
# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>
1 Estimates        Less than $10,000  Less than 20.0 percent B101
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102
3 Estimates        $10,000 to $19,999 NA                     B103
4 Margins of error Less than $10,000  Less than 20.0 percent B104
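Now we reshape this table so that we can join it to the components table from before and extract the new names. I will show a partial result first so that you can see what is happening: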
data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))
# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>
 1 B101         1 Estimates              e
 2 B101         2 Less than $10,000      i0to9
 3 B101         3 Less than 20.0 percent p0to19
 4 B102         1 Estimates              e
 5 B102         2 Less than $10,000      i0to9
 6 B102         3 20.0 to 24.9 percent   p20to24
 7 B103         1 Estimates              e
 8 B103         2 $10,000 to $19,999     i10to19
 9 B104         1 Margins of error       m
10 B104         2 Less than $10,000      i0to9
11 B104         3 Less than 20.0 percent p0to19
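Note that inner_join() drops the rows where there is no third component, because their NA value has no match in components. Here is how to finish: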
new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  # order(id) puts the abbreviations in component order before pasting
  summarise(final_name = paste(new_name[order(id)], collapse = "_"))
# A tibble: 4 x 2
  col_name final_name
  <chr>    <chr>
1 B101     e_i0to9_p0to19
2 B102     e_i0to9_p20to24
3 B103     e_i10to19
4 B104     m_i0to9_p0to19
Now we simply replace the old names with the new ones:
old_names = intersect(names(df), new_names$col_name)
df %>%
  rename_with(
    # Look each old name up by value, so the result does not depend on row order
    ~ new_names$final_name[match(.x, new_names$col_name)],
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1
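This solution may look long, but it has some benefits:
- The mappings can be stored in a CSV file and modified outside the code. That is, the code does not actually depend on your mappings.
- You can add or remove entries for each component.
- It works whether or not any component is missing.
- It works with more than three components.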
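As an illustration of the first point, a minimal sketch of keeping the mapping in a CSV file might look like the following. The file name `components.csv` is hypothetical, and the snippet writes the file to a temporary directory first only so that it can run on its own; in practice the file would be maintained by hand outside the code.

```r
library(readr)
library(tibble)

# The mapping table, repeated here so the snippet is self-contained
components <- tribble(
  ~id, ~name,                    ~new_name,
  1,   "Estimates",              "e",
  1,   "Margins of error",       "m",
  2,   "Less than $10,000",      "i0to9",
  2,   "$10,000 to $19,999",     "i10to19",
  3,   "Less than 20.0 percent", "p0to19",
  3,   "20.0 to 24.9 percent",   "p20to24"
)

# Hypothetical file location, used only for this demonstration
path <- file.path(tempdir(), "components.csv")
write_csv(components, path)

# Read the mapping back, stating the column types so `id` stays numeric
components_from_csv <- read_csv(
  path,
  col_types = cols(
    id       = col_double(),
    name     = col_character(),
    new_name = col_character()
  )
)
components_from_csv
```

Names that contain a comma, such as "Less than $10,000", are quoted when the file is written, so they come back unchanged.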
df %>%
  set_names(var_label(.) %>%
              unlist() %>%
              str_replace_all(c("Estimates: " = 'e',
                                "Margins of error:" = "m",
                                "Less than \\$10,000:?" = "i0to9",
                                "\\$10,000 to \\$19,999" = "i10to19",
                                "Less than 20.0 percent" = "p0to19",
                                "20.0 to 24.9 percent" = "p20to24",
                                ' ' = '_')))
# A tibble: 1 x 5
  ID    ei0to9_p0to19 ei0to9_p20to24 ei10to19 m_i0to9_p0to19
  <chr>         <dbl>          <dbl>    <dbl>          <dbl>
1 a                 1              2        3            0.1