Programmatically rename dataframe columns using abbreviations derived from labels

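I have a labelled data frame with arbitrary column names, and I would like to rename those columns in a non-arbitrary way using the labels.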

Here is a simplified version of the data frame:

library(labelled)
library(tidyverse)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

For the columns to be renamed, each label has two or three components (separated by a colon followed by a single space):

var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )

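So each column's label may have two components (e.g., B103) or three components (e.g., B102). If a label has none of the three components (e.g., id), the column does not need to be renamed.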
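To make the label structure concrete, here is a minimal sketch that splits a few of the labels on the literal separator ": " and counts the pieces. It only illustrates the structure and is not a solution; the object names `labels`, `parts`, and `n_components` are chosen for this example.

```r
library(stringr)

labels <- c(
  id   = "ID",
  B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999"
)

# Split on the literal separator ": " (a colon followed by one space).
parts <- str_split(labels, fixed(": "))

# Number of components found in each label.
n_components <- lengths(parts)
```

A label that yields a single piece, such as the one for id, has none of the three components and can be left alone.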

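I would like to abbreviate the components of the labels as follows:

Component               Abbreviation
Estimates               e
Margins of error        m
Less than $10,000       i0to9
$10,000 to $19,999      i10to19
Less than 20.0 percent  p0to19
20.0 to 24.9 percent    p20to24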

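Then I would like to rename each variable by concatenating the components, separated by underscores. Obviously, doing this by hand in an ad hoc way is simple: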

df %>% 
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103, 
         m_i0to9_p0to19 = B104)

But how can I accomplish this programmatically, using tidyverse principles and packages?

Here is one approach:

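list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
    B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
) %>% 
    # Flatten the list to a named character vector before the string replacement
    unlist() %>% 
    stringr::str_replace_all(
        c('Estimates: ' = 'e_', "Margins of error: " = 'm', "Less than \\$10,000: " = 'i0to9',
          "\\$10,000 to \\$19,999[:]?[ ]?" = 'i10to19', "Less than 20.0 percent" = 'p0to19', 
          "20.0 to 24.9 percent" = 'p20to24')
    ) %>% 
    # Build a named vector of the form c(new_name = "old_name")
    setNames(names(df), .) %>% 
    # Braces stop the pipe from also inserting `.` as the first argument
    { rename(df, all_of(.)) }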

Output:

# A tibble: 1 x 5
  ID    e_i0to9p0to19 e_i0to9p20to24 e_i10to19 mi0to9p0to19
  <chr>         <dbl>          <dbl>     <dbl>        <dbl>
1 a                 1              2         3          0.1

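Since the labelled package is already in use, we can work directly with the data frame's label attributes via var_label(df).

This gives you a list, which you can then iterate over with map. I am not very happy about repeating map(., ...), but at this point I don't know how to apply the DRY (Don't Repeat Yourself) idea: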

library(tidyverse)
library(labelled)

colnames(df) <- var_label(df) %>% 
  map(., ~str_replace(., "Estimates:", "e")[[1]]) %>% 
  map(., ~str_replace(., "Margins of error:", "m")[[1]]) %>% 
  map(., ~str_replace(., "Less than \\$10,000:", "i0to9")[[1]]) %>% 
  map(., ~str_replace(., "\\$10,000 to \\$19,999", "i10to19")[[1]]) %>% 
  map(., ~str_replace(., "Less than 20.0 percent", "p0to19")[[1]]) %>% 
  map(., ~str_replace(., "20.0 to 24.9 percent", "p20to24")[[1]]) %>% 
  map(., ~str_replace_all(., " ", "_")[[1]]) %>% 
  map(., ~str_replace_all(., ":", "")[[1]])

  e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
           <dbl>           <dbl>     <dbl>          <dbl>
1              1               2         3            0.1
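One possible way to address the repetition mentioned above is to keep the pattern/replacement pairs in a single named vector and fold over them with reduce2() from purrr. This is a sketch under assumptions rather than part of the answer: the names `replacements` and `new_names` are made up for illustration, only three labels are used, and str_replace_all() is applied at every step (the answer uses str_replace() for most steps, which gives the same result for these labels).

```r
library(purrr)
library(stringr)

labels <- list(
  id   = "ID",
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B103 = "Estimates: $10,000 to $19,999"
)

# Ordered pattern -> replacement pairs; the patterns are regular expressions.
replacements <- c(
  "Estimates:"               = "e",
  "Margins of error:"        = "m",
  "Less than \\$10,000:?"    = "i0to9",
  "\\$10,000 to \\$19,999:?" = "i10to19",
  "Less than 20.0 percent"   = "p0to19",
  "20.0 to 24.9 percent"     = "p20to24",
  " "                        = "_"
)

# Apply each replacement in turn, carrying the partially renamed labels along.
new_names <- reduce2(
  names(replacements), replacements,
  function(acc, pattern, replacement) str_replace_all(acc, pattern, replacement),
  .init = unlist(labels)
)
```

Adding or changing an abbreviation then means editing one entry in `replacements` instead of adding another map() call.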

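This is a somewhat lengthy solution whose goal is to stay flexible under any change to the structure or values of the mapping. If your problem is a one-off, I recommend the other fine answers already given here. I review the benefits of this solution at the end.

First, define your mappings in a table. This lets you change them easily in the future, or add new ones when necessary: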

library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

components = tribble(
  ~ id, ~ name, ~ new_name,
  1, "Estimates", "e",
  1, "Margins of error", "m",
  2, "Less than $10,000", "i0to9",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19",
  3, "20.0 to 24.9 percent", "p20to24"
)

From this we can generate a regular expression:

component_regex = components %>%
  split(.$id) %>%
  # Escape dollar signs so that they match literally
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Include a regex condition for the possibility of there being a colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")

Here is the regular expression:

component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
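To see what the generated pattern captures, here is a small sketch that applies it to a single two-component label with str_match(). The regex is pasted in as a literal so the snippet runs on its own; the object name `m` is just for illustration.

```r
library(stringr)

component_regex <- "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"

# Column 1 is the full match; columns 2 to 4 are the three optional components.
m <- str_match("Estimates: $10,000 to $19,999", component_regex)
```

The first component keeps its trailing colon at this stage, and the third capture group is NA because this label has no third component.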

Now we extract the components from each label to build a data frame:

data_labels = labels %>% 
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>% 
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))

# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>   
1 Estimates        Less than $10,000  Less than 20.0 percent B101    
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102    
3 Estimates        $10,000 to $19,999 NA                     B103    
4 Margins of error Less than $10,000  Less than 20.0 percent B104    

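Now we reshape this table so that we can join it to the components table from earlier and extract the new names. I will show a partial result first so you can see what is happening: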

data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))

# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>   
 1 B101         1 Estimates              e       
 2 B101         2 Less than $10,000      i0to9   
 3 B101         3 Less than 20.0 percent p0to19  
 4 B102         1 Estimates              e       
 5 B102         2 Less than $10,000      i0to9   
 6 B102         3 20.0 to 24.9 percent   p20to24 
 7 B103         1 Estimates              e       
 8 B103         2 $10,000 to $19,999     i10to19 
 9 B104         1 Margins of error       m       
10 B104         2 Less than $10,000      i0to9   
11 B104         3 Less than 20.0 percent p0to19  

Note that inner_join() drops the rows where a component is missing (here, the absent third component of B103). Here is how to finish:

new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  summarise(final_name = paste(new_name[order(id)], collapse = "_"))

# A tibble: 4 x 2
  col_name final_name     
  <chr>    <chr>          
1 B101     e_i0to9_p0to19 
2 B102     e_i0to9_p20to24
3 B103     e_i10to19      
4 B104     m_i0to9_p0to19 

Now we just replace the old names with the new ones:

old_names = intersect(names(df), new_names$col_name)
df %>% 
  rename_with(
    ~ new_names$final_name[match(.x, new_names$col_name)], 
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1
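As a possible variation on this last step, a named lookup vector of the form c(new = "old") can be passed to rename() through all_of(). This is a sketch rather than part of the approach above: the `new_names` table is re-created by hand so that the snippet runs on its own, and `lookup` and `renamed` are illustrative names.

```r
library(dplyr)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

# Stand-in for the new_names table produced in the previous step.
new_names <- tibble(
  col_name   = c("B101", "B102", "B103", "B104"),
  final_name = c("e_i0to9_p0to19", "e_i0to9_p20to24", "e_i10to19", "m_i0to9_p0to19")
)

# Named vector: names are the new column names, values are the old ones.
lookup <- setNames(new_names$col_name, new_names$final_name)

renamed <- rename(df, all_of(lookup))
```

Columns that do not appear in `lookup`, such as id, keep their original names.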

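This solution may look long, but it has some benefits:

  • The mappings can be stored in a CSV file and modified outside the code. In other words, the code does not actually depend on your mappings.
  • You can add or remove entries for each component.
  • It works whether or not any component is missing.
  • It works with more than three components.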
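
The first benefit could look like the following sketch, which writes the mapping to a CSV file and reads it back with readr. The file location is hypothetical (a temporary file is used so the snippet is self-contained); in practice the file would be maintained outside the code.

```r
library(readr)

# Hypothetical file location; a temporary file keeps the sketch self-contained.
mapping_path <- tempfile(fileext = ".csv")

# Names that contain a comma are quoted so they stay in a single field.
write_lines(
  c(
    'id,name,new_name',
    '1,Estimates,e',
    '1,Margins of error,m',
    '2,"Less than $10,000",i0to9',
    '2,"$10,000 to $19,999",i10to19',
    '3,Less than 20.0 percent,p0to19',
    '3,20.0 to 24.9 percent,p20to24'
  ),
  mapping_path
)

# Read the mapping back: id as double, name and new_name as character.
components <- read_csv(mapping_path, col_types = "dcc")
```

The resulting `components` table has the same columns as the one defined with tribble(), so the rest of the code can use it unchanged.
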
df %>%
  set_names(var_label(.) %>%
  unlist() %>%
  str_replace_all(c("Estimates: " = 'e',
      "Margins of error:" = "m",
      "Less than \\$10,000:?" = "i0to9",
      "\\$10,000 to \\$19,999" ="i10to19",
      "Less than 20.0 percent" = "p0to19",
      "20.0 to 24.9 percent" = "p20to24",
      ' ' = '_')))
# A tibble: 1 x 5
  ID    ei0to9_p0to19 ei0to9_p20to24 ei10to19 m_i0to9_p0to19
  <chr>         <dbl>          <dbl>    <dbl>          <dbl>
1 a                 1              2        3            0.1