purrr or loop over a table to automate the binding rows process

I have the following table:

df <- tribble(~name,                                               ~label,     ~year,    ~id,
              "base1.dta", "Generated biographical information",                1990,  "gbi",
              "base2.dta", "Generated biographical information",                1991,  "gbi",
              "base3.dta", "Generated biographical information",                1992,  "gbi",
              "base4.dta", "Generated biographical information",                1993,  "gbi",
              "base5.dta", "Data on children from household questionnaire",     1990, "dchq",
              "base6.dta", "Data on children from household questionnaire",     1991, "dchq",
              "base7.dta", "Data on children from household questionnaire",     1992, "dchq",
              "base8.dta", "Data on children from household questionnaire",     1993, "dchq",
              "base9.dta", "Data from individual questionnaires",                1990,  "diq",
              "base10.dta", "Data from individual questionnaires",               1991,  "diq",
              "base11.dta", "Data from individual questionnaires",               1992,  "diq",
              "base12.dta", "Data from individual questionnaires",               1993,  "diq")

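The data frames listed in the name column are all in the same path of my project, under the same names as in df. I would like to loop or purrr through this table (which is much longer, of course) in the following way: search the name column, bind_rows all the data frames that share the same value in the label column, and assign the result to a data frame named after the corresponding id. Then I would like to save those objects, named by id, as .rds files in a different path.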

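I replicated your data.frame and saved mtcars and iris twice each. To automate the process, you can start by splitting your data.frame by label, which I assume is what you want to bind_rows on.

Then I use a nested map to read each path given in your data.frame called df (data_test in my example), using read.table.

Obviously, you can use any kind of data-loading function.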

library(tidyverse)

data_test <- tribble(
  ~name,          ~label,
  "mtcars_1.csv", "mtcars",
  "mtcars_2.csv", "mtcars",
  "iris_1.csv",   "iris",
  "iris_2.csv",   "iris"
)


data_test %>%
  # one sub-table per label
  split(f = .$label) %>%
  map(.f = function(group) {
    # read every file listed for this label, then stack the results
    group$name %>%
      map(.f = function(path) read.table(path)) %>%
      reduce(bind_rows)
  })

This loads every data.frame listed under the name variable, groups them by label, and combines each group with bind_rows.
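The steps above can be tried end to end without touching a real project. The following is a minimal sketch, assuming a scratch folder under tempdir(); the names in_dir, lookup and combined are made up for illustration, and the files are written with write.table so that read.table can read them back.

```r
library(tidyverse)

# Hypothetical scratch folder standing in for the project path
in_dir <- file.path(tempdir(), "input")
dir.create(in_dir, showWarnings = FALSE)

# Save each example dataset twice so there is something to bind
write.table(mtcars, file.path(in_dir, "mtcars_1.csv"), row.names = FALSE)
write.table(mtcars, file.path(in_dir, "mtcars_2.csv"), row.names = FALSE)
write.table(iris,   file.path(in_dir, "iris_1.csv"),   row.names = FALSE)
write.table(iris,   file.path(in_dir, "iris_2.csv"),   row.names = FALSE)

lookup <- tribble(
  ~name,          ~label,
  "mtcars_1.csv", "mtcars",
  "mtcars_2.csv", "mtcars",
  "iris_1.csv",   "iris",
  "iris_2.csv",   "iris"
)

# Split by label, read each file in the group, and stack the pieces
combined <- lookup %>%
  split(f = .$label) %>%
  map(.f = function(group) {
    file.path(in_dir, group$name) %>%
      map(.f = function(path) read.table(path, header = TRUE)) %>%
      reduce(bind_rows)
  })

# The result is a named list with one data frame per label
names(combined)
map(combined, dim)
```

The result is a plain named list, so each combined data frame can be pulled out by its label, for example combined$mtcars.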

Edit: As @Anoushiravan pointed out, you can replace read.table with haven::read_dta(x) to load data from Stata!
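As a rough illustration of that swap, the sketch below writes two small Stata files and reads them back with haven::read_dta. It assumes the haven package is installed; dta_dir, dta_lookup and by_label are hypothetical names, and mtcars merely stands in for the real survey files.

```r
library(tidyverse)

# Hypothetical scratch folder; mtcars stands in for the real .dta contents
dta_dir <- file.path(tempdir(), "dta_input")
dir.create(dta_dir, showWarnings = FALSE)
haven::write_dta(mtcars, file.path(dta_dir, "base1.dta"))
haven::write_dta(mtcars, file.path(dta_dir, "base2.dta"))

dta_lookup <- tribble(
  ~name,       ~label,
  "base1.dta", "Generated biographical information",
  "base2.dta", "Generated biographical information"
)

# Same split/map pattern, with read_dta as the loading function
by_label <- dta_lookup %>%
  split(f = .$label) %>%
  map(.f = function(group) {
    file.path(dta_dir, group$name) %>%
      map(.f = function(path) haven::read_dta(path)) %>%
      reduce(bind_rows)
  })

dim(by_label[["Generated biographical information"]])
```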

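Given that your label and id columns both repeat in the same pattern, and you want the output labelled by id, you can ignore label.

You also don't need purrr: just group by id and name, read in your data, and then bind the rows using summarise.

This uses @Serkan's data_test with an id column added.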

library(tidyverse)

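data_test %>% 
  group_by(id, name) %>% 
  # one row per file: read it into a list column
  summarise(df = list(read.csv(name))) %>% 
  # still grouped by id: stack the files that share an id
  summarise(joined = list(bind_rows(df)))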

  id    joined        
  <chr> <list>        
1 iri   <df [300 × 5]>
2 mtc   <df [64 × 11]>

To write to Rds, you can group by id and then use write_rds:

... %>% 
  group_by(id) %>% 
  # joined is a list column, so [[1]] pulls out the data frame itself
  group_walk(~write_rds(.x$joined[[1]], paste0(.y$id, ".rds")))
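The question asks for the .rds files to land in a different path. A minimal sketch of that step follows, assuming a scratch output folder; out_dir and joined_tbl are hypothetical names, and joined_tbl is built in memory as a stand-in for the result of the two summarise calls.

```r
library(tidyverse)

# Hypothetical output folder standing in for the "different path"
out_dir <- file.path(tempdir(), "rds_output")
dir.create(out_dir, showWarnings = FALSE)

# Stand-in for the summarised table: one combined data frame per id
joined_tbl <- tibble(
  id = c("iri", "mtc"),
  joined = list(
    bind_rows(as_tibble(iris), as_tibble(iris)),
    bind_rows(as_tibble(mtcars), as_tibble(mtcars))
  )
)

# Write each combined data frame to <out_dir>/<id>.rds
joined_tbl %>%
  group_by(id) %>%
  group_walk(~ write_rds(.x$joined[[1]], file.path(out_dir, paste0(.y$id, ".rds"))))

# Read one file back to check the round trip
mtc <- read_rds(file.path(out_dir, "mtc.rds"))
dim(mtc)
```

Building the file name with file.path keeps the output folder separate from the id, so only out_dir needs to change to redirect every file.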

Data

data_test <- tribble(
  ~name, ~label, ~id,
  "mtcars_1.csv", "mtcars", "mtc",
  "mtcars_2.csv", "mtcars", "mtc",
  "iris_1.csv", "iris", "iri",
  "iris_2.csv", "iris", "iri"
)