map2 over columns and read .txt data into R storing as a list element

I have the following data:

                                                                                  folders                    Name
1              C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip  econ_indicator_map.txt
2              C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip entity_sub_type_map.txt
3   C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/sym_merged_fsym_id_v1_full_4050.zip  sym_merged_fsym_id.txt
4 C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ent_supply_chain_hub_v1_full_2264.zip     ent_scr_address.txt
   Length                Date
1    1822 2021-07-03 01:48:00
2     925 2021-07-03 01:48:00
3 1180324 2021-07-03 01:26:00
4 4506085 2021-07-03 04:11:00

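I want to map2 over the folders and Name columns and read the listed file out of each zip archive.

For example, the archive C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip (rows 1 and 2) contains 2 .txt files that I want to read in using read.table.

The following does not work: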

zips %>% 
  map2(
    .x = folders,
    .y = Name,
    ~unz(.x, .y)
)

For a single observation, taking the first row as an example, I can run:

dataUnzipped <- unzip(zips[1, 1], list = TRUE, exdir = "unzipDumps")

This gives me:

                      Name  Length                Date
1         hts_code_map.txt 5036228 2021-07-03 01:48:00
2 state_prov_coord_map.txt  117121 2021-07-03 01:48:00
3     mic_exchange_map.txt   47269 2021-07-03 01:48:00
4   state_province_map.txt  108424 2021-07-03 01:48:00
5              sic_map.txt   50354 2021-07-03 01:48:00
6           naics6_map.txt   50964 2021-07-03 01:48:00

Then I run:

unzFile <- unz(zips[1, 1], dataUnzipped[1, 1])

where zips[1, 1] is "C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip" and dataUnzipped[1, 1] is hts_code_map.txt. This gives me:

A connection with                                                                                                         
description "C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip:hts_code_map.txt"
class       "unz"                                                                                        
mode        "r"                                                                                          
text        "text"                                                                                       
opened      "closed"                                                                                     
can read    "yes"                                                                                        
can write   "yes"

I can then run read.table, which gives the following:

> read.table(unzFile, sep = "|")
            V1             V2
1     HTS_CODE       HTS_DESC
2   0000000000  NONE PROVIDED
3   0000010000 NONE DISCLOSED
4   0000020000 NONE DISCLOSED
5   0000030000 NONE DISCLOSED

So how can I map over each row and read in the .txt data?

Data:

zips <- structure(list(folders = c("C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip", 
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip", 
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/sym_merged_fsym_id_v1_full_4050.zip", 
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ent_supply_chain_hub_v1_full_2264.zip"
), Name = c("econ_indicator_map.txt", "entity_sub_type_map.txt", 
"sym_merged_fsym_id.txt", "ent_scr_address.txt"), Length = c(1822, 
925, 1180324, 4506085), Date = structure(c(1625276880, 1625276880, 
1625275560, 1625285460), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L))

How about:

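library(tidyverse)
map2_df(zips$folders, zips$Name, function(x, y) {
  # x is the path to the zip archive, y is the file name inside it
  con <- unz(x, y)
  # read from the connection created above (not a global unzFile)
  out <- read.table(con, sep = "|", header = TRUE) # seems you have a header in your example
  out$source_file <- y
  return(out)
})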

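This should give you a single data.frame with a column (source_file) telling you which file each row came from, assuming the files can be bound together. Otherwise, just replace map2_df with map2 to get a list of data.frames.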
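
For the list variant, here is a minimal sketch. Note that map2() takes the two vectors directly (zips$folders, zips$Name). It does not look up bare column names inside a piped data frame, which is probably why the attempt in the question fails. The helper names read_zip_member and read_all_members are hypothetical, and sep = "|" with header = TRUE is an assumption based on the sample output shown in the question.

```r
library(purrr)

# Hypothetical helper: read one pipe-delimited member of a zip archive.
# Assumes every file uses "|" as the separator and has a header row.
read_zip_member <- function(zip_path, member) {
  con <- unz(zip_path, member)  # connection to the file inside the archive
  # read.table() opens an unopened connection itself and closes it when done
  read.table(con, sep = "|", header = TRUE, stringsAsFactors = FALSE)
}

# Hypothetical wrapper: one data.frame per (archive, file) row,
# returned as a list named by the file inside the archive.
read_all_members <- function(zips) {
  out <- map2(zips$folders, zips$Name, read_zip_member)
  set_names(out, zips$Name)
}
```

With the zips data frame from the question, txt_list <- read_all_members(zips) should return a list in which txt_list[["econ_indicator_map.txt"]] holds the first file, provided the archives exist at those paths.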