map2 over columns and read .txt data into R, storing each file as a list element
I have the following data:
folders Name
1 C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip econ_indicator_map.txt
2 C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip entity_sub_type_map.txt
3 C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/sym_merged_fsym_id_v1_full_4050.zip sym_merged_fsym_id.txt
4 C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ent_supply_chain_hub_v1_full_2264.zip ent_scr_address.txt
Length Date
1 1822 2021-07-03 01:48:00
2 925 2021-07-03 01:48:00
3 1180324 2021-07-03 01:26:00
4 4506085 2021-07-03 04:11:00
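I want to map2 over the folders and Name columns and read the files contained in each zip archive.
For example, the archive C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip (rows 1 and 2) contains 2 .txt files, and I want to read those files using read.table.
The following does not work: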
zips %>%
map2(
.x = folders,
.y = Name,
~unz(.x, .y)
)
For a single observation, taking the first row as an example, I can run:
dataUnzipped <- unzip(zips[1, 1], list = TRUE, exdir = "unzipDumps")
Which gives me:
Name Length Date
1 hts_code_map.txt 5036228 2021-07-03 01:48:00
2 state_prov_coord_map.txt 117121 2021-07-03 01:48:00
3 mic_exchange_map.txt 47269 2021-07-03 01:48:00
4 state_province_map.txt 108424 2021-07-03 01:48:00
5 sic_map.txt 50354 2021-07-03 01:48:00
6 naics6_map.txt 50964 2021-07-03 01:48:00
Then I run:
unzFile <- unz(zips[1, 1], dataUnzipped[1, 1])
where zips[1, 1] is "C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip" and dataUnzipped[1, 1] is hts_code_map.txt. This gives me:
A connection with
description "C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip:hts_code_map.txt"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
I can then run read.table, which gives the following:
> read.table(unzFile, sep = "|")
V1 V2
1 HTS_CODE HTS_DESC
2 0000000000 NONE PROVIDED
3 0000010000 NONE DISCLOSED
4 0000020000 NONE DISCLOSED
5 0000030000 NONE DISCLOSED
So how can I map over each row and read in the .txt data?
Data:
zips <- structure(list(folders = c("C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip",
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ref_hub_v2_full_1986.zip",
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/sym_merged_fsym_id_v1_full_4050.zip",
"C:/Users/bscuser/Desktop/FactSet/FactSetData/zips/ent_supply_chain_hub_v1_full_2264.zip"
), Name = c("econ_indicator_map.txt", "entity_sub_type_map.txt",
"sym_merged_fsym_id.txt", "ent_scr_address.txt"), Length = c(1822,
925, 1180324, 4506085), Date = structure(c(1625276880, 1625276880,
1625275560, 1625285460), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L))
How about:
library(tidyverse)
map2_df(zips$folders, zips$Name, function(x, y) {
  file <- unz(x, y)  # connection to file y inside zip archive x
  out <- read.table(file, sep = "|", header = TRUE)  # your example appears to have a header row
  out$source_file <- y  # record which file each row came from
  return(out)
})
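This should give you a nice data.frame with a column (source_file) that tells you where each row came from, assuming the files can be bound together. Otherwise, just replace map2_df with map2 to get a list of data.frames.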
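For the list variant, here is a minimal, self-contained sketch. It builds a small throwaway zip rather than using the FactSet archives, so the file names (a_map.txt, b_map.txt), their contents, and the demo data frame are all made up for illustration. It also assumes an external zip program is available for utils::zip(), and it reads every column as character so that codes such as 0000000000 keep their leading zeros; that last choice is mine, not something the question asked for.

```r
library(purrr)

# Build a tiny zip archive in a temp directory (hypothetical demo data).
# utils::zip() shells out to an external `zip` program, assumed to be on the PATH.
tmp <- tempfile("zipdemo")
dir.create(tmp)
writeLines(c("HTS_CODE|HTS_DESC", "0000000000|NONE PROVIDED"),
           file.path(tmp, "a_map.txt"))
writeLines(c("ID|NAME", "1|foo", "2|bar"),
           file.path(tmp, "b_map.txt"))
zip_path <- file.path(tmp, "demo.zip")
zip(zip_path, files = file.path(tmp, c("a_map.txt", "b_map.txt")),
    flags = "-j -q")  # -j: store file names only, -q: quiet

# Same shape as the `zips` data frame: one row per (archive, member) pair.
demo <- data.frame(
  folders = c(zip_path, zip_path),
  Name    = c("a_map.txt", "b_map.txt")
)

# map2() returns a list with one data.frame per row of `demo`.
# read.table() opens the unz() connection and closes it again when it is done.
tables <- map2(demo$folders, demo$Name, function(zip_file, member) {
  read.table(unz(zip_file, member), sep = "|", header = TRUE,
             colClasses = "character")
})
names(tables) <- demo$Name  # name each element after its source file

str(tables)
```

With the real data, the same call would take zips$folders and zips$Name in place of demo$folders and demo$Name, and each table could then be accessed by file name, for example tables[["econ_indicator_map.txt"]].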