Specify column types in read_csv by using values in a data frame
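I am trying to read a directory containing multiple CSV files, each with roughly 7K+ rows and about 1,800 columns. I have a data dictionary that I can read into a data frame, in which each row identifies a variable (column) name along with its data type.
According to ?read_csv in the readr package, column types can be specified. However, given that I have nearly 1,800 columns to specify, I would like to use the information in the data dictionary to specify the column/type pairs in the format the function requires.
The alternative, less ideal approach is to read every column in as character and then modify the types by hand as needed.
Any help on how to specify the column types would be appreciated.
In case it helps, here is my code for fetching the data dictionary and converting it into the format I am referring to.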
## Get the data dictionary
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx"
download.file(URL, destfile="raw-data/dictionary.xlsx")
## read in the dictionary to get the variables
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
dict = dict %>% filter(!is.na(variable_name))
## create a data dictionary
##
dict <- dict %>%
  mutate(variable_type = case_when(
    api_data_type == "integer"      ~ "i",
    api_data_type == "autocomplete" ~ "c",  # assumption that this is a string
    api_data_type == "string"       ~ "c",
    api_data_type == "float"        ~ "d"
  ))
returns :
> ## read in the dictionary to get the variables
> dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
> colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
> dict = dict %>% filter(!is.na(variable_name))
> dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i",
+ api_data_type == "autocomplete" ~ "c", #assumption that this is a string
+ api_data_type == "string" ~ "c",
+ api_data_type == "float" ~ "d"))
Error: object 'api_data_type' not found
And my session info:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.2.0 readxl_0.1.1 readr_1.1.0 dplyr_0.5.0
loaded via a namespace (and not attached):
[1] rjson_0.2.15 lazyeval_0.2.0 magrittr_1.5 R6_2.2.2 assertthat_0.1 hms_0.2 DBI_0.7 tools_3.3.1
[9] tibble_1.2 yaml_2.1.14 Rcpp_0.12.11 stringi_1.1.5 jsonlite_1.5
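You can combine mutate and case_when to map the api_data_type column to readr's compact string representation, in which each column type is represented by a single letter: c = character, i = integer, n = number, d = double, l = logical, and so on. The resulting character vector, collapsed into a single string (for example with paste(ctypes, collapse = "")), can then be passed to the col_types argument of read_csv.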
## Load libraries
library(dplyr)
library(readxl)
## Get the data dictionary
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx"
download.file(URL, destfile="raw-data/dictionary.xlsx")
## read in the dictionary to get the variables
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
dict = dict %>% filter(!is.na(variable_name))
unique(dict$api_data_type)
#> [1] "integer" "autocomplete" "string" "float"
dict <- dict %>%
  mutate(variable_type = case_when(
    api_data_type == "integer"      ~ "i",
    api_data_type == "autocomplete" ~ "c",  # assumption that this is a string
    api_data_type == "string"       ~ "c",
    api_data_type == "float"        ~ "d"
  ))
cnames <- dict %>% select(variable_name) %>% pull
head(cnames)
#> [1] "UNITID" "OPEID" "OPEID6" "INSTNM" "CITY" "STABBR"
ctypes <- dict %>% select(variable_type) %>% pull
head(ctypes)
#> [1] "i" "i" "i" "c" "c" "c"
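To close the loop on the read step, here is a minimal sketch of passing such a specification to read_csv. It is not part of the code above: the three-column dictionary, the temporary CSV files, and the names csv_dir, spec_string, spec_named, and scorecard are hypothetical stand-ins for illustration. It assumes that read_csv treats a single compact string as one letter per column, and that cols() accepts named single-letter abbreviations.

```r
library(readr)

# Hypothetical stand-ins for the cnames / ctypes vectors derived from the dictionary
cnames <- c("UNITID", "INSTNM", "ADM_RATE")
ctypes <- c("i", "c", "d")

# Write two tiny CSV files to a temporary directory so the sketch is self-contained
csv_dir <- file.path(tempdir(), "scorecard-demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("UNITID,INSTNM,ADM_RATE", "1,College A,0.5"),
           file.path(csv_dir, "year1.csv"))
writeLines(c("UNITID,INSTNM,ADM_RATE", "2,College B,0.75"),
           file.path(csv_dir, "year2.csv"))

# Option 1: one compact string, matched to the columns by position
spec_string <- paste(ctypes, collapse = "")
by_position <- read_csv(file.path(csv_dir, "year1.csv"), col_types = spec_string)

# Option 2: a named specification, matched to the columns by name
spec_named <- do.call(cols, as.list(setNames(ctypes, cnames)))

# Apply the same named specification to every CSV file in the directory
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
scorecard <- lapply(csv_files, read_csv, col_types = spec_named)
names(scorecard) <- basename(csv_files)
```

The positional string only lines up if the dictionary lists the variables in the same order, and in the same number, as the columns of each file; the named form avoids that dependency. Any NA left in ctypes by an unmatched case_when branch would need handling before either form is built. If the files mark missing values with sentinel strings, the na argument of read_csv can list them.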