Converting PDF table to data.frame in R
I'm working on creating an automated process to extract tables from annual PDF reports. Ideally, I'd be able to take each year's report, pull the data out of its tables, combine all the years into one large data frame, and then analyze it. Here's what I have so far (focusing on just one year's report):
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\s+", ",", gsub("^\s+|\s+$", "",data))
My issue is getting this raw table data into a data frame with each state as a row and the respective values in columns. I'm sure the solution is simple, but I'm just new to R! Any help?
Edit: These solutions are all great and work well. However, when I tried another year's report, I ran into an error:
Error: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
After trying this code for the next report:
## convert txt09_11 to data frame for analyzing
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
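A plausible explanation for that error (an assumption on my part, not verified against the 2009-2011 report): readr only treats a string as literal data when it contains at least one newline, so if the grep('^\\s{2}\\w', ...) filter matches zero or one line on that page, the pasted result has no newline and read_fwf() tries to open it as a file path. A short diagnostic sketch:
lines <- read_lines(txt09_11[54])
head(lines, 20)                                    # is this even the right page in this report?
matches <- grep('^\\s{2}\\w', lines, value = TRUE) # same pattern as the failing code above
length(matches)                                    # 0 or 1 match -> no newline after paste(),
                                                   # so read_fwf() treats the string as a path
If the table sits on a different page or the state rows are indented differently in this report, adjusting the page index and the pattern should let the same pipeline work.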
Your gsub calls are a bit over-aggressive. Your data[-c(1, ...), ] step is fine, so I'll pick up from there, replacing all of your gsub calls with:
# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1  Connecticut* $3,779 $2,992  $0 $6,771 $3,496 $3,483  $0 $6,979 $3,612 $3,604  $0 $7,216
# 2 Maine* 746 1,767 267 2,780 753 1,510 270 2,533 776 1,605 274 2,655
# 3 Massachusetts 6,359 5,542 143 12,044 6,953 6,771 174 13,898 7,411 7,463 292 15,166
# 4 New Hampshire 491 660 175 1,326 515 936 166 1,617 523 1,197 238 1,958
# 5 Rhode Island 998 1,190 31 2,219 998 1,435 24 2,457 953 1,527 22 2,502
# 6 Vermont* 282 797 332 1,411 302 923 326 1,551 337 948 338 1,623
Note: the widths I used (35, 15, 20, ...) were hastily derived; while I believe they work, admittedly I did not verify line by line that I'm not chopping anything off. Please verify!
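For a quick spot-check (a sketch, using the same guessed widths), you can look at where the cumulative cut positions fall relative to the raw lines:
# `data` here is the character vector of table lines right after the data[-c(1, ...), ] step
widths <- c(35, 15, 20, 20, 12, 10, 15, 10, 10, 10, 10, 15, 99)
cumsum(widths)                        # character position where each column ends
range(nchar(data))                    # line lengths, to compare against sum(widths)
substring(data[1], cumsum(widths) - 3, cumsum(widths))  # last few characters of each field;
                                                        # a digit chopped mid-number means a width is wrong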
Lastly, from here you'll probably want to remove the $ and , and convert to integers, which is straightforward:
dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216
# 2 Maine* 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502
# 6 Vermont* 282 797 332 1411 302 923 326 1551 337 948 338 1623
I'm guessing the asterisks on the state names are meaningful. They can easily be captured with grepl and then removed:
dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 ast
# 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216 TRUE
# 2 Maine 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655 TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958 FALSE
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502 FALSE
# 6 Vermont 282 797 332 1411 302 923 326 1551 337 948 338 1623 TRUE
readr::read_fwf has a fwf_empty helper that will guess the column widths for you, which makes this a lot simpler:
library(tidyverse)
df <- txt14_16[56] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
df
#> # A tibble: 50 × 13
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612
#> 2 Maine 746 1767 267 2780 753 1510 270 2533 776
#> 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411
#> 4 New Hampshire 491 660 175 1326 515 936 166 1617 523
#> 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953
#> 6 Vermont 282 797 332 1411 302 923 326 1551 337
#> 7 Delaware 662 1001 0 1663 668 1193 14 1875 689
#> 8 Maryland 2893 4807 860 8560 2896 5686 1061 9643 2812
#> 9 New Jersey 3961 6920 1043 11924 3831 8899 1053 13783 3955
#> 10 New York 10981 24237 4754 39972 11161 29393 5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> # X13 <dbl>
You'll obviously still need to add column names, but at that point the data is quite usable.
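To get from here to the original goal of one big multi-year data frame, one possible extension of this pipeline (a sketch only: the column names below are placeholders, and the page where the table appears has to be checked for each report, as the edit above shows) is to wrap the page-to-tibble steps in a function and stack the results per report:
library(tidyverse)
library(pdftools)

# Hypothetical column names -- check the table header in each report for the real
# fund categories; the 12 numeric columns are assumed to be 4 measures x 3 fiscal years.
value_cols <- paste(rep(c("col_a", "col_b", "col_c", "total"), times = 3),
                    rep(c("fy1", "fy2", "fy3"), each = 4), sep = "_")

# Extract one state table, given an already-downloaded PDF and a page number.
extract_state_table <- function(pdf_file, page) {
  pdf_text(pdf_file)[page] %>%
    read_lines() %>%
    grep('^\\s{2}\\w', ., value = TRUE) %>%       # state rows: two spaces then a letter
    paste(collapse = '\n') %>%
    read_fwf(fwf_empty(.)) %>%
    mutate_at(-1, parse_number) %>%
    mutate(X1 = sub('*', '', X1, fixed = TRUE)) %>%
    set_names(c("state", value_cols))             # assumes 13 columns; adjust per report
}

# Hypothetical report list: file names and page numbers must be verified per report,
# since the table does not sit on the same page in every year's PDF.
reports <- tribble(
  ~file,            ~page, ~years,
  "nasbo14_16.pdf",    56, "2014-2016"
  # add more (file, page, years) rows here once each report's layout is confirmed
)

all_years <- reports %>%
  mutate(tbl = map2(file, page, extract_state_table)) %>%
  unnest(tbl)
The years column then identifies which report each row came from, so all years can be analyzed together.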