Classifying PDF Text Documents based on the presence/absence of specific words in R
I would like to be able to import PDF documents into R and classify them as either:
- Relevant (contains a specific string, e.g. "tacos", within the first 100 words)
- Irrelevant (does not contain "tacos" within the first 100 words)
More specifically, I would like to address the following questions:
- Does a package exist in R that performs this kind of basic classification?
- If so, and I have two PDF documents, Paper1, which contains at least one instance of the string "tacos" within its first 100 words, and Paper2, which does not, is it possible to produce a dataset in R that looks something like this:

    Document   Classification
  1 Paper1     relevant
  2 Paper2     not relevant
Any references to documentation/R packages/sample R code, or mock examples of this sort of classification in R, would be greatly appreciated! Thanks!
You could use the pdftools library and do something like this:
First, load the library and grab some PDF file names:
library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full.names = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
Then define a function that reads a PDF file as text and looks at the first n words. (It may be useful to check for errors, e.g. an unknown password or something similar; the function below returns NA in such cases.)
isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    # extract the text (one string per page) and split it into words
    txt <- pdf_text(fn)
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    # does the needle occur within the first n words?
    any(grepl(needle, txt[1:n], ...))
  }, silent = TRUE)
  # return NA if the PDF could not be read (e.g. unknown password)
  if (inherits(res, "try-error")) NA else res
}
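To try the function on a single document first (the path below is made up), call it directly; the result is TRUE or FALSE, or NA if the PDF could not be read:

# hypothetical single-file check
isRelevant("~/Documents/paper1.pdf", needle = "tacos", ignore.case = TRUE)

Then apply it to all file names: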
res <- sapply(fns, isRelevant, needle = "mail", ignore.case = TRUE)
Finally, wrap it all up in a data frame (the fourth argument of dplyr::if_else() turns NA results, i.e. unreadable PDFs, into "unknown"):
data.frame(
  Document = basename(fns),
  Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
#   Document Classification
# 1    a.pdf       relevant
# 2    b.pdf   not relevant
# 3    c.pdf       relevant
# 4    d.pdf   not relevant
# 5    e.pdf       relevant
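The example above searches for "mail"; for the question's "tacos", the same helper works with a different needle. Passing fixed = TRUE through ... makes grepl() treat the needle as a literal string rather than a regular expression (a minimal sketch, not part of the original answer):

# literal (non-regex) match for the question's term
res <- sapply(fns, isRelevant, needle = "tacos", fixed = TRUE)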
Although @lukeA beat me to it, I wrote another small function that also uses pdftools. The only real difference is that lukeA's version looks at the first n characters, while my script looks at the first n words.
Here is my approach:
library(pdftools)
library(dplyr) # for data_frames and bind_rows
# set the working directory so the relative paths below resolve
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = TRUE)
# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # use pdftools::pdf_text to extract the text (one string per page)
    content <- pdf_text(file)
    # do some cleanup, i.e., lower all letters, replace new-lines with
    # spaces (so words on adjacent lines are not glued together), remove
    # punctuation, and collapse the pages into a single string
    content2 <- tolower(content)
    content2 <- gsub("\n", " ", content2)
    content2 <- gsub("[[:punct:]]", "", content2)
    content2 <- trimws(paste(content2, collapse = " "))
    # split up the text on whitespace
    content_vec <- strsplit(content2, "\\s+")[[1]]
    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]
    # create a data_frame that holds our data
    res <- data_frame(file = file,
                      relevance = ifelse(found,
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  })
  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
#                          file  relevance
#                         <chr>      <chr>
# 1         pdfs//pdf_empty.pdf Irrelevant
# 2         pdfs//pdf_taco1.pdf   Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
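One caveat with the %in% test: it only matches whole tokens exactly, so searching for "taco" would not flag a document that only ever writes "tacos". A minimal sketch of a looser substring check, assuming a hypothetical file path, with grepl() and fixed = TRUE:

library(pdftools)

# hypothetical path; a substring match also flags "tacos", "tacoria", etc.
words <- scan(text = pdf_text("pdfs/pdf_taco1.pdf"), what = "character",
              quote = "", quiet = TRUE)
any(grepl("taco", tolower(words[1:100]), fixed = TRUE))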