Classifying PDF Text Documents based on the presence/absence of specific words in R
I would like to be able to import PDF documents into R and classify them as either:
- Relevant (contains a specific string, e.g. "tacos", within the first 100 words)
- Irrelevant (does not contain "tacos" within the first 100 words)
More specifically, I would like to address the following questions:
- Does a package exist in R that performs this kind of basic classification?
- If so, and I have two PDF documents, Paper1, which contains at least one instance of the string "tacos" within its first 100 words, and Paper2, which does not, is it possible to produce a dataset in R that looks something like this:

    Document   Classification
  1 Paper1     relevant
  2 Paper2     not relevant
Any references to documentation/R packages/sample R code, or mock examples of this sort of classification in R, would be greatly appreciated! Thanks!
You could use the pdftools library and do something like this:
First, load the library and grab some PDF file names:
library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full.names = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
Then define a function that reads a PDF file as text and looks at the first n words. (It may be useful to check for errors, e.g. an unknown password or something similar; the function below returns NA in such cases.)
isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    # extract the text (one string per page) and split it into words
    txt <- pdf_text(fn)
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    # does the needle occur within the first n words?
    any(grepl(needle, txt[1:n], ...))
  }, silent = TRUE)
  # return NA if the PDF could not be read (e.g. unknown password)
  if (inherits(res, "try-error")) NA else res
}
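To try the function on a single document first (the path below is made up), call it directly; the result is TRUE or FALSE, or NA if the PDF could not be read:

# hypothetical single-file check
isRelevant("~/Documents/paper1.pdf", needle = "tacos", ignore.case = TRUE)

Then apply it to all file names: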
res <- sapply(fns, isRelevant, needle = "mail", ignore.case = TRUE)
Finally, wrap it all up in a data frame (the fourth argument of dplyr::if_else() turns NA results, i.e. unreadable PDFs, into "unknown"):
data.frame(
  Document = basename(fns),
  Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
#   Document Classification
# 1    a.pdf       relevant
# 2    b.pdf   not relevant
# 3    c.pdf       relevant
# 4    d.pdf   not relevant
# 5    e.pdf       relevant
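The example above searches for "mail"; for the question's "tacos", the same helper works with a different needle. Passing fixed = TRUE through ... makes grepl() treat the needle as a literal string rather than a regular expression (a minimal sketch, not part of the original answer):

# literal (non-regex) match for the question's term
res <- sapply(fns, isRelevant, needle = "tacos", fixed = TRUE)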
Although @lukeA beat me to it, I wrote another small function that also uses pdftools. The only real difference is that lukeA's version looks at the first n characters, while my script looks at the first n words.
Here is my approach:
library(pdftools)
library(dplyr) # for data_frames and bind_rows
# set the working directory so the relative paths below resolve
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = TRUE)
# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # use pdftools::pdf_text to extract the text (one string per page)
    content <- pdf_text(file)
    # do some cleanup, i.e., lower all letters, replace new-lines with
    # spaces (so words on adjacent lines are not glued together), remove
    # punctuation, and collapse the pages into a single string
    content2 <- tolower(content)
    content2 <- gsub("\n", " ", content2)
    content2 <- gsub("[[:punct:]]", "", content2)
    content2 <- trimws(paste(content2, collapse = " "))
    # split up the text on whitespace
    content_vec <- strsplit(content2, "\\s+")[[1]]
    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]
    # create a data_frame that holds our data
    res <- data_frame(file = file,
                      relevance = ifelse(found,
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  })
  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
#                          file  relevance
#                         <chr>      <chr>
# 1         pdfs//pdf_empty.pdf Irrelevant
# 2         pdfs//pdf_taco1.pdf   Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
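One caveat with the %in% test: it only matches whole tokens exactly, so searching for "taco" would not flag a document that only ever writes "tacos". A minimal sketch of a looser substring check, assuming a hypothetical file path, with grepl() and fixed = TRUE:

library(pdftools)

# hypothetical path; a substring match also flags "tacos", "tacoria", etc.
words <- scan(text = pdf_text("pdfs/pdf_taco1.pdf"), what = "character",
              quote = "", quiet = TRUE)
any(grepl("taco", tolower(words[1:100]), fixed = TRUE))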