I have to grab plain text from over 290K webpages. Is there a way to improve the speed?

Question

I have a vector of more than 290K URLs that point to articles on a news portal. Here is a sample:

tempUrls <- c("https://lenta.ru/news/2009/12/31/kids/",
                  "https://lenta.ru/news/2009/12/31/silvio/",
                  "https://lenta.ru/news/2009/12/31/postpone/",
                  "https://lenta.ru/news/2009/12/31/boeviks/",
                  "https://lenta.ru/news/2010/01/01/celebrate/",
                  "https://lenta.ru/news/2010/01/01/aes/")

This is the code I use to download the plain text:

library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), getNodeSet(), xmlSApply(), xmlValue()

GetPageText <- function(address) {

        webpage <- getURL(address, followLocation = TRUE, .opts = list(timeout = 10))
        pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE, encoding = "UTF-8")
        node <- getNodeSet(pagetree, "//div[@itemprop='articleBody']/..//p")
        plantext <- xmlSApply(node, xmlValue)
        plantext <- paste(plantext, collapse = "")
        node <- getNodeSet(pagetree, "//title")
        title <- xmlSApply(node, xmlValue)

        return(list(plantext = plantext, title = title))
}

DownloadPlanText <- function() {

        tempUrls <- c("https://lenta.ru/news/2009/12/31/kids/",
                      "https://lenta.ru/news/2009/12/31/silvio/",
                      "https://lenta.ru/news/2009/12/31/postpone/",
                      "https://lenta.ru/news/2009/12/31/boeviks/",
                      "https://lenta.ru/news/2010/01/01/celebrate/",
                      "https://lenta.ru/news/2010/01/01/aes/")

        for (i in 1:length(tempUrls)) {
                print(system.time(GetPageText(tempUrls[i])))
        }
}

Here is the system.time output for these 6 links:

   user  system elapsed 
  0.081   0.004   3.754 
   user  system elapsed 
  0.061   0.003   3.340 
   user  system elapsed 
  0.069   0.003   3.115 
   user  system elapsed 
  0.059   0.003   3.697 
   user  system elapsed 
  0.068   0.004   2.788 
   user  system elapsed 
  0.061   0.004   3.469

So it takes about 3 seconds to download the plain text from one link. For 290K links, that is 14,500 minutes, or about 241 hours, or roughly 10 days.
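As a quick check of that estimate, here is the arithmetic in R. It assumes a flat 3 seconds per page and strictly sequential requests, which is what the timings above suggest:

```r
# Back-of-the-envelope estimate: sequential downloads at about 3 s per page.
n_urls       <- 290000
secs_per_url <- 3

total_secs <- n_urls * secs_per_url
estimate <- c(minutes = total_secs / 60,
              hours   = total_secs / 3600,
              days    = total_secs / 86400)
print(round(estimate, 1))
```

Since the user/system times are tiny compared with elapsed time, nearly all of those 3 seconds are spent waiting on the network rather than parsing.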

Is there any way to improve this?

Answer 1

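There are a few ways you can do this, but I strongly suggest keeping a copy of the source pages. You may need to go back and scrape something you forgot, and it's bad form to hammer the site a second time.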

One of the best ways to do this archiving is to create a WARC file, and we can do that with wget. On macOS you can install wget with Homebrew (brew install wget).

Create a file with the URLs to scrape, one URL per line. For example, these are the contents of lenta.urls:

https://lenta.ru/news/2009/12/31/kids/
https://lenta.ru/news/2009/12/31/silvio/
https://lenta.ru/news/2009/12/31/postpone/
https://lenta.ru/news/2009/12/31/boeviks/
https://lenta.ru/news/2010/01/01/celebrate/
https://lenta.ru/news/2010/01/01/aes/

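At the terminal, create a directory to hold the output and make it your working directory, since wget non-deterministically fails to delete its temporary files (which is super annoying). In this new directory, again at the terminal, run: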

wget --warc-file=lenta -i lenta.urls

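That will run at the speed of your internet connection and retrieve the contents of all the pages listed in that file. It does not mirror the site, so it's not getting images and so on, just the content of the main pages you asked for.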

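Because of the non-deterministic bug I mentioned, there may now be a number of index.html[.###] files in this directory. Before you delete them, make a backup of lenta.warc.gz: you just spent a lot of time fetching it, you annoyed the folks who run that site, and you don't want to have to do it again. Seriously, copy it to a separate drive/file/etc. Once you've made this backup (you did make a backup, right?) you can and should delete those index.html[.###] files.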

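We now need to read this file and extract the content. However, the creators of R seem unable to make gz file connections work consistently with seeking across platforms, even though there are a dozen C/C++ libraries that do it well, so you'll have to decompress lenta.warc.gz first (double-click on it, or run gunzip lenta.warc.gz at the terminal).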
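If you would rather stay inside R for the decompression step, a sequential read through a gzfile() connection avoids seeking altogether. This is only a sketch: gunzip_file is a hypothetical helper name (it is not part of any package), and I'm assuming your build of R reads the whole archive sequentially. Check that the output looks complete, and fall back to gunzip at the terminal if it does not.

```r
# Sketch: decompress a .gz file using only base R connections.
# `gunzip_file` is a hypothetical helper, not a function from any package.
gunzip_file <- function(gz_path,
                        out_path = sub("\\.gz$", "", gz_path),
                        chunk_bytes = 1024 * 1024) {
  src <- gzfile(gz_path, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wb")
  on.exit(close(dst), add = TRUE)

  # Copy the decompressed stream in fixed-size chunks so memory use stays flat.
  repeat {
    chunk <- readBin(src, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break
    writeBin(chunk, dst)
  }
  invisible(out_path)
}

# Usage (assumes lenta.warc.gz is in the working directory):
# gunzip_file("lenta.warc.gz")
```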

Now that you have data to work with, here are the libraries and some helper functions we'll need:

library(stringi)
library(purrr)
library(rvest)
library(dplyr)

#' get the number of records in a warc request
warc_request_record_count <- function(warc_file) {

  archive <- file(warc_file, open="r")

  rec_count <- 0

  while (length(line <- readLines(archive, n=1, warn=FALSE)) > 0) {
    if (grepl("^WARC-Type: request", line)) {
      rec_count <- rec_count + 1
    }
  }

  close(archive)

  rec_count
}

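NOTE: the function above is needed because it's far more efficient to allocate the list that holds these records at a known size than to grow it dynamically, especially if you have those 200K+ pages to scrape.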
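To make the note above concrete, here is a small sketch contrasting the two approaches. The record contents are dummy values; both loops build the same list, but the second creates it once at its final length, which is what warc_response_index() does with the record count.

```r
# Sketch: growing a list versus preallocating it (dummy records).
n <- 1000

# Growing: appending one element at a time can trigger repeated reallocation.
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1]] <- list(url = paste0("u", i))
}

# Preallocated: the list is created once and its slots are filled in place.
prealloc <- vector("list", n)
for (i in seq_len(n)) {
  prealloc[[i]] <- list(url = paste0("u", i))
}

# Both approaches produce the same result; only the allocation pattern differs.
identical(grown, prealloc)
```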

#' create a warc record index of the responses so we can
#' seek right to them and slurp them up
warc_response_index <- function(warc_file,
                                record_count=warc_request_record_count(warc_file)) {

  records <- vector("list", record_count)
  archive <- file(warc_file, open="r")

  idx <- 0
  record <- list(url=NULL, pos=NULL, length=NULL)
  in_request <- FALSE

  while (length(line <- readLines(archive, n=1, warn=FALSE)) > 0) {

    if (grepl("^WARC-Type:", line)) {
      if (grepl("response", line)) {
        if (idx > 0) {
          records[[idx]] <- record
          record <- list(url=NULL, pos=NULL, length=NULL)
        }
        in_request <- TRUE
        idx <- idx + 1
      } else {
        in_request <- FALSE
      }
    }

    if (in_request && grepl("^WARC-Target-URI:", line)) {
      record$url <- stri_match_first_regex(line, "^WARC-Target-URI: (.*)")[,2]
    }

    if (in_request && grepl("^Content-Length:", line)) {
      record$length <- as.numeric(stri_match_first_regex(line, "Content-Length: ([[:digit:]]+)")[,2])
      record$pos <- as.numeric(seek(archive, NA))
    }

  }

  close(archive)

  records[[idx]] <- record

  records

}

NOTE: that function records the position of each web site response in the file, so we can jump straight to them super fast.

#' retrieve an individual response record
get_warc_response <- function(warc_file, pos, length) {

  archive <- file(warc_file, open="r")

  seek(archive, pos)
  record <- readChar(archive, length)

  record <- stri_split_fixed(record, "\r\n\r\n", 2)[[1]]
  names(record) <- c("header", "page")

  close(archive)

  as.list(record)

}

Now, going through all of those pages is as simple as:

warc_file <- "~/data/lenta.warc"

responses <- warc_response_index(warc_file)

Well, that just gets the locations of all the pages in the WARC file. Here's how to get the content you need into a nice, tidy data.frame:

map_df(responses, function(r) {

  resp <- get_warc_response(warc_file, r$pos, r$length)

  # the wget WARC response is sticking a numeric value as the first
  # line for URLs from this site (and it's not a byte-order-mark). so,
  # we need to strip that off before reading in the actual response.
  # i'm pretty sure it's the site injecting this and not wget since i
  # don't see it on other test URLs I ran through this for testing.

  pg <- read_html(stri_split_fixed(resp$page, "\r\n", 2)[[1]][2])

  html_nodes(pg, xpath=".//div[@itemprop='articleBody']/..//p") %>%
    html_text() %>%
    paste0(collapse="") -> plantext

  title <- html_text(html_nodes(pg, xpath=".//head/title"))

  data.frame(url=r$url, title, plantext, stringsAsFactors=FALSE)

}) -> df

And we can check that it worked:

dplyr::glimpse(df)
## Observations: 6
## Variables: 3
## $ url      <chr> "https://lenta.ru/news/2009/12/31/kids/", "https://lenta.ru/news/2009/...
## $ title    <chr> "Новым детским омбудсменом стал телеведущий Павел Астахов: Россия: Len...
## $ plantext <chr> "Президент РФ Дмитрий Медведев назначил нового уполномоченного по прав...

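I'm sure others will have ideas for you (using GNU parallel with wget or curl at the command line, or using a parallel version of an apply function with your existing code), but this process is ultimately friendlier to the website provider and keeps a local copy of the content for further processing. Plus, it's in an ISO standard format for web archives that many, many tools can process (soon a few in R as well).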
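For completeness, the parallel route mentioned above might look roughly like the sketch below, which uses the base parallel package. fetch_one is a hypothetical stand-in for the per-URL work (for example GetPageText from the question) and does no network I/O, so the sketch runs anywhere. Keep the worker count small if you point it at a real site.

```r
library(parallel)

# Hypothetical stand-in for the real per-URL work; swap in your own function.
fetch_one <- function(url) {
  list(url = url, n_chars = nchar(url))
}

urls <- c("https://lenta.ru/news/2009/12/31/kids/",
          "https://lenta.ru/news/2009/12/31/silvio/",
          "https://lenta.ru/news/2009/12/31/postpone/")

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Results come back in the same order as `urls`.
results <- mclapply(urls, fetch_one, mc.cores = cores)
length(results)
```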

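Using R for file seeking/slurping like this is awful, but my WARC package isn't ready yet. It's C++-backed, so it's much faster and more efficient, but adding that much inline C++ code just for this answer was beyond the scope of an SO answer.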

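Even with the methodology I've described here, I'd break the URLs into chunks and process them in batches, both to be nice to the website and to avoid having to re-scrape everything if your connection breaks in the middle.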
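One way to do that batching, sketched in R: split the URL vector into fixed-size groups and write one .urls file per group. The helper name write_url_batches, the batch size, and the file-name pattern are all assumptions for illustration.

```r
# Sketch: write one URL file per batch so each can be fetched separately.
# `write_url_batches` and the "<prefix>-NNNN.urls" naming are hypothetical.
write_url_batches <- function(urls, batch_size = 1000,
                              dir = ".", prefix = "lenta") {
  # Assign each URL a batch number: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(urls) / batch_size)
  batches  <- split(urls, batch_id)

  paths <- file.path(dir, sprintf("%s-%04d.urls", prefix, seq_along(batches)))
  for (i in seq_along(batches)) {
    writeLines(batches[[i]], paths[i])
  }
  paths
}

# Usage (assumes `tempUrls` from the question):
# write_url_batches(tempUrls, batch_size = 2)
```

Each file can then go through its own run, e.g. wget --warc-file=lenta-0001 -i lenta-0001.urls, so an interrupted connection costs you one batch rather than the whole scrape.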

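Astute wget folks will ask why I'm not using the cdx option here. That was mostly to avoid complexity, and it's also kind of useless for the actual data processing since the R code has to seek to the records anyway. Using the cdx option (run man wget to see what I'm referring to) makes it possible to restart interrupted WARC scrapes, but you have to be careful how you handle it, so I left out the details for simplicity.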

For the number of pages you have, look into the progress_estimated() function in dplyr and consider adding a progress bar to the map_df code.
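A minimal sketch of the progress-bar idea follows. Note that progress_estimated() has been deprecated in newer dplyr releases, so this version swaps in base R's txtProgressBar() and a plain loop instead of map_df. process_record is a hypothetical stand-in for the extraction code, and responses here is dummy data rather than a real WARC index.

```r
# Dummy index entries standing in for the output of warc_response_index().
responses <- lapply(1:20, function(i) {
  list(url = paste0("https://example.com/", i))
})

# Hypothetical stand-in for the per-record extraction step.
process_record <- function(r) {
  data.frame(url = r$url, stringsAsFactors = FALSE)
}

pb   <- txtProgressBar(min = 0, max = length(responses), style = 3)
rows <- vector("list", length(responses))  # preallocate the result list

for (i in seq_along(responses)) {
  rows[[i]] <- process_record(responses[[i]])
  setTxtProgressBar(pb, i)  # advance the bar after each record
}
close(pb)

# Bind the one-row data frames into a single data.frame.
df <- do.call(rbind, rows)
nrow(df)
```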

parallel-processing

r

html-parsing