Web scraping in R (with loop)

Question

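I'm trying to scrape the Obama speeches site in order to build word clouds and similar things. When I run the code separately for 1, 5, or 10 different pages (speeches), rather than in a loop, it works. But with the loop I wrote (below), the resulting object contains nothing (NULL).

Can anyone help me?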

library(wordcloud)
library(tm)
library(XML)
library(RCurl)

site <- "http://obamaspeeches.com/"
url <- readLines(site)

h <- htmlTreeParse(file = url, asText = TRUE, useInternalNodes = TRUE, 
    encoding = "utf-8")

# getting the phrases that will form the web addresses for the speeches
teste <- data.frame(h[42:269, ])
teste2 <- teste[grep("href=", teste$h.42.269...), ]
teste2 <- as.data.frame(teste2)
teste3 <- gsub("^.*href=", "", teste2[, "teste2"])
teste3 <- as.data.frame(teste3)
teste4 <- gsub("^/", "", teste3[, "teste3"])
teste4 <- as.data.frame(teste4)
teste5 <- gsub(">.*$", "", teste4[, "teste4"])
teste5 <- as.data.frame(teste5)

# loop to read pages

l <- vector(mode = "list", length = nrow(teste5))
i <- 1
for (i in nrow(teste5)) {
    site <- paste("http://obamaspeeches.com/", teste5[i, ], sep = "")
    url <- readLines(site)
    l[[i]] <- url
    i <- i + 1
}

str(l)

Answer 1

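The rvest package makes this fairly simple by handling both the scraping and the parsing, though it may require some knowledge of CSS or XPath selectors. This is a much better approach than running regular expressions over HTML, which is discouraged in favor of a proper HTML parser (like rvest!).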
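
As a minimal sketch of what selector-based parsing looks like, the snippet below parses a small inline HTML string and pulls out the href attributes with an XPath selector. The markup and file names are made up for illustration and are not taken from the real page; only `read_html()`, `html_nodes()`, and `html_attr()` are actual rvest functions.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a real page
html <- '<html><body>
  <table align="left"><tr><td>
    <a href="speech-one.htm">Speech one</a>
    <a href="speech-two.htm">Speech two</a>
  </td></tr></table>
  <a href="elsewhere.htm">Unrelated link</a>
</body></html>'

# Parse the string, keep only <a> nodes inside the left-aligned table,
# and extract their href attributes
demo_links <- read_html(html) %>%
    html_nodes(xpath = '//table[@align="left"]//a') %>%
    html_attr('href')
```

The selector states which links are wanted in terms of the document structure, so the link outside the table is skipped without any string manipulation.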

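If you're scraping a bunch of subpages, you can build a vector of URLs and then lapply across it to fetch and parse each page. The advantage of this approach (versus a for loop) is that it returns a list with one item per iteration, which will be much easier to deal with afterwards. If you want to go full Hadleyverse, you can use purrr::map instead, which lets you turn the whole thing into one big sequential chain.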

library(rvest)

baseurl <- 'http://obamaspeeches.com/' 

         # For this website, get the HTML,
links <- baseurl %>% read_html() %>% 
    # select <a> nodes that are children of <table> nodes that are aligned left,
    html_nodes(xpath = '//table[@align="left"]//a') %>% 
    # and get the href (link) attribute of that node.
    html_attr('href')

            # Loop across the links vector, applying a function that
speeches <- lapply(links, function(url){
    # pastes the link onto the base URL,
    paste0(baseurl, url) %>% 
    # fetches the HTML for that page,
    read_html() %>% 
    # selects <table> nodes with a width of 610,
    html_nodes(xpath = '//table[@width="610"]') %>% 
    # get the text, trimming whitespace on the ends,
    html_text(trim = TRUE) %>% 
    # and break the text back into lines, trimming excess whitespace for each.
    textConnection() %>% readLines() %>% trimws()
})
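
For reference, the likely reason the loop in the question leaves the list mostly empty is that `for (i in nrow(teste5))` iterates over a single value (the row count) rather than over every row index; `seq_len(nrow(teste5))` would give the full sequence. The sketch below shows the difference offline, using a made-up data frame `pages` and a stand-in `fetch_page()` function (both hypothetical) in place of real network calls.

```r
# Hypothetical stand-ins: a small data frame of page names and a fake fetcher
pages <- data.frame(path = c("a.htm", "b.htm", "c.htm"),
                    stringsAsFactors = FALSE)
fetch_page <- function(path) paste("contents of", path)

# Pattern from the question: nrow(pages) is the single number 3,
# so the loop body runs exactly once, with i = 3
broken <- vector(mode = "list", length = nrow(pages))
for (i in nrow(pages)) {
    broken[[i]] <- fetch_page(pages$path[i])
}

# Repaired loop: seq_len() yields 1, 2, 3, so every element is filled
fixed <- vector(mode = "list", length = nrow(pages))
for (i in seq_len(nrow(pages))) {
    fixed[[i]] <- fetch_page(pages$path[i])
}

# lapply builds the same list without manual indexing
applied <- lapply(pages$path, fetch_page)
```

Here `broken` holds `NULL` in its first two slots, while `fixed` and `applied` contain all three results.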
