How to extract the table for each subject from the URL link in R - Webscraping

Question

I am trying to scrape the table for each subject.

The main page is https://htmlaccess.louisville.edu/classSchedule/setupSearchClassSchedule.cfm?error=0

I have to select each subject and click Search, which takes me to https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm

Each subject gives a different table. For the subject Accounting, I tried to get the table with the code below. I used the Selector Gadget Chrome extension to get the node string for html_nodes:

library(rvest)
library(tidyr)
library(dplyr)
library(ggplot2)

url <- "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm"
df <- read_html(url) 

tot <- df %>%
  html_nodes('table+ table td') %>%
  html_text()

But it does not work; the result is empty:

## show
tot
character(0)

Is there a way to get the tables for each subject in code, using R?

Answer 1

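Your problem is that the site requires a web form to be submitted, which is what happens when you click the "Search" button on the page. Without submitting that form, you cannot access the data. This is evident if you navigate directly to the link you are trying to scrape: enter it in your favorite web browser and you will see that there are no tables at all at "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm". No wonder nothing shows up!

Fortunately, you can submit web forms with R. It does require a bit more code, though. My favorite package for this is httr, which works well alongside rvest. Here is code that submits the form using httr and then proceeds with the rest of your code.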

library(rvest)
library(dplyr)
library(httr)

request_body <- list(
  term="4212",
  subject="ACCT", 
  catalognbr="",
  session="none",
  genEdCat="none",
  writingReq="none",
  comBaseCat="none",
  sustainCat="none",
  starttimedir="0",
  starttimehour="08",
  startTimeMinute="00",
  endTimeDir="0",
  endTimeHour="22",
  endTimeMinute="00",
  location="any",
  classstatus="0",
  Search="Search"
)

resp <- httr::POST(
  url = paste0("https://htmlaccess.louisville.edu/class",
               "Schedule/searchClassSchedule.cfm"), 
  encode = "form", 
  body = request_body)
httr::status_code(resp)
df <- httr::content(resp)

tot <- df %>%
  html_nodes("table+ table td") %>%
  html_text() %>%
  matrix(ncol=17, byrow=TRUE)

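On my machine, this returns a well-formatted matrix with the expected data. The challenge now is figuring out exactly what to put in the request body. For this, I use Chrome's "Inspect" tool (right-click on the web page and choose "Inspect"). On the "Network" tab of that side panel, you can track the information your browser sends. If I start on the main page and keep that panel open while I search for Accounting, the top entry is "searchClassSchedule.cfm", and clicking it opens its details. There you can see all the form fields that were submitted to the server, and I simply copied them into R by hand.

Your job will be to figure out the short names used for the other departments! "ACCT" appears to be the one for "Accounting". Once you have those names in a vector, you can loop over them with a for loop or an lapply statement: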
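The field names and the subject codes can probably also be read from the search form's HTML instead of being copied by hand. The following is a minimal sketch, not a tested recipe for the real site: the `page` object is a made-up stand-in for the search page, and the assumption that the real form has a `<select name="subject">` whose option values are the subject codes is inferred only from the `subject="ACCT"` field in the request body.

```r
library(rvest)

# Hypothetical stand-in for the search page. The real form may be structured
# differently; the option labels here are made up for illustration.
page <- read_html('
  <form action="searchClassSchedule.cfm" method="post">
    <select name="term"><option value="4212">Placeholder term</option></select>
    <select name="subject">
      <option value="ACCT">Accounting</option>
      <option value="AIRS">Placeholder label</option>
    </select>
    <input type="text" name="catalognbr" value="">
    <input type="submit" name="Search" value="Search">
  </form>')

# Names of the inputs and drop-downs that the form would submit.
field_names <- page %>%
  html_nodes("form input, form select") %>%
  html_attr("name")

# Option values of the subject drop-down, named by their visible labels.
subject_options <- html_nodes(page, "select[name='subject'] option")
dept_abbrevs <- setNames(html_attr(subject_options, "value"),
                         html_text(subject_options))
```

To try this against the live site, `page` would be replaced by `read_html()` on the setupSearchClassSchedule.cfm address from the question; whether the selectors match there has not been checked.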

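dept_abbrevs <- c("ACCT", "AIRS")
results <- lapply(dept_abbrevs, function(abbrev) {
  # ...code from above, up to and including the definition of request_body...
  # Swap in the subject code for this iteration
  request_body$subject <- abbrev
  # ...rest of the code (POST the form, parse the response, build the matrix)...
})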
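Filled in, the loop might look like the sketch below. It assumes the `request_body` list defined earlier in this answer and the 17-column layout used there; the helper names `cells_to_matrix` and `scrape_subject` are hypothetical, introduced only for illustration. Parsing is kept separate from the request so it can be checked without touching the server.

```r
library(rvest)
library(httr)

search_url <- paste0("https://htmlaccess.louisville.edu/class",
                     "Schedule/searchClassSchedule.cfm")

# Turn the flat vector of cell texts into a matrix with one row per class.
# n_col = 17 comes from the code above; a cell count that does not divide
# evenly raises an error instead of being silently recycled.
cells_to_matrix <- function(doc, n_col = 17, css = "table+ table td") {
  cells <- doc %>% html_nodes(css) %>% html_text(trim = TRUE)
  if (length(cells) %% n_col != 0) {
    stop("Cell count ", length(cells), " is not a multiple of ", n_col)
  }
  matrix(cells, ncol = n_col, byrow = TRUE)
}

# Submit the form for one subject code and parse the response.
scrape_subject <- function(abbrev, base_body) {
  body <- modifyList(base_body, list(subject = abbrev))
  resp <- POST(search_url, body = body, encode = "form")
  stop_for_status(resp)  # fail loudly on HTTP errors
  cells_to_matrix(content(resp))
}

# Usage (performs live requests; request_body is the list defined earlier):
# dept_abbrevs <- c("ACCT", "AIRS")
# tables <- lapply(dept_abbrevs, scrape_subject, base_body = request_body)
# names(tables) <- dept_abbrevs
```

Using `modifyList` leaves the original `request_body` untouched, so each iteration starts from the same base set of fields.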
