Web scraping the required content from a URL link in R
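I am very new to web scraping and am trying to scrape the required content from a link.
Here is the actual URL for the picture above: https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec
I would like the output to look like this: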
Sections Found Instructors email id
Academic Strategies - 10582 - ACAD 1100 - 001 Beverly McPhail
Academic Strategies - 10586 - ACAD 1100 - 002 Emily K Mann
Academic Strategies - 10590 - ACAD 1100 - 005 Christopher D Bourque
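The email id is not visible on the page; I can only see a symbol. I came across the rvest package in R and started using it as shown below, but I get an error: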
library(rvest)
url <- read_html("https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec")
Error in open.connection(x, "rb") : HTTP error 500.
To get to the data in the picture:
In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched`
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ACAD Academics -> scroll down and click Class Search
This takes you to the link https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec
May I know how to do this type of scraping in R? Thanks
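This is tricky. The page is only served after the server receives a POST request with the appropriate form data, so this is not a simple case of sending a plain GET request to the url, as read_html does. You need to build the POST request "by hand" to get the page you want.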
library(rvest)
#> Loading required package: xml2
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
# Form fields for the POST body; several names appear twice on purpose
# (first with "dummy", then with the real selection)
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ACAD",
              sel_crse = "", sel_title = "", sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
              sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
              sel_attr = "%", begin_hh = "0", begin_mi = "0",
              begin_ap = "a", end_hh = "0", end_mi = "0",
              end_ap = "a")
# Send the POST request and parse the response as HTML
html <- read_html(httr::POST(url, body = query))
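The query above repeats some field names (for example sel_subj), presumably mirroring what the search form submits. As a small sketch of why this works in R at all: a list can hold several entries with the same name, and each one is kept. The shortened `demo_query` below is a hypothetical cut-down version of the real query, used only for illustration.

```r
# Hypothetical, shortened version of the query list from the answer
demo_query <- list(term_in = "202110", sel_subj = "dummy", sel_subj = "ACAD")

# Both sel_subj entries are preserved, in order
names(demo_query)
n_subj <- sum(names(demo_query) == "sel_subj")
subj_values <- unlist(demo_query[names(demo_query) == "sel_subj"], use.names = FALSE)

# Note: `$` only returns the first entry with a matching name
first_subj <- demo_query$sel_subj
```

By default httr::POST encodes a list body as multipart form data; if a server expects a URL-encoded form instead, passing encode = "form" is one thing to try.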
Once you have the html, you can use xpath to get the nodes you want to scrape:
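# Section titles are the links inside the table header cells
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
# Instructor links: anchors with a mailto: href inside the data cells
instructor_nodes <- html %>%
  html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")
# The instructor's name is in the link's target attribute, the address in href
instructors <- html_attr(instructor_nodes, "target")
emails <- html_attr(instructor_nodes, "href")
df <- data.frame(classes, instructors, emails)
df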
#> classes instructors
#> 1 Academic Strategies - 10582 - ACAD 1100 - 001 Beverly McPhail
#> 2 Academic Strategies - 10586 - ACAD 1100 - 002 Emily K. Mann
#> 3 Academic Strategies - 10590 - ACAD 1100 - 005 Christopher D. Bourque
#> emails
#> 1 mailto:blahblah@memphis.edu
#> 2 mailto:blahbl@memphis.edu
#> 3 mailto:blahblah@memphis.edu
Note that I have obviously obscured the e-mail addresses of the people involved rather than posting them on a public web page without their consent.
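To see what the two XPath expressions select without contacting the server, here is a minimal sketch that runs them on a hand-written HTML fragment. The fragment's structure is an assumption inferred from the selectors in the answer, not a copy of the real page, and the names and addresses are placeholders. It also strips the mailto: prefix so that only the address remains.

```r
library(rvest)

# Hypothetical fragment: layout assumed from the XPath selectors in the answer;
# all names and addresses are placeholders
fragment_html <- '
<table>
  <tr><th><a href="#">Academic Strategies - 10582 - ACAD 1100 - 001</a></th></tr>
  <tr><td class="dddefault">Instructors: Jane Doe
    <a href="mailto:jdoe@example.edu" target="Jane Doe">(P)</a></td></tr>
  <tr><th><a href="#">Academic Strategies - 10586 - ACAD 1100 - 002</a></th></tr>
  <tr><td class="dddefault">Instructors: Richard Roe
    <a href="mailto:rroe@example.edu" target="Richard Roe">(P)</a></td></tr>
</table>'

demo_html <- read_html(fragment_html)

# Same selectors as in the answer
demo_classes <- demo_html %>% html_nodes(xpath = "//th/a") %>% html_text()
demo_nodes <- demo_html %>%
  html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")

demo_df <- data.frame(
  classes = demo_classes,
  instructors = html_attr(demo_nodes, "target"),
  # Drop the leading "mailto:" so only the address is kept
  emails = sub("^mailto:", "", html_attr(demo_nodes, "href")),
  stringsAsFactors = FALSE
)
demo_df
```

One caveat with this approach: data.frame() needs the vectors to line up, so if a section on the real page lists no instructor link, or more than one, the class and instructor vectors may no longer match and would have to be paired per section instead.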