How to loop over each class link and extract the capacity (seats) attribute in R
I want to extract the capacity (seats) attribute for every class listed at this link: https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec
If the posted link does not work, do the following:
In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched`
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search
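For example, for the subject ARCH, the classes look like this:
The images above show just a few of the classes for subject ARCH; there are many more. If you click on each class, you will see the capacity attribute, which shows the number of seats.
I would like the output to look like this: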
classes                                                    capacity - seats
Fundamentals of Design Studio - 23839 - ARCH 1111 - 002    15
Design Visualization - 11107 - ARCH 1113 - 001             15
Building Technology 2 - 23840 - ARCH 2412 - 001            20
How can I write a loop in R that gets the capacity (seats) attribute for every class of every subject?
P.S. This question is a follow-up to my previous post.
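This solution is very similar to the previous one.
It is more straightforward, because the link to the class size is in the same node as the class title. Depending on what information you need, the class size table will need cleaning up before it is merged with the remaining data.
Also, since the code queries multiple pages on the website, introduce a slight system pause to be polite and to avoid looking like a hacker.
Note that there is no error checking to make sure the correct table is available. I suggest you add it before turning this into production code.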
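The missing error check mentioned above could be sketched roughly as follows. The helper names `read_page_politely` and `find_seating_table` are hypothetical and not part of the original answer; the sketch assumes that a failed page read should be skipped with a message rather than stopping the whole loop.

```r
library(rvest)

# Hypothetical helper: pause briefly, then read a page.
# Returns NULL instead of raising an error when the read fails.
read_page_politely <- function(link, pause = 0.5) {
  Sys.sleep(pause)  # stay polite when many pages are requested
  tryCatch(
    read_html(link),
    error = function(e) {
      message("Could not read ", link, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical helper: return the seating table as a data frame,
# or NULL when the page is missing or has no such table.
find_seating_table <- function(page) {
  if (is.null(page)) return(NULL)
  tables <- html_nodes(page, "table.datadisplaytable")
  idx <- which(html_attr(tables, "summary") ==
                 "This layout table is used to present the seating numbers.")
  if (length(idx) == 0) return(NULL)
  html_table(tables[[idx[1]]], header = TRUE)
}
```

Inside the `lapply()` call, `read_html(fulllink)` could then be swapped for `read_page_politely(fulllink)`, returning `NULL` for any class whose seating table is not found.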
# https://stackoverflow.com/questions/64515601/problem-with-web-scraping-of-required-content-from-a-url-link-in-r/64517844#64517844
library(rvest)
library(dplyr)

# In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched`
# Select by term -> Spring Term 2021 (view only) -> Submit
# Subject -> select ARCH Architecture -> scroll down and click Class Search
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

# Note: several field names (e.g. sel_subj) appear twice, first with a
# "dummy" placeholder and then with the actual value.
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "", sel_title = "", sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
              sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
              sel_attr = "%", begin_hh = "0", begin_mi = "0",
              begin_ap = "a", end_hh = "0", end_mi = "0",
              end_ap = "a")

# submit the search form and parse the result page
html <- read_html(httr::POST(url, body = query))

# each class title (and its link) sits in a th.ddtitle node
classes <- html %>% html_nodes("th.ddtitle")

dfs <- lapply(classes, function(class) {
  # get class name
  classname <- class %>% html_text()
  print(classname)

  # pause so the requests do not look like a denial-of-service attack
  Sys.sleep(0.5)

  # build the full link to the class detail page and read it
  classlink <- class %>% html_node("a") %>% html_attr("href")
  fulllink <- paste0("https://ssb.bannerprod.memphis.edu", classlink)
  newpage <- read_html(fulllink)

  # find the tables
  tables <- newpage %>% html_nodes("table.datadisplaytable")
  # find the index of the correct table
  seatingtable <- which(html_attr(tables, "summary") == "This layout table is used to present the seating numbers.")
  size <- tables[seatingtable] %>% html_table(header = TRUE)

  # may want to clean up the table before combining into the data frame,
  # e.g. size[[1]][1, -1]
  data.frame(class = classname, size[[1]], link = fulllink)
})

answer <- bind_rows(dfs)
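The clean-up hinted at in the code comment could look something like the sketch below, which reduces the seating table to the single capacity value the question asks for. The HTML is a mock written for illustration; it assumes the real table has a "Seats" row and a "Capacity" column, so check that against an actual class page before relying on it.

```r
library(rvest)

# Mock of the seating table on a class detail page (assumed layout,
# made-up numbers; not fetched from the real site).
mock_page <- read_html('<table class="datadisplaytable" summary="This layout table is used to present the seating numbers.">
  <tr><th></th><th>Capacity</th><th>Actual</th><th>Remaining</th></tr>
  <tr><th>Seats</th><td>15</td><td>3</td><td>12</td></tr>
  <tr><th>Waitlist Seats</th><td>0</td><td>0</td><td>0</td></tr>
</table>')

seating <- html_table(html_node(mock_page, "table.datadisplaytable"),
                      header = TRUE)

# The first column holds the row labels; keep the Capacity value
# that belongs to the "Seats" row.
capacity <- as.integer(seating[["Capacity"]][seating[[1]] == "Seats"])

# One output row per class, in the shape requested in the question.
result <- data.frame(
  classes  = "Design Visualization - 11107 - ARCH 1113 - 001",
  capacity = capacity
)
print(result)
```

In the loop above, the same two extraction lines could be applied to `size[[1]]` so that each element of `dfs` carries only the class name and its capacity. Covering every subject would presumably mean repeating the POST with a different value for the second `sel_subj` field, which is an assumption about how the form behaves rather than something the original answer tested.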