Problem with web scraping of required content from a URL link in R

I am using the script below to scrape the required content from a link that lists classes for different subjects.

library(rvest)
url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

In the query above, the second sel_subj value ("ARCH" here) changes for each subject.

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
instructor_nodes <- html %>% 
  html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")

instructors <- html_attr(instructor_nodes, "target") 
emails <- html_attr(instructor_nodes, "href")

length(classes)
[1] 32
length(instructors)
[1] 39
length(emails)
[1] 39

sq <- seq(max(length(classes), length(instructors), length(emails)))
data.frame(classes[sq], instructors[sq], emails[sq])

The result looks like this, which is wrong:

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore mailto:mkchsmre@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Michael K. Chisamore mailto:mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Marika E. Snider mailto:mesnider@memphis.edu
7       Computer Apps in Design 2 - 11111 - ARCH 2612 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
8                 Design Studio 2 - 11112 - ARCH 2712 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
9                 Design Studio 2 - 15408 - ARCH 2712 - 002      Andrew M. Parks  mailto:amparks@memphis.edu
10  Survey of Interiors+Furniture - 25734 - ARCH 3213 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
11  Determinants of Modern Design - 27436 - ARCH 3221 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
12            Structural Design 2 - 23837 - ARCH 3322 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
13          Professional Practice - 25097 - ARCH 3431 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
14                Design Studio 4 - 11115 - ARCH 3714 - 001         Sonia Raheel  mailto:sraheel@memphis.edu
15                Design Studio 4 - 23221 - ARCH 3714 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
16 Architecture Independent Study - 11117 - ARCH 4021 - 201   Jennifer L. Barker mailto:jlbrker1@memphis.edu
17             Sustainable Design - 19491 - ARCH 4421 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
18     Internship in Architecture - 21000 - ARCH 4430 - 001     Marika E. Snider mailto:mesnider@memphis.edu
19                Design Studio 6 - 11134 - ARCH 4716 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
20             Sustainable Design - 19492 - ARCH 6421 - 001     Marika E. Snider mailto:mesnider@memphis.edu
21      Advanced Design Seminar 2 - 18387 - ARCH 7012 - 001     Marika E. Snider mailto:mesnider@memphis.edu
22    Contemporary Architecture 2 - 24104 - ARCH 7222 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
23     Internship in Architecture - 19495 - ARCH 7430 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
24      Adv Professional Practice - 19496 - ARCH 7431 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
25       Advanced Design Studio 2 - 18389 - ARCH 7712 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
26          Architecture Research - 25098 - ARCH 7930 - 001     Brian D. Andrews mailto:bdndrews@memphis.edu
27     Architecture Thesis Studio - 19499 - ARCH 7996 - 003 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
28     Architecture Thesis Studio - 19500 - ARCH 7996 - 004     Brian D. Andrews mailto:bdndrews@memphis.edu
29     Architecture Thesis Studio - 19501 - ARCH 7996 - 005      Andrew M. Parks  mailto:amparks@memphis.edu
30     Architecture Thesis Studio - 19502 - ARCH 7996 - 006     Michael D. Hagge  mailto:mdhagge@memphis.edu
31     Architecture Thesis Studio - 19503 - ARCH 7996 - 007     Brian D. Andrews mailto:bdndrews@memphis.edu
32     Architecture Thesis Studio - 20972 - ARCH 7996 - 008 Michael K. Chisamore mailto:mkchsmre@memphis.edu
33                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
34                                                     <NA>   Jennifer L. Barker mailto:jlbrker1@memphis.edu
35                                                     <NA> Michael K. Chisamore mailto:mkchsmre@memphis.edu
36                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
37                                                     <NA> Jennifer L. Thompson mailto:jlthmps5@memphis.edu
38                                                     <NA>     Brian D. Andrews mailto:bdndrews@memphis.edu
39                                                     <NA>     Marika E. Snider mailto:mesnider@memphis.edu

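But on the page itself, the data looks different.
For example:
A few classes have no instructor or email (shown as TBA).

A few other classes have two, three, four, or more instructors.

And a few other classes list the same instructor multiple times.

For data like this, I would like my output to look as follows: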

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002          TBA         
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore,Pamela J. Hurley mailto:mkchsmre@memphis.edu,pjhurley@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Pamela J. Hurley,Michael K. Chisamore mailto:pjhurley@memphis.edu,mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Marika E. Snider mailto:mesnider@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Timothy E. Michael mailto:tmichael@memphis.edu

P.S. If the posted URL does not work, follow these steps:

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search

How can I handle missing data (TBA), multiple instructors, and the same instructor listed multiple times?

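The problem lies in how html_nodes() is used. When it is applied to the whole page, it returns a flat list of every match, with no record of which parent node each value was found under. Since your page sometimes has several instructors for a class, and sometimes none, a more targeted approach is needed.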
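To make the difference concrete, here is a minimal sketch on a made-up HTML fragment. The markup, names, and addresses are invented for illustration and only loosely imitate the real page. It compares a page-wide query with a query run inside each class cell:

```r
library(rvest)

# Toy page (invented for illustration): three classes with 1, 0 and 2 instructors.
page <- read_html('
<table>
  <tr><th class="ddtitle"><a>Class A</a></th></tr>
  <tr><td class="dddefault"><a href="mailto:x@example.edu" target="Ann X">e</a></td></tr>
  <tr><th class="ddtitle"><a>Class B</a></th></tr>
  <tr><td class="dddefault">TBA</td></tr>
  <tr><th class="ddtitle"><a>Class C</a></th></tr>
  <tr><td class="dddefault">
    <a href="mailto:y@example.edu" target="Bob Y">e</a>
    <a href="mailto:z@example.edu" target="Cy Z">e</a>
  </td></tr>
</table>')

# Page-wide query: a flat vector, with no link back to the class each name belongs to.
flat <- page %>%
  html_nodes(xpath = "//a[contains(@href, 'mailto')]") %>%
  html_attr("target")

# Per-cell query: search inside each class cell separately (note the leading "."),
# so empty cells and cells with several instructors stay attached to their class.
cells <- page %>% html_nodes("td.dddefault")
per_class <- lapply(seq_along(cells), function(i) {
  cells[[i]] %>%
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]") %>%
    html_attr("target")
})

flat               # all instructor names in page order
lengths(per_class) # number of instructors found for each class
```

The flat vector has three names for three classes, yet they do not line up one per class, which is the same misalignment seen in the question's data frame.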

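In the code below, we first find each class node, which contains all of the information we want. We then parse each node individually (inside the lapply function) to extract the instructors and emails, checking for empty fields along the way. Each per-class data frame has one row per instructor, so a class with several instructors produces a data frame with several rows.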

We then assemble the list of per-class data frames into one (bind_rows) and combine the instructor and email results that belong to the same class.

library(rvest)
library(dplyr)

url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes("th.ddtitle") %>% html_text()

# find the detail cell for each class (the CSS selector "tr td.dddefault" is an alternative)
classinfo <- html %>% html_nodes(xpath = ".//tr/td[@class='dddefault']") 
classinfo <- classinfo[nchar(html_text(classinfo)) > 50]   # drop the short extra cells that are not class blocks

classlink <- classinfo %>% html_nodes("a") %>% html_attr("href")  #find all links
classlinktext <- classinfo %>% html_nodes("a") %>% html_text()    #find the link text
classlink <- classlink[classlinktext=="View Catalog Entry"]       #keep only the links for "View Catalog Entry"

dfs <- lapply(seq_along(classinfo), function(i) {
 # classname <-classes[i] %>% html_node(xpath = ".//a") %>% html_text()
  instructor_node <- classinfo[i] %>% html_nodes("table.datadisplaytable") %>% 
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]")
  
  instructors <- html_attr(instructor_node, "target") 
  emails <- html_attr(instructor_node, "href")
  # if no instructor was assigned to the class, record "TBD" and "NA" instead
  if(length(instructors)==0){
    instructors <- "TBD"
    emails <- "NA"
  }
  data.frame(classname=classes[i], link=classlink[i], instructors, emails)
})
   
# merge the list into a single data frame (one row per class/instructor pair)
answer <- bind_rows(dfs)

# consolidate the instructors within the same class
# (the grouping column is classname, the name given to it in data.frame() above)
finalanswer <- answer %>% 
  group_by(classname) %>% 
  summarize(instructors2 = paste(instructors, collapse = ", "), 
            emails = paste(emails, collapse = ", "))
# The paste(instructors, collapse = ", ") could instead be done inside the lapply
# loop, but doing it here adds some flexibility depending on whether
# answer or finalanswer is the end result.
head(finalanswer, 16)
tail(finalanswer, 16)
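
The question also asks about classes where the same instructor is listed more than once. The code above would repeat that name in the collapsed string. One possible way to handle it, sketched here on a made-up stand-in for `answer` (the rows and the `answer_demo` and `final_demo` names are invented for illustration), is to drop duplicate class/instructor rows before collapsing:

```r
library(dplyr)

# Toy stand-in for `answer` (invented rows): one row per class/instructor pair,
# with "Ann X" listed twice for Class A and no instructor for Class B.
answer_demo <- data.frame(
  classname   = c("Class A", "Class A", "Class B", "Class C", "Class C"),
  instructors = c("Ann X",   "Ann X",   "TBD",     "Bob Y",   "Cy Z"),
  emails      = c("mailto:x@example.edu", "mailto:x@example.edu", "NA",
                  "mailto:y@example.edu", "mailto:z@example.edu"),
  stringsAsFactors = FALSE
)

final_demo <- answer_demo %>%
  distinct(classname, instructors, emails) %>%  # drop repeated instructor rows
  group_by(classname) %>%
  summarise(instructors = paste(instructors, collapse = ", "),
            emails      = paste(emails, collapse = ", "))

as.data.frame(final_demo)
```

Applied to the real data, the same distinct() step would go between bind_rows(dfs) and group_by(classname). Whether repeated listings should be merged at all depends on what they mean on the page (for example, one row per meeting time), so treat this as an option and not a required step.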