Problem with web scraping of required content from a URL link in R
I am using a script to scrape the required content from a link that lists the classes for different subjects.
library(rvest)
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
sel_crse = "", sel_title = "", sel_insm = "%",
sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
sel_attr = "%", begin_hh = "0", begin_mi = "0",
begin_ap = "a", end_hh = "0", end_mi = "0",
end_ap = "a")
In the query above, sel_subj changes for each subject.
html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
instructor_nodes <- html %>%
html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")
instructors <- html_attr(instructor_nodes, "target")
emails <- html_attr(instructor_nodes, "href")
length(classes)
[1] 32
length(instructors)
[1] 39
length(emails)
[1] 39
sq <- seq(max(length(classes), length(instructors), length(emails)))
data.frame(classes[sq], instructors[sq], emails[sq])
The result looks like this, which is wrong:
classes.sq. instructors.sq. emails.sq.
1 Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2 Fundamentals of Design Studio - 23839 - ARCH 1111 - 002 Pamela J. Hurley mailto:pjhurley@memphis.edu
3 Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore mailto:mkchsmre@memphis.edu
4 Design Visualization - 18386 - ARCH 1113 - 002 Michael K. Chisamore mailto:mkchsmre@memphis.edu
5 History of Architecture 1 - 23218 - ARCH 1211 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
6 Building Technology 2 - 23840 - ARCH 2412 - 001 Marika E. Snider mailto:mesnider@memphis.edu
7 Computer Apps in Design 2 - 11111 - ARCH 2612 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
8 Design Studio 2 - 11112 - ARCH 2712 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
9 Design Studio 2 - 15408 - ARCH 2712 - 002 Andrew M. Parks mailto:amparks@memphis.edu
10 Survey of Interiors+Furniture - 25734 - ARCH 3213 - 001 Andrew M. Parks mailto:amparks@memphis.edu
11 Determinants of Modern Design - 27436 - ARCH 3221 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
12 Structural Design 2 - 23837 - ARCH 3322 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
13 Professional Practice - 25097 - ARCH 3431 - 001 Andrew M. Parks mailto:amparks@memphis.edu
14 Design Studio 4 - 11115 - ARCH 3714 - 001 Sonia Raheel mailto:sraheel@memphis.edu
15 Design Studio 4 - 23221 - ARCH 3714 - 002 Pamela J. Hurley mailto:pjhurley@memphis.edu
16 Architecture Independent Study - 11117 - ARCH 4021 - 201 Jennifer L. Barker mailto:jlbrker1@memphis.edu
17 Sustainable Design - 19491 - ARCH 4421 - 001 Jennifer L. Barker mailto:jlbrker1@memphis.edu
18 Internship in Architecture - 21000 - ARCH 4430 - 001 Marika E. Snider mailto:mesnider@memphis.edu
19 Design Studio 6 - 11134 - ARCH 4716 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
20 Sustainable Design - 19492 - ARCH 6421 - 001 Marika E. Snider mailto:mesnider@memphis.edu
21 Advanced Design Seminar 2 - 18387 - ARCH 7012 - 001 Marika E. Snider mailto:mesnider@memphis.edu
22 Contemporary Architecture 2 - 24104 - ARCH 7222 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
23 Internship in Architecture - 19495 - ARCH 7430 - 001 Jennifer L. Barker mailto:jlbrker1@memphis.edu
24 Adv Professional Practice - 19496 - ARCH 7431 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
25 Advanced Design Studio 2 - 18389 - ARCH 7712 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
26 Architecture Research - 25098 - ARCH 7930 - 001 Brian D. Andrews mailto:bdndrews@memphis.edu
27 Architecture Thesis Studio - 19499 - ARCH 7996 - 003 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
28 Architecture Thesis Studio - 19500 - ARCH 7996 - 004 Brian D. Andrews mailto:bdndrews@memphis.edu
29 Architecture Thesis Studio - 19501 - ARCH 7996 - 005 Andrew M. Parks mailto:amparks@memphis.edu
30 Architecture Thesis Studio - 19502 - ARCH 7996 - 006 Michael D. Hagge mailto:mdhagge@memphis.edu
31 Architecture Thesis Studio - 19503 - ARCH 7996 - 007 Brian D. Andrews mailto:bdndrews@memphis.edu
32 Architecture Thesis Studio - 20972 - ARCH 7996 - 008 Michael K. Chisamore mailto:mkchsmre@memphis.edu
33 <NA> Pamela J. Hurley mailto:pjhurley@memphis.edu
34 <NA> Jennifer L. Barker mailto:jlbrker1@memphis.edu
35 <NA> Michael K. Chisamore mailto:mkchsmre@memphis.edu
36 <NA> Pamela J. Hurley mailto:pjhurley@memphis.edu
37 <NA> Jennifer L. Thompson mailto:jlthmps5@memphis.edu
38 <NA> Brian D. Andrews mailto:bdndrews@memphis.edu
39 <NA> Marika E. Snider mailto:mesnider@memphis.edu
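But on the linked page the data looks different.
For example:
A few classes have no instructor or email (the page shows TBA).
A few other classes have two, three, four, or more instructors.
And a few other classes list the same instructor multiple times.
For data like this, I would like my output to look like this: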
classes.sq. instructors.sq. emails.sq.
1 Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2 Fundamentals of Design Studio - 23839 - ARCH 1111 - 002 TBA
3 Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore,Pamela J. Hurley mailto:mkchsmre@memphis.edu,pjhurley@memphis.edu
4 Design Visualization - 18386 - ARCH 1113 - 002 Pamela J. Hurley,Michael K. Chisamore mailto:pjhurley@memphis.edu,mkchsmre@memphis.edu
5 History of Architecture 1 - 23218 - ARCH 1211 - 001 Marika E. Snider mailto:mesnider@memphis.edu
6 Building Technology 2 - 23840 - ARCH 2412 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
P.S. If the posted URL does not work, follow these steps:
Open `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched`
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search
How do I handle missing data (TBA), multiple instructors, and the same instructor listed multiple times?
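The problem is the use of the html_nodes() function. It returns every matching node on the page as one flat list, with no record of which class each match belongs to. Because a class on your page can have several instructors or none at all, a more targeted approach is needed.
In the code block below, we first find the node for each class that contains all the information we want. We then parse each node individually (inside the lapply function) to extract the instructors and emails, checking for empty fields along the way. Each data frame has one row per instructor, so a class with multiple instructors produces a data frame with multiple rows.
We then combine the list of per-class data frames (bind_rows) and merge the instructor and email results that belong to the same class.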
library(rvest)
library(dplyr)
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
sel_crse = "", sel_title = "", sel_insm = "%",
sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
sel_attr = "%", begin_hh = "0", begin_mi = "0",
begin_ap = "a", end_hh = "0", end_mi = "0",
end_ap = "a")
html <- read_html(httr::POST(url, body = query))
# Class titles, in page order
classes <- html %>% html_nodes("th.ddtitle") %>% html_text()
# Cells holding the class details (catalog link and instructor table)
classinfo <- html %>% html_nodes(xpath = ".//tr/td[@class='dddefault']")
classinfo <- classinfo[nchar(html_text(classinfo)) > 50]  # eliminate the extra nodes found
classlink <- classinfo %>% html_nodes("a") %>% html_attr("href")  # find all links
classlinktext <- classinfo %>% html_nodes("a") %>% html_text()    # find the link text
classlink <- classlink[classlinktext == "View Catalog Entry"]     # keep only the "View Catalog Entry" links

# Parse each class cell separately so instructors stay attached to their class
dfs <- lapply(seq_along(classinfo), function(i) {
  instructor_node <- classinfo[i] %>% html_nodes("table.datadisplaytable") %>%
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]")
  instructors <- html_attr(instructor_node, "target")
  emails <- html_attr(instructor_node, "href")
  # If no instructor was assigned, fill in placeholders
  if (length(instructors) == 0) {
    instructors <- "TBD"
    emails <- "NA"
  }
  data.frame(classname = classes[i], link = classlink[i], instructors, emails)
})
# Merge the list into one data frame (one row per class and instructor)
answer <- bind_rows(dfs)
# Consolidate the instructors of the same class into a single row.
# Group by the classname column created above (not by the classes vector).
finalanswer <- answer %>%
  group_by(classname) %>%
  summarize(instructors2 = paste(instructors, collapse = ", "),
            emails = paste(emails, collapse = ", "))
# The paste(instructors, collapse = ", ") could be done inside the lapply
# loop, but doing it here adds some flexibility depending on whether
# answer or finalanswer is the end result.
head(finalanswer, 16)
tail(finalanswer, 16)
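The per-node idea can also be tried offline on a small made-up page. The sketch below is only an illustration: the HTML snippet, the class names, and the example.edu addresses are invented, and the real page's markup may differ. It also adds one step the script above does not perform: distinct() drops repeated instructor rows before collapsing, which is one possible way to handle an instructor listed several times for the same class.

```r
library(rvest)
library(dplyr)

# Made-up page: three class cells with one, zero, and three mailto links
toy_html <- read_html('
<table>
  <tr><td class="dddefault"><a href="mailto:a@example.edu" target="Ann A.">E</a></td></tr>
  <tr><td class="dddefault">TBA</td></tr>
  <tr><td class="dddefault">
    <a href="mailto:b@example.edu" target="Bob B.">E</a>
    <a href="mailto:a@example.edu" target="Ann A.">E</a>
    <a href="mailto:b@example.edu" target="Bob B.">E</a>
  </td></tr>
</table>')

class_names <- c("Class 1", "Class 2", "Class 3")  # hypothetical titles
cells <- html_nodes(toy_html, "td.dddefault")

# Flat search over the whole page: class boundaries are lost
flat <- html_nodes(toy_html, xpath = "//a[contains(@href, 'mailto')]")
length(flat)  # 4 links for 3 classes

# Per-cell search: every link stays attached to its own class
per_class <- lapply(seq_along(cells), function(i) {
  links <- html_nodes(cells[[i]], xpath = ".//a[contains(@href, 'mailto')]")
  instructors <- html_attr(links, "target")
  emails <- sub("^mailto:", "", html_attr(links, "href"))
  if (length(links) == 0) {   # no instructor assigned
    instructors <- "TBA"
    emails <- ""
  }
  data.frame(classname = class_names[i], instructors, emails,
             stringsAsFactors = FALSE)
})

result <- bind_rows(per_class) %>%
  distinct() %>%               # drop instructors repeated within a class
  group_by(classname) %>%
  summarise(instructors = paste(instructors, collapse = ", "),
            emails = paste(emails, collapse = ", "))
result
```

Here the third class ends up with "Bob B., Ann A." even though Bob appears twice in the markup, and the second class gets "TBA" with an empty email. Whether to deduplicate before or after bind_rows is a design choice; doing it on the combined data frame keeps the lapply body short.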