R Web Scraping Multiple Levels of a Website
I am a beginner at web scraping with R. In this case I started by trying a simple scrape with R. This is what I have done so far.
- Scraped the staff details from this website (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff); this is the code I used:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()
The above code works, and all of the scraped data is displayed.
- When you click on each staff member, you can get additional details such as research interests, areas of specialization, profile, etc. How can I get that data and attach it to the data set above for each staff member?
The code below will get you all of the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.
Most importantly, credit where credit is due, to @hrbrmstr:
The linked answer is slightly different in that it maps over a set of numbers rather than over a vector of URLs, as the code below does.
library(rvest)
library(purrr)
library(stringr)
library(dplyr)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()
#extract the names
names <- names[-c(3,4)]
#drop the head of department and blank space
names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names
content <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_text()
content <- content[! content %in% "+"]
#drop the "+" from the content
content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on
links <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_nodes("a") %>%
  html_attr("href")
#create a vector of href links
url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages
prof_info <- map_df(urls, function(x) {
  #anonymous function to pull the data for one professor
  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #extract the prof's name from the url
  page <- read_html(x)
  #read each page in the urls vector
  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the section titles
  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #extract the info from each section
  data.frame(sections = sections, info = info, prof_name = prof_name)
  #return a long-format data frame: one row per section, with the section
  #title, its text, and the professor's name as columns
})
#note this returns a single data frame. Change map_df to map if you want a
#list of per-professor data frames instead
prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages
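If you would rather have one row per professor, with each section (research interests, profile, and so on) as its own column instead of the long format above, something along these lines with tidyr should work. This assumes each professor's page has at most one panel per section title, which I have not checked for every page:
library(tidyr)
prof_info_wide <- prof_info %>%
  pivot_wider(names_from = sections, values_from = info)
#spread the section titles into columns; sections a professor lacks become NA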
Not sure this is the cleanest or most efficient way to do it, but I think this is what you're after.
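One practical note when looping over many pages like this: servers can be slow or fail intermittently, so it may be worth pausing between requests and guarding read_html() so that a single bad page does not abort the whole run. A minimal sketch of that idea for the map_df step above, using purrr's possibly (the one-second pause is an arbitrary choice):
safe_read <- possibly(read_html, otherwise = NULL)
#returns NULL instead of throwing an error when a page cannot be read
prof_info <- map_df(urls, function(x) {
  Sys.sleep(1)
  #pause briefly between requests to go easy on the server
  page <- safe_read(x)
  if (is.null(page)) return(NULL)
  #bind_rows (which map_df uses) drops NULL results, so failed pages are skipped
  data.frame(
    sections = page %>% html_nodes(".sppb-panel-title") %>% html_text(),
    info = page %>%
      html_nodes(".sppb-panel-body") %>%
      html_nodes(".sppb-addon-content") %>%
      html_text(),
    prof_name = gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  )
})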