Extracting text that has no clear XPath with rvest in R
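I have several web pages to scrape (sample HTML below). In my example I want to get the company name, location, salary, and posting date. This is how I tried to get the company name: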
library(xml2)
library(rvest)
library(tidyverse)
url <- "https://joblist.ala.org/job/library-director/53812381/"
page <- xml2::read_html(url)
company_name <- page %>%
rvest::html_nodes("li") %>%
rvest::html_nodes(xpath = '//*[@class="clearfix"]') %>%
#rvest::html_nodes("div")%>%
rvest::html_nodes("span") %>%
#rvest::html_name()%>%
rvest::html_text()%>%
stringr::str_replace_all("[\r\n\t]" , "")%>%
stringr::str_trim()
However, this yields:
# [1] "Description"
# [2] "We are looking for a Skilled, Dynamic, and Collaborative Leader"
# [3] "Mobile Public Library"
# [4] ""
# [5] "Mobile, Alabama, United States"
# [6] "53812381"
# [7] "April 21, 2020"
# [8] "Library Director"
# [9] "Mobile Public Library"
# [10] "Public Library"
# [11] "Administration/Management"
# [12] "No"
# [13] "Full-Time"
# [14] "Indefinite"
# [15] "Master's Degree"
# [16] "5-7 Years"
# [17] "0-10%"
# [18] "Jobs You May Like"
I thought I could get what I wanted by index, but when I move to the next page, the positions of some elements change. For example:
url <- "https://joblist.ala.org/job/ceo-library-director-orange-county-library-system/53673222/"
page <- xml2::read_html(url)
company_name <- page %>%
rvest::html_nodes("li") %>%
rvest::html_nodes(xpath = '//*[@class="clearfix"]') %>%
#rvest::html_nodes("div")%>%
rvest::html_nodes("span") %>%
#rvest::html_name()%>%
rvest::html_text()%>%
stringr::str_replace_all("[\r\n\t]" , "")%>%
stringr::str_trim()
Yields:
# [1] "Description"
# [2] "Requirements"
# [3] "Orange County Library System"
# [4] ""
# [5] "Orlando, Florida, 32801, United States"
# [6] "53673222"
# [7] "April 1, 2020"
# [8] "CEO / Library Director - Orange County Library System"
# [9] "Orange County Library System"
# [10] "Public Library"
# [11] "Administration/Management"
# [12] "No"
# [13] "Full-time"
# [14] "Indefinite"
# [15] "Master's Degree"
# [16] "Over 10 Years"
# [17] "10-25%"
# [18] "1,882.00 - 0,000.00 (Yearly Salary)"
# [19] "Jobs You May Like"
The browser inspector shows the following:
<ul>
<li class="clearfix">
<div>Location: </div>
<span class="">
Orlando, Florida, 32801, United States
</span>
</li>
<li class="clearfix">
<div>Job ID: </div>
<span class="">53673222</span>
</li>
<li class="clearfix">
<div>Posted: </div>
<span class="">April 1, 2020</span>
</li>
<li class="clearfix">
<div>Position Title: </div>
<span class="">CEO / Library Director - Orange County Library System</span>
</li>
<li class="clearfix">
<div>Company Name: </div>
<span class="">Orange County Library System</span>
</li>
<li class="clearfix">
<div>Library or Company Type: </div>
<span class="">Public Library</span>
</li>
<li class="clearfix">
<div>Job Category: </div>
<span class="">Administration/Management</span>
</li>
<li class="clearfix">
<div>Entry Level: </div>
<span class="">No</span>
</li>
<li class="clearfix">
<div>Job Type: </div>
<span class="break-all">Full-time</span>
</li>
<li class="clearfix">
<div>Job Duration: </div>
<span class="break-all">Indefinite</span>
</li>
<li class="clearfix">
<div>Min Education: </div>
<span class="break-all">Master's Degree</span>
</li>
<li class="clearfix">
<div>Min Experience: </div>
<span class="break-all">Over 10 Years</span>
</li>
<li class="clearfix">
<div>Required Travel: </div>
<span class="break-all">10-25%</span>
</li>
<li class="clearfix">
<div>Salary: </div>
<span class="break-all">1,882.00 - 0,000.00 (Yearly Salary)</span>
</li>
</ul>
Could someone show me how to get the company name, so that I can replicate the approach for the other fields? I'm not good with HTML. Thanks!
Since there is no specific class for each category, we can use regular expressions to extract the information.
library(rvest)
url <- "https://joblist.ala.org/job/library-director/53812381/"
page <- xml2::read_html(url)
page %>%
html_nodes("li") %>%
html_nodes(xpath = '//*[@class="clearfix"]') %>%
html_text() %>%
gsub('[\r\n\t]', '', .) %>%
grep('Company Name:', ., value = TRUE) %>%
sub('Company Name:', '', .) %>% .[2]
#[1] " Mobile Public Library"
You can extract the other categories the same way. For example, with 'Position Title:':
page %>%
html_nodes("li") %>%
html_nodes(xpath = '//*[@class="clearfix"]') %>%
html_text() %>%
gsub('[\r\n\t]', '', .) %>%
grep('Position Title:', ., value = TRUE) %>%
sub('Position Title:', '', .) %>% .[2]
#[1] " Library Director"
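As an alternative to the regex, an XPath predicate can select the `<span>` whose sibling `<div>` carries the label. The following is a minimal sketch, assuming the markup matches the inspector output in the question; `sample_html` and `sample_page` are illustrative names, and the live page may contain additional matching nodes.

```r
library(rvest)

# Abbreviated sample modeled on the inspector output in the question.
sample_html <- '<ul>
  <li class="clearfix"><div>Position Title: </div><span class="">CEO / Library Director - Orange County Library System</span></li>
  <li class="clearfix"><div>Company Name: </div><span class="">Orange County Library System</span></li>
</ul>'
sample_page <- read_html(sample_html)

# Pick the <li> whose <div> label equals "Company Name:" (after whitespace
# normalization), then take its <span> child.
company_name <- sample_page %>%
  html_nodes(xpath = '//li[@class="clearfix"][div[normalize-space(.) = "Company Name:"]]/span') %>%
  html_text(trim = TRUE)

company_name
# [1] "Orange County Library System"
```

Because the match is on the label text rather than on position, it should not be affected when an optional field such as Salary is present on one page and missing on another.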
Alternatively, you could write a single function and pass strings such as "Company Name:" and "Position Title:" to it.
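Such a function might look like the sketch below. `extract_field` and its `n` argument are hypothetical names, not part of rvest, and the sample HTML is abbreviated from the question. It uses the CSS selector `li.clearfix` rather than the chained selectors above, so on the live page the number of matches may differ (the code above takes the second match there); `n` lets you choose which one.

```r
library(rvest)

# Hypothetical helper: return the value following `label` in the n-th
# <li class="clearfix"> item whose text contains that label.
extract_field <- function(page, label, n = 1) {
  items <- page %>%
    html_nodes("li.clearfix") %>%
    html_text() %>%
    gsub("[\r\n\t]", "", .)
  hit <- grep(label, items, value = TRUE, fixed = TRUE)
  if (length(hit) < n) return(NA_character_)  # label not found on this page
  trimws(sub(label, "", hit[n], fixed = TRUE))
}

# Abbreviated sample modeled on the inspector output in the question.
sample_html <- '<ul>
  <li class="clearfix"><div>Position Title: </div><span class="">CEO / Library Director - Orange County Library System</span></li>
  <li class="clearfix"><div>Company Name: </div><span class="">Orange County Library System</span></li>
</ul>'
sample_page <- read_html(sample_html)

extract_field(sample_page, "Company Name:")
# [1] "Orange County Library System"
extract_field(sample_page, "Salary:")
# [1] NA
```

Returning `NA` for a missing label keeps the result the same length across pages, which is convenient when some listings omit a field such as Salary.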