rvest: cannot get follow_link() or jump_to() to follow links in an unordered list
I am scraping a website:
url <- "https://www.rsaconference.com/usa/expo-and-sponsors"
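The site has a row of alphabet links, i.e. A B C D ... Z. Below is the HTML copied from the site. If I want to follow the link for, say, the letter 'B' or 'L', what is the best way to do that with the rvest package?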
<ul class="search-a-z" data-field-id="search__filter-letter">
<li class="search-a-z__item"><a class="link link--default search-a-z__filter search-a-z__filter--disabled" href="#search-a-z" title="Filter results by "> </a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by #">#</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by A">A</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by B">B</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by C">C</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by D">D</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by E">E</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by F">F</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by G">G</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by H">H</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by I">I</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by J">J</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by K">K</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by L">L</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by M">M</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by N">N</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by O">O</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter search-a-z__filter--disabled" href="#search-a-z" title="Filter results by Ø">Ø</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by P">P</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Q">Q</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by R">R</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by S">S</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by T">T</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by U">U</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by V">V</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by W">W</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by X">X</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Y">Y</a></li>
<li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Z">Z</a></li>
<li class="search-a-z__item search-a-z__item--clear">
<a class="link link--default search-a-z__clear" href="#search-a-z" title="Clear filter">Clear Filter</a>
</li>
I tried the following, but it failed immediately:
s <- html_session(url)
s %>% follow_link(i="Filter results by B")
Error: No links have text 'Filter results by B'
I also tried:
s %>% html_node(".search-a-z__item:nth-child(4)") %>% follow_link()
Error in follow_link(.) : is.session(x) is not TRUE
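My objective is to go through each of the A to Z links and scrape the company names from each page.
I have searched many Stack Overflow questions, such as "Looping through a list of webpages with rvest follow_link", but could not solve this problem.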
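When you click a letter on the page, JavaScript sends an XHR POST request to a different URL on the server, with the request encoded as nested JSON. The bad news is that you need to do the same thing to scrape the data. The good news is that if you write the request correctly, you can get all the data in a single gulp.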
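One way to see why follow_link() has nothing useful to follow here is to parse a few of the letter links and look at their attributes. The sketch below assumes rvest is installed and uses a trimmed copy of the markup from the question (class lists shortened, three letters only; `snippet` and `link_info` are illustrative names). Every href is the same in-page fragment, and the visible link text is just the letter, while "Filter results by B" sits in the title attribute.

```r
library(rvest)

# Trimmed copy of three letter links from the question's markup
# (class attributes shortened for readability; this is an assumption-free
# subset of the HTML, not the full list).
snippet <- '<ul class="search-a-z">
  <li class="search-a-z__item"><a class="search-a-z__filter" href="#search-a-z" title="Filter results by A">A</a></li>
  <li class="search-a-z__item"><a class="search-a-z__filter" href="#search-a-z" title="Filter results by B">B</a></li>
  <li class="search-a-z__item"><a class="search-a-z__filter" href="#search-a-z" title="Filter results by L">L</a></li>
</ul>'

# Parse the literal HTML and select every anchor element
links <- html_nodes(read_html(snippet), "a")

# Collect the visible text, the title attribute and the link target
link_info <- data.frame(
  text  = html_text(links),
  title = html_attr(links, "title"),
  href  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
print(link_info)
```

As far as I know, follow_link() matches a string against the link text rather than the title, which would explain the "No links have text" error. Even a matching link would only lead to #search-a-z on the same page, so the filtering has to be reproduced as an HTTP request instead.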
To send that kind of POST request from R, you will need the RCurl or httr package, which give you this level of control over HTTP requests.
# We'll use httr and tidyverse
library(tidyverse)
library(httr)
# This is the actual url that sends the JSON data
url <- "https://www.rsaconference.com/api/Search/FilteredSearch"
# These are the parameters we want to post. Note I have left the searchFilterLetter
# field blank so it sends us everything.
params <- list(defaultFilterContentType = "Exhibitor",
               searchInput = "",
               searchFilterLetter = "",
               exhibitorLocation = "none",
               exhibitorType = "none",
               filterTopicsTypeahead = "",
               filterTopics = "",
               searchSort = "alpha",
               filterRegion = "USA",
               filterConferenceYear = "2020")
# A complicating factor is that the above parameters are wrapped inside another
# parameter called formData, along with two other parameters. Note I want all
# exhibitors so I set resultsPerPage to 1000
body <- list(page = 1, resultsPerPage = 1000, formData = params)
# Now we post the form to the url and read the parsed JSON response.
# I have selected two fields from the resulting list.
POST(url, body = body, encode = "json") %>%
  content("parsed") %>%
  `[[`("results") %>%
  lapply(function(x) data.frame(name = x$title, url = x$url)) %>%
  {do.call("rbind", .)} %>%
  as_tibble ->
  all_exhibitors
Here is the result:
print(all_exhibitors)
#> # A tibble: 635 x 2
#>    name                     url
#>    <fct>                    <fct>
#>  1 1TOUCH.io                /usa/expo-and-sponsors/1touchio
#>  2 360 Group                /usa/expo-and-sponsors/360-security-group
#>  3 Abnormal Security        /usa/expo-and-sponsors/abnormal-security-corporation
#>  4 Acalvio Technologies     /usa/expo-and-sponsors/acalvio-technologies
#>  5 Accedian                 /usa/expo-and-sponsors/accedian-networks
#>  6 achelos GmbH             /usa/expo-and-sponsors/achelos-gmbh
#>  7 ACID Technologies        /usa/expo-and-sponsors/acid-technologies
#>  8 Active Defense Institute /usa/expo-and-sponsors/active-defense-institute-ltd
#>  9 Acunetix                 /usa/expo-and-sponsors/acunetix
#> 10 Adaptiva                 /usa/expo-and-sponsors/adaptiva
#> # ... with 625 more rows
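Since the single request already returns every exhibitor, the original goal of going through A to Z can probably be met by grouping the names locally rather than requesting each letter separately. A minimal sketch, using a few rows copied from the printed output as stand-in data (`sample_exhibitors` and `by_letter` are illustrative names; in practice you would use all_exhibitors$name). Putting names that do not start with A-Z under "#" is an assumption meant to mirror the "#" link on the page.

```r
# Stand-in data: a few names copied from the printed result.
# In practice, replace this with all_exhibitors.
sample_exhibitors <- data.frame(
  name = c("1TOUCH.io", "360 Group", "Abnormal Security",
           "achelos GmbH", "Acunetix"),
  stringsAsFactors = FALSE
)

# as.character() guards against the name column being a factor
exhibitor_names <- as.character(sample_exhibitors$name)

# Upper-case first character of each name
first_char <- toupper(substr(exhibitor_names, 1, 1))

# Assumption: anything not starting with A-Z is filed under "#"
letter <- ifelse(first_char %in% LETTERS, first_char, "#")

# Named list with one character vector of company names per letter
by_letter <- split(exhibitor_names, letter)
print(by_letter)
```

by_letter[["A"]] then holds the company names starting with A, and looping over names(by_letter) visits each letter that actually occurs in the data.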