rvest: cannot get follow_link() or jump_to() to follow the links in an unordered list

I am scraping a website: url <- "https://www.rsaconference.com/usa/expo-and-sponsors"

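The site has a list of alphabet links, i.e. A B C D ... Z. Below is the HTML copied from the site. If I want to follow the link for, say, the letter 'B' or 'L', what is the best way to do it with the rvest package?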

        <ul class="search-a-z" data-field-id="search__filter-letter">
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter search-a-z__filter--disabled" href="#search-a-z" title="Filter results by  "> </a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by #">#</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by A">A</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by B">B</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by C">C</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by D">D</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by E">E</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by F">F</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by G">G</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by H">H</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by I">I</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by J">J</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by K">K</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by L">L</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by M">M</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by N">N</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by O">O</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter search-a-z__filter--disabled" href="#search-a-z" title="Filter results by &#216;">&#216;</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by P">P</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Q">Q</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by R">R</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by S">S</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by T">T</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by U">U</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by V">V</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by W">W</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by X">X</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Y">Y</a></li>
        <li class="search-a-z__item"><a class="link link--default search-a-z__filter " href="#search-a-z" title="Filter results by Z">Z</a></li>
<li class="search-a-z__item search-a-z__item--clear">
    <a class="link link--default search-a-z__clear" href="#search-a-z" title="Clear filter">Clear Filter</a>
</li>

I tried the following, but it fails immediately:

s <- html_session(url)     
s %>% follow_link(i="Filter results by B")

Error: No links have text 'Filter results by B'

I also tried:

s %>% html_node(".search-a-z__item:nth-child(4)") %>% follow_link()

Error in follow_link(.) : is.session(x) is not TRUE

My objective is to iterate over each of the A to Z links and scrape the company names from each page.

I have searched many Stack Overflow questions, such as "Looping through a list of webpages with rvest follow_link", but could not solve this.

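When you click a letter on the page, JavaScript sends an XHR POST request to a different URL on the server, with the request encoded as nested JSON. The bad news is that you need to do the same thing to scrape the data. The good news is that if you write the request correctly, you can get all the data in one gulp.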
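To see why follow_link() gets nowhere on this page, it may help to inspect the letter links themselves. The sketch below parses a trimmed copy of two of the list items from the question (the class attributes are shortened for readability, which is my simplification) and pulls out each link's text, href and title. It assumes, as in the rvest version used in the question, that a string passed to follow_link() is matched against the link text rather than the title attribute.

```r
library(rvest)

# Two of the <li> items from the question, pasted in as a string
# (class attributes trimmed for readability)
html <- '<ul class="search-a-z">
  <li class="search-a-z__item"><a class="link search-a-z__filter" href="#search-a-z" title="Filter results by A">A</a></li>
  <li class="search-a-z__item"><a class="link search-a-z__filter" href="#search-a-z" title="Filter results by B">B</a></li>
</ul>'

links <- read_html(html) %>% html_nodes("a.search-a-z__filter")

link_text  <- html_text(links)           # the visible text of each link
link_href  <- html_attr(links, "href")   # where each link points
link_title <- html_attr(links, "title")  # the tooltip, e.g. "Filter results by B"

print(data.frame(link_text, link_href, link_title))
```

The visible text of each link is just the letter, which would explain the "No links have text" error for 'Filter results by B'. Every href is the same in-page fragment, so even a successful match would not load a new page. The second attempt fails for a different reason: html_node() returns a node, while follow_link() expects the session as its first argument.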

You will need the RCurl or httr package to give you this level of control over the HTTP request.

# We'll use httr and tidyverse
library(tidyverse)
library(httr)

# This is the actual url that sends the JSON data
url <- "https://www.rsaconference.com/api/Search/FilteredSearch"

# These are the parameters we want to post. Note I have left the searchFilterLetter
# field blank so it sends us everything.
params <- list(defaultFilterContentType = "Exhibitor",
               searchInput = "",
               searchFilterLetter = "",
               exhibitorLocation = "none",
               exhibitorType = "none",
               filterTopicsTypeahead = "",
               filterTopics = "",
               searchSort = "alpha",
               filterRegion = "USA",
               filterConferenceYear = "2020")

# A complicating factor is that the above parameters are wrapped inside another
# parameter called formData, along with two other parameters. Note I want all
# exhibitors so I set resultsPerPage to 1000
body <- list(page = 1, resultsPerPage = 1000, formData = params)

# Now we post the form to the url and read the parsed JSON response.
# I have selected two fields from the resulting list.
POST(url, body = body, encode = "json")                     %>%
content("parsed")                                           %>%
`[[`("results")                                             %>%
lapply(function(x) data.frame(name = x$title, url = x$url)) %>%
{do.call("rbind", .)}                                       %>%
as_tibble                                                    ->
all_exhibitors

Here is your result...

print(all_exhibitors)
#> # A tibble: 635 x 2
#>    name                     url                                                 
#>    <fct>                    <fct>                                               
#>  1 1TOUCH.io                /usa/expo-and-sponsors/1touchio                     
#>  2 360 Group                /usa/expo-and-sponsors/360-security-group           
#>  3 Abnormal Security        /usa/expo-and-sponsors/abnormal-security-corporation
#>  4 Acalvio Technologies     /usa/expo-and-sponsors/acalvio-technologies         
#>  5 Accedian                 /usa/expo-and-sponsors/accedian-networks            
#>  6 achelos GmbH             /usa/expo-and-sponsors/achelos-gmbh                 
#>  7 ACID Technologies        /usa/expo-and-sponsors/acid-technologies            
#>  8 Active Defense Institute /usa/expo-and-sponsors/active-defense-institute-ltd 
#>  9 Acunetix                 /usa/expo-and-sponsors/acunetix                     
#> 10 Adaptiva                 /usa/expo-and-sponsors/adaptiva                     
#> # ... with 625 more rows
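Since the original goal was company names per letter, the single result can be grouped locally instead of requesting each letter separately. Below is a minimal sketch using a few rows copied from the output above as a stand-in for all_exhibitors. The "#" bucket for names that do not start with A to Z is my guess at how the site's "#" link behaves, and letter_body() is a hypothetical helper that assumes the server filters on searchFilterLetter (the code above only shows that field left blank).

```r
# A few rows copied from the printed result, standing in for all_exhibitors
exhibitors <- data.frame(
  name = c("1TOUCH.io", "360 Group", "Abnormal Security", "Acunetix"),
  url  = c("/usa/expo-and-sponsors/1touchio",
           "/usa/expo-and-sponsors/360-security-group",
           "/usa/expo-and-sponsors/abnormal-security-corporation",
           "/usa/expo-and-sponsors/acunetix"),
  stringsAsFactors = FALSE
)

# Turn the site-relative paths into full URLs
exhibitors$full_url <- paste0("https://www.rsaconference.com", exhibitors$url)

# Bucket names by first character; anything outside A-Z goes under "#"
# (assumption: this mirrors the "#" entry in the site's letter list)
first <- toupper(substr(exhibitors$name, 1, 1))
exhibitors$letter <- ifelse(first %in% LETTERS, first, "#")
by_letter <- split(exhibitors$name, exhibitors$letter)

# Hypothetical helper: the request body for one letter, ASSUMING the server
# filters on searchFilterLetter. It reuses the params list defined earlier.
letter_body <- function(params, letter) {
  list(page = 1, resultsPerPage = 1000,
       formData = modifyList(params, list(searchFilterLetter = letter)))
}

print(by_letter)
```

If per-letter requests are still preferred, something like POST(url, body = letter_body(params, "B"), encode = "json") should work under the same assumption, but grouping the one response locally avoids 27 round trips to the server.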