Get nodes from an HTML webpage to crawl URLs using R

https://i.stack.imgur.com/xeczg.png

I am trying to get the URLs under the node ".2lines" from the webpage "https://www.sgcarmart.com/main/index.php":

library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()

I get the following error from the html_nodes function:

Error in parse_simple_selector(stream) : 
  Expected selector, got <NUMBER '.2' at 1>

How can I fix this error?

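You can use an XPath selector to find the nodes you want. The CSS selector fails because a class name written with the dot syntax cannot begin with a digit, so ".2lines" is rejected by the selector parser. The links themselves are contained in <a> tags inside the <p> tags that you are trying to reference by class. You can reach them with a single XPath expression: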

library(rvest)

site <- 'https://www.sgcarmart.com'

urls <-  site                                           %>%
         paste0("/main/index.php")                      %>%
         read_html()                                    %>% 
         html_nodes(xpath = "//*[@class = '2lines']/a") %>% 
         html_attr("href")                              %>%
         {paste0(site, .)}

urls
#>  [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#>  [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#>  [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#>  [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#>  [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#>  [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#>  [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#>  [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#>  [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
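
If you would rather stay with CSS selectors, one possible alternative is an attribute selector, where the class name is a quoted string and so may start with a digit. The sketch below is only an illustration: the HTML fragment and its href values are made up to stand in for the real page, and it assumes the page structure described above (<a> tags directly inside elements with class "2lines").

```r
library(rvest)

# Hypothetical HTML fragment standing in for the real page (an assumption,
# not the actual markup of sgcarmart.com)
page <- read_html('
  <p class="2lines"><a href="/new_cars/a.php?CarCode=1">Car A</a></p>
  <p class="2lines"><a href="/new_cars/b.php?CarCode=2">Car B</a></p>
  <p class="other"><a href="/ignored.php">Other</a></p>
')

site <- "https://www.sgcarmart.com"

# [class~='2lines'] matches elements whose class list contains "2lines";
# because the value is a quoted string, the leading digit is not a problem
hrefs <- page %>%
  html_nodes("[class~='2lines'] > a") %>%
  html_attr("href")

# Prefix the relative links with the site root, as in the XPath version
urls <- paste0(site, hrefs)
urls
```

Unlike the XPath test @class = '2lines', which requires the class attribute to be exactly "2lines", the ~= form also matches elements that carry additional classes, so the two approaches can return different results on pages that mix classes.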