Error in open.connection(x, "rb") : HTTP error 404 in R

Question

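I am trying to scrape data with R to get details about some of the listings on the website below, but I get an error that I am not sure how to resolve: Error in open.connection(x, "rb") : HTTP error 404. I tried the httr package and the functions suggested in similar posts, but could not resolve it. Am I doing something wrong?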

library(XML)
library(RCurl)
library(curl) 
library(rvest)
library(tidyverse)
library(dplyr)
library(httr) 

url <- "https://www.sgcarmart.com/new_cars/index.php"
cardetails <- read_html(url)

listing <- html_nodes(cardetails, "#nc_popular_car")
popularcars <- html_nodes(listing,".link")
count<-length(popularcars)


info <- data.frame(CarName=NA, Distributer=NA, Hotline= NA, CountryBuilt= NA, Predecessor= NA, stringsAsFactors = F )

for(i in 1:count)
{
  h <- popularcars[[i]]

details_url <- paste0("https://www.sgcarmart.com/new_cars",html_attr(h,"href"))

details <- read_html(details_url)

info[i,]$CarName <- html_node(details,".link_redbanner")

}

info

Answer 1

TL;DR

Add a slash:

  details_url <- paste0("https://www.sgcarmart.com/new_cars/",html_attr(h,"href"))
  # --->          --->          --->          --->         ^
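
If you would rather not depend on remembering the trailing slash, a small helper can normalize the join. This is only a sketch: join_url is a hypothetical name, not a function from rvest or base R.

```r
# Hypothetical helper (not part of rvest or base R): join a base URL and a
# relative path with exactly one slash between them.
join_url <- function(base, path) {
  base <- sub("/+$", "", base)   # drop any trailing slashes from the base
  path <- sub("^/+", "", path)   # drop any leading slashes from the path
  paste0(base, "/", path)
}

join_url("https://www.sgcarmart.com/new_cars", "newcars_overview.php?CarCode=12618")
# [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618"
```

It gives the same result whether or not the base already ends in a slash.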

The journey

I ran your code far enough to get popularcars, then looked at the first one:

h <- popularcars[[1]]
h
# {html_node}
# <a href="newcars_overview.php?CarCode=12618" class="link">
# [1] <div style="position:relative; padding-bottom:6px;">\r\n                                < ...
# [2] <div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
details_url <- paste0("https://www.sgcarmart.com/new_cars",html_attr(h,"href"))
details_url
# [1] "https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618"

Like you, I get a 404 for that URL.
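
When looping over many pages, one bad URL stops the whole run with this error. A possible guard, sketched under assumptions: try_read_html is a hypothetical wrapper, and its reader argument exists only so the behaviour can be shown without a network connection (the stand-in reader below fakes the failure).

```r
# Hypothetical wrapper: return NULL instead of stopping when a page cannot be
# read. `reader` defaults to xml2::read_html (the function rvest re-exports).
try_read_html <- function(url, reader = xml2::read_html) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration with a stand-in reader that always fails
failing_reader <- function(url) stop("HTTP error 404.")
page <- try_read_html(
  "https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618",
  reader = failing_reader
)
is.null(page)
# [1] TRUE
```

The message still reports which URL failed, which is what exposes the missing slash here.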

I navigated (in a boring, ordinary browser) to the main URL, viewed the page source, and searched for 12618:

<div style="padding:10px 10px 5px 10px;" id="nc_popular_car">
                                        <div class="floatleft" style="text-align:center;width:136px;padding-right:22px;">
                        <a href="newcars_overview.php?CarCode=12618" class="link">
                            <div style="position:relative; padding-bottom:6px;">
                                <div style="position:absolute; border:1px solid #B9B9B9; width:134px; height:88px;"><img src="https://i.i-sgcm.com/images/spacer.gif" width="1" height="1" alt="spacer" /></div>
                                <img src="https://i.i-sgcm.com/new_cars/cars/12618/12618_m.jpg" width="136" height="90" border="0" alt="Toyota Corolla Altis" />
                            </div>
                            <div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
                        </a>
                        <div style="padding-bottom:14px;" class="font_black">,888</div>
                    </div>

I right-clicked on the <a href="newcars_overview.php?CarCode=12618" class="link"> part, copied the link location, and found it to be:

https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618 <-- from the source
https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618  <-- from your code
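
The browser arrives at the first form by resolving the relative href against the URL of the page it appears on. As a sketch, xml2 (which rvest builds on) can do the same resolution with url_absolute(), so the base path never has to be typed by hand; page_url and href are just illustrative variable names.

```r
library(xml2)

page_url <- "https://www.sgcarmart.com/new_cars/index.php"
href     <- "newcars_overview.php?CarCode=12618"

# Resolve the relative href against the URL of the page it was found on
details_url <- url_absolute(href, page_url)
details_url
# [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618"
```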

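As an aside, you may find this easier to manage than a for loop. Building a data frame row by row is very inefficient; it is probably fine for the 18 entries I found here, but it does not scale well in the long run, so avoid it if you can.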

info <- lapply(popularcars, function(h) {
    details_url <- paste0("https://www.sgcarmart.com/new_cars/", html_attr(h,"href"))
    details <- read_html(details_url)
    html_text(html_node(details,".link_redbanner"))
  })

str(info)
# List of 18
#  $ : chr "Toyota Corolla Altis"
#  $ : chr "Hyundai Venue"
#  $ : chr "Hyundai Avante"
#  $ : chr "SKODA Octavia"
#  $ : chr "Honda Civic"
#  $ : chr "Mazda 3 Sedan Mild Hybrid"
#  $ : chr "Honda Jazz"
#  $ : chr "Kia Cerato"
#  $ : chr "Mazda CX-5"
#  $ : chr "Mercedes-Benz GLA-Class(Parallel Imported)"
#  $ : chr "Toyota Raize(Parallel Imported)"
#  $ : chr "Toyota Camry Hybrid(Parallel Imported)"
#  $ : chr "Mercedes-Benz A-Class Hatchback(Parallel Imported)"
#  $ : chr "Mercedes-Benz A-Class Saloon(Parallel Imported)"
#  $ : chr "Honda Fit(Parallel Imported)"
#  $ : chr "Mercedes-Benz C-Class Saloon(Parallel Imported)"
#  $ : chr "Mercedes-Benz CLA-Class(Parallel Imported)"
#  $ : chr "Honda Freed Hybrid(Parallel Imported)"
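
If you still want the result as a data frame, as in your original code, the list can be converted in one step after the loop. A minimal sketch, using a three-element stand-in for the list above so that it runs without touching the site:

```r
# Stand-in for the first few elements of the list returned by lapply() above
info <- list("Toyota Corolla Altis", "Hyundai Venue", "Hyundai Avante")

# vapply() also checks that every element is a single character string
cars <- data.frame(
  CarName = vapply(info, identity, character(1)),
  stringsAsFactors = FALSE
)
nrow(cars)
# [1] 3
```

Further columns could be added the same way, one vector per field, rather than filling the frame cell by cell.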

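One last note: while this is a worthwhile effort for learning, the website's Terms of Service state clearly: "You agree that you will not: ... engage in mass automated, systematic or any form of extraction of the material ("Content") on our Website". I am assuming your efforts stay within this restriction.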
