How to treat spaces in extracted links (rvest)

Question

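I extracted hyperlinks from a website and want to scrape them further. The links contain spaces, which would normally be encoded as %20, so requesting them gives me a 404 error. These are the hyperlinks, which I saved in the variable 'url':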

[1] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html"
[2] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L70101HR1963PLC002484.html"
[3] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L65910MH1986PLC165645.html"
[4] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U72200KA2002PLC030310.html"

This is the code that fails:

map_df(url, function(i){ 
  page <- read_html(i)%>%
    html_nodes("table") %>%
    html_table(fill = TRUE)})

This is the error I get:

 Error in open.connection(x, "rb") : HTTP error 404.

Answer 1

You can simply replace the spaces with "%20":

tablist <- map(gsub(" ", "%20", url), function(i){ 
  read_html(i) %>%
  html_nodes("table")
})

This results in:

tablist
#> [[1]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> 
#> [[2]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> 
#> [[3]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> 
#> [[4]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...

Unfortunately, the tables on these pages all appear to be empty, so you can't call html_table on the result.
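
If some links may still fail after encoding, the same idea can be wrapped so that one bad request does not abort the whole run. This is only a minimal sketch: the helper names read_tables, safe_read_tables and scrape_tables are hypothetical and not part of the original answer, and it assumes the rvest and purrr packages are installed.

```r
library(rvest)  # read_html(), html_nodes(), and the %>% pipe
library(purrr)  # map(), possibly(), compact()

# Hypothetical helper: read one page and return its <table> nodes.
read_tables <- function(link) {
  read_html(link) %>%
    html_nodes("table")
}

# Same helper, but returning NULL instead of raising an error
# (for example on an HTTP 404).
safe_read_tables <- possibly(read_tables, otherwise = NULL)

# Hypothetical helper: encode the spaces, fetch every link, and keep
# only the links that produced at least one table node.
scrape_tables <- function(links) {
  encoded <- gsub(" ", "%20", links, fixed = TRUE)
  results <- map(encoded, safe_read_tables)
  names(results) <- links
  compact(results)  # drops NULL (failed) and zero-length (no tables) entries
}
```

Calling scrape_tables(url) should then give a named list like tablist above, just without entries for links that could not be read.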

Answer 2

A more general approach is to use the URLencode() function. It also replaces other special characters in the URL so that the URL is valid:

urls <- 
  c(
    "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html",
    "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L70101HR1963PLC002484.html",
    "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L65910MH1986PLC165645.html",
    "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U72200KA2002PLC030310.html"
  )

encoded_urls <- lapply(urls, function(url) URLencode(url))
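
To see how this relates to the gsub() approach, the two can be compared on a single link. This is a small sketch using only base R; the variable names are made up for illustration, and the comments describe the behaviour of URLencode() with its default arguments as I understand it.

```r
# One of the links from the question, containing a raw space.
link <- "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html"

# Approach from the first answer: replace spaces only.
encoded_gsub <- gsub(" ", "%20", link, fixed = TRUE)

# Approach from this answer: utils::URLencode() with reserved = FALSE (the default).
# Reserved characters such as ?, & and = are left alone, so the URL structure is kept.
encoded_url <- URLencode(link)

# For this link both approaches should produce the same string.
same_result <- identical(encoded_gsub, encoded_url)

# URLdecode() reverses the encoding.
round_trip <- identical(URLdecode(encoded_url), link)

# To encode a single query value rather than a whole URL,
# reserved = TRUE also escapes the reserved characters.
encoded_value <- URLencode("FY 2014-15", reserved = TRUE)
```

Note that with the default repeated = FALSE, a URL that already looks encoded is returned unchanged, so calling URLencode() twice should not double-encode the %20.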

r

web-scraping

rvest