How to treat spaces in extracted links (rvest)
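I extracted hyperlinks from a website and want to scrape those links further. The links contain spaces that would normally be replaced with %20, so I get a 404 error. These are the hyperlinks, stored in the variable 'url':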
[1] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html"
[2] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L70101HR1963PLC002484.html"
[3] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L65910MH1986PLC165645.html"
[4] "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U72200KA2002PLC030310.html"
This is the code that fails:
map_df(url, function(i) {
  page <- read_html(i) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
})
This is the error I get:
Error in open.connection(x, "rb") : HTTP error 404.
You can simply replace the spaces with "%20":
tablist <- map(gsub(" ", "%20", url), function(i) {
  read_html(i) %>%
    html_nodes("table")
})
This results in:
tablist
#> [[1]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#>
#> [[2]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#>
#> [[3]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
#>
#> [[4]]
#> {xml_nodeset (6)}
#> [1] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [2] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [3] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [4] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [5] <table id="employee_data" class="table table-striped table-bordered" wid ...
#> [6] <table id="employee_data" class="table table-striped table-bordered" wid ...
Unfortunately, the tables on these pages all appear to be empty, so you cannot call html_table on the result.
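The space replacement above can be checked offline before any page is fetched. The following is a minimal sketch using one of the URLs from the question; the name fixed_url is only illustrative, and fixed = TRUE is an optional addition that makes gsub() treat the pattern as a literal string rather than a regular expression.

```r
# One of the URLs from the question, still containing a raw space
url <- "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html"

# Replace every literal space with its percent-encoded form
fixed_url <- gsub(" ", "%20", url, fixed = TRUE)

print(fixed_url)
# "https://csr.gov.in/companyprofile.php?year=FY%202014-15&CIN=U18109MH2006PLC262077.html"

# Confirm that no raw spaces remain
print(any(grepl(" ", fixed_url, fixed = TRUE)))
# FALSE
```

gsub() is vectorised, so the same call works unchanged on the full character vector of links.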
A more general approach is to use the URLencode() function. It also replaces other special characters in the URL so that the URL is valid:
urls <- c(
  "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html",
  "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L70101HR1963PLC002484.html",
  "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=L65910MH1986PLC165645.html",
  "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U72200KA2002PLC030310.html"
)
# Percent-encode each URL; vapply() returns a plain character vector
encoded_urls <- vapply(urls, URLencode, character(1), USE.NAMES = FALSE)
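As a quick offline check, the sketch below compares URLencode() with the plain space replacement on one URL from the question. The variable names are illustrative. It also shows a caveat of the default repeated = FALSE: a string that already contains something that looks like a %XX escape is returned unchanged, so a partly encoded URL with leftover spaces would not be fixed by this call.

```r
# One of the URLs from the question, still containing a raw space
raw_url <- "https://csr.gov.in/companyprofile.php?year=FY 2014-15&CIN=U18109MH2006PLC262077.html"

# With the default reserved = FALSE, characters such as ? & = : / are kept,
# and only characters that are invalid in a URL (here, the space) are encoded
encoded <- URLencode(raw_url)
print(encoded)
# "https://csr.gov.in/companyprofile.php?year=FY%202014-15&CIN=U18109MH2006PLC262077.html"

# For these particular URLs the result matches the simple replacement
same_as_gsub <- identical(encoded, gsub(" ", "%20", raw_url, fixed = TRUE))
print(same_as_gsub)
# TRUE

# Caveat: an input that already contains a %XX escape is left untouched
# when repeated = FALSE (the default), including any remaining spaces
partly_encoded <- "year=FY%202014-15 extra"   # hypothetical input
untouched <- URLencode(partly_encoded)
print(identical(untouched, partly_encoded))
# TRUE
```

If the links might be partly encoded already, replacing the spaces explicitly is the more predictable choice.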