Extracting a list of links from a webpage by using its class
I am trying to extract a list of four links from this website. The links are explicitly named:
PNADC_012018_20190729.zip
PNADC_022018_20190729.zip
PNADC_032018_20190729.zip
PNADC_042018_20190729.zip
I can see that they all belong to a class called 'jstree-wholerow'. I'm not very experienced with scraping, but I tried to use this pattern to capture the links:
x <- rvest::read_html('https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018') %>%
rvest::html_nodes("jstree-wholerow") %>%
rvest::html_text()
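However, I get an empty vector as output.
Can anyone help me solve this?
Although the web page uses JavaScript, the files themselves are stored on an FTP server, and the directory names there are very predictable.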
library(tidyverse)
library(stringr)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(RCurl)
#>
#> Attaching package: 'RCurl'
#> The following object is masked from 'package:tidyr':
#>
#> complete
link <- 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip'
zip_names <- c('PNADC_012018_20190729.zip', 'PNADC_022018_20190729.zip', 'PNADC_032018_20190729.zip', 'PNADC_042018_20190729.zip')
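# Replace everything from '/2018' onward with each file name in turn (note the doubled backslash in the regex)
links <- str_replace(link, '/2018.*\\.zip$', str_c('/2018/', zip_names))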
links
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
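Since the four file names differ only in the two-digit quarter, the `zip_names` vector could also be generated instead of typed out. This is a minimal sketch, assuming every quarter of 2018 shares the `20190729` suffix shown in the listing; `base_url` and `quarters` are names introduced here for illustration.

```r
# Assumption: all four 2018 files end in the same date suffix, "20190729".
base_url <- 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/'
quarters <- 1:4

# %02d zero-pads the quarter number to two digits: 1 becomes "01", and so on.
zip_names <- sprintf('PNADC_%02d2018_20190729.zip', quarters)

# Prepend the directory URL to each file name.
links <- paste0(base_url, zip_names)
```

If the suffix differs for other years, this shortcut would not carry over, and reading the directory listing (option 2) is the safer route.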
# Option 2: read the file names from the FTP directory listing
links <- RCurl::getURL(url = 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/') %>% read_html() %>%
html_nodes(xpath = '//td/a[@href]') %>% html_attr('href')
# Drop the first href, which is not one of the data files
links <- links[-1]
links
#> [1] "PNADC_012018_20190729.zip" "PNADC_022018_20190729.zip"
#> [3] "PNADC_032018_20190729.zip" "PNADC_042018_20190729.zip"
str_c('https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/', links)
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
Created on 2021-06-11 by the reprex package (v2.0.0)
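As for why the original attempt returned an empty vector: the tree is built by JavaScript, so it is probably absent from the HTML that `read_html()` downloads. Separately, a CSS class selector needs a leading dot. The sketch below shows the selector difference on a small inline snippet. The HTML is a hypothetical stand-in, not the real page's markup.

```r
library(rvest)

# Hypothetical static HTML standing in for an already-rendered page.
page <- read_html('<html><body>
  <div class="jstree-wholerow">PNADC_012018_20190729.zip</div>
  <div class="jstree-wholerow">PNADC_022018_20190729.zip</div>
</body></html>')

# Without a dot, the selector looks for <jstree-wholerow> elements and finds none.
by_tag <- html_text(html_nodes(page, "jstree-wholerow"))

# With a dot, it matches elements whose class attribute is "jstree-wholerow".
by_class <- html_text(html_nodes(page, ".jstree-wholerow"))
```

Here `by_tag` is empty while `by_class` holds the two file names. On the live page, even the dotted selector may return nothing if the nodes only exist after the browser runs the scripts, which is why going to the FTP listing directly is the more dependable approach.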