RVEST

Question

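I want to extract the values from the table in the top right corner of this web page:

https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima

(Wärmster Monat: VALUE, Kältester Monat: VALUE, Jahresniederschlag: VALUE)

Unfortunately, if I use html_nodes("Selectorgadgets result for the specific value"), I get the values of the table at the top of this link instead:

https://www.timeanddate.de/stadt/info/deutschland/karlsruhe

(The two pages are similar: if you click "Uhrzeit/Übersicht" in the top bar, you reach the second page and its table; if you click "Wetter" --> "Klima", you reach the first page and table, which is the one I want to extract the values from!)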


library(rvest)

num_link <- "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"

num_page <- read_html(num_link)

rain_year <- num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()

temp_warm <- num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()

# Note: this is the same selector as temp_warm
temp_cold <- num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()

I get "character (empty)" for each variable. :(

Thanks in advance!

Answer 1

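You can use the html_table function from rvest, which has become quite good by now. It makes the extraction easier, but I do recommend that you also learn to identify the right CSS selectors, because html_table does not always work. html_table always returns a list containing all the tables in the web page, so in this case the steps are:

get the html
get the tables
index the right table (here there is only one)
reformat a little to extract the values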

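library(rvest)
library(tidyverse)

# Read the page, take the first table, and pull the first number out of each value
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
  html_table() %>%
  .[[1]] %>%
  rename('measurement' = 1,
         'original' = 2) %>%
  # "\\." matches a literal dot; a single backslash is an invalid escape in an R string.
  # str_extract() returns one match per row, so the result always fits the column.
  mutate(value_num = str_extract(original, "[[:digit:]]+\\.*[[:digit:]]*"))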
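
To see what html_table returns without hitting the live site, here is a minimal offline sketch. The HTML, the ids "overview" and "climate", the labels, and the numbers are all made up for illustration and are not taken from timeanddate.de. It assumes rvest 1.0 or later (for minimal_html and html_element) and shows two ways to pick a table, by position and by CSS selector, plus a number extraction that also tolerates a decimal comma.

```r
library(rvest)

# Hypothetical HTML standing in for a page with two tables; ids, labels,
# and numbers are invented for this example.
page <- minimal_html('
  <table id="overview">
    <tr><th>Label</th><th>Value</th></tr>
    <tr><td>Country</td><td>Germany</td></tr>
  </table>
  <table id="climate">
    <tr><th>Label</th><th>Value</th></tr>
    <tr><td>Warmest month</td><td>July: 20,5 C</td></tr>
    <tr><td>Annual precipitation</td><td>700,2 mm</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one tibble per <table>.
tables <- html_table(page)
length(tables)

# Option 1: pick the table by its position in that list.
by_index <- tables[[2]]

# Option 2: pick it with a CSS selector, which does not depend on table order.
by_selector <- page %>% html_element("#climate") %>% html_table()

# Take the first number from each value. The pattern accepts "." or "," as
# the decimal mark; the comma is replaced before converting to numeric.
raw_num <- stringr::str_extract(by_selector$Value, "[0-9]+(?:[.,][0-9]+)?")
value_num <- as.numeric(sub(",", ".", raw_num, fixed = TRUE))
value_num
```

Selecting by position is shorter, but it breaks silently if the page gains or loses a table; the selector version is more robust as long as the id or class stays stable.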

RVEST - Extracting text from table - Problems with access to the right table

html

r

web-scraping