rvest: table with thead and tbody tags

Question

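I am slowly learning web scraping with rvest. I would like to scrape the table at https://novostavby.com/cs/developery/. I am mainly interested in the first column, but I would not mind getting the whole table.

I tried two possible approaches. The simplest one parses only the header: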

library(rvest)

url <- 'https://novostavby.com/cs/developery/'
read_html(url) %>%
  html_nodes('table') %>%
  html_table()

Next I tried:

webpage <- read_html('https://novostavby.com/cs/developery/')
html_nodes(webpage, 'table') %>% html_nodes('.type')

But it only returned the header (I don't know why...).

Thanks for your help!

Answer 1

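Your url points to an html page that contains an empty table. The reason you can see the table's contents in a web browser is that the html instructs your browser to download the contents from a different url and insert them into the empty table. rvest, of course, only reads the html of the first page and does not run the javascript that loads the table data.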
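To see why only the header comes back, here is a minimal offline sketch. The html string is a made-up stand-in for what the server is assumed to send before any javascript runs (the column names are hypothetical); it is not the real page.

```r
library(rvest)

# Hypothetical stand-in for the page as served, before any javascript runs:
# the header row is present but <tbody> is empty.
static_html <- '
<table>
  <thead><tr><th>Developer</th><th>City</th></tr></thead>
  <tbody></tbody>
</table>'

tbl <- read_html(static_html) %>%
  html_node("table") %>%
  html_table()

names(tbl)  # column names are taken from the <thead> row
nrow(tbl)   # number of data rows found inside <tbody>
```

With an empty tbody there is nothing for html_table() to return beyond the column names, which matches what you observed.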

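In your case, the data is loaded from another url that returns JSON. It is possible to insert its contents into the original html and then use rvest to get your table. This is effectively doing by hand what your browser does for you.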

library(httr)
library(magrittr)
library(rvest)

# Get the page's html as text
url <- 'https://novostavby.com/cs/developery/'
original_page <- GET(url) %>% content("text", encoding = "UTF-8")

# Get the JSON as plain text from the link generated by Javascript on the page
json_url <- "https://novostavby.com/ajax-estatio-developers/?citypath=undefined&sortdir=asc&sortfield=title&search=&pagefrom=developers"
JSON <- GET(json_url) %>% content("text", encoding = "UTF-8")

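# Undo the JSON escaping and strip the enclosing braces / "html" key
# to get the html contents. fixed = TRUE makes gsub match the literal
# two-character sequences \n, \/ and \" instead of treating them as regex.
table_contents <- JSON %>%
  {gsub("\\n", "\n", ., fixed = TRUE)}  %>%
  {gsub("\\/", "/", ., fixed = TRUE)}   %>%
  {gsub("\\\"", "\"", ., fixed = TRUE)} %>%
  strsplit("html\":\"")    %>%
  unlist                   %>%
  extract(2)               %>%   # keep the part after the "html" key
  substr(1, nchar(.) - 2)  %>%   # drop the trailing "}
  paste0("</tbody>")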

# insert the table html into the original page
new_page <- gsub("</tbody>", table_contents, original_page)

# Now you can read the table with rvest
read_html(new_page)   %>%
  html_nodes("table") %>%
  html_table()

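This gives you the table you want. The only problem is that all non-ASCII characters show up as unicode escapes, e.g. u00de. You will need to gsub them to their equivalent characters.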
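As a possible alternative to unescaping by hand, a JSON parser decodes all escape sequences (including the unicode ones) in one step. The sketch below uses jsonlite, which is a swapped-in technique and not part of the code above. It assumes the response is a JSON object with an html field holding the table rows; the sample JSON string and the column names are made up so the example runs offline.

```r
library(jsonlite)
library(rvest)

# Hypothetical response body using the escaping the real endpoint is
# assumed to use: "\/" for "/", "\n" for newlines, "\uXXXX" for non-ASCII.
json_text <- '{"html":"<tr><td>Stavby \\u010cech<\\/td><td>Praha<\\/td><\\/tr>\\n"}'

# fromJSON() decodes every JSON escape sequence while parsing
rows_html <- fromJSON(json_text)$html

# Wrap the rows in a table skeleton (made-up column names) and parse it
tbl <- paste0(
  "<table><thead><tr><th>Developer</th><th>City</th></tr></thead><tbody>",
  rows_html,
  "</tbody></table>"
) %>%
  read_html() %>%
  html_node("table") %>%
  html_table()

tbl[[1]]  # first column only
```

If the real endpoint behaves as assumed, replacing json_text with the JSON text downloaded above should give the rows with their non-ASCII characters already decoded, so no further gsub would be needed.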
