Scraping Wikipedia data which looks like a table but is not actually a table

Question

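I am trying to scrape some data from Wikipedia. The data I want to collect is the # of cases and # of deaths from the first "table" on the Wikipedia page. Usually I would get the xpath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic. If I look at one of the collapsibles (date 2020-04-04) I get: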

<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl"></div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl"></div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl"></div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>

The data is here: 8359, 14825 and 284692, along with the # of cases (307,876) and the # of deaths (8,359). I am trying to extract these numbers for each day.

Code:

library(rvest)  # provides read_html(), html_node(), html_table() and the %>% pipe

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"

url %>% 
  read_html() %>% 
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>% 
  html_table(fill = TRUE)

Answer 1

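You can use nth-child to target the individual columns. To get the right number of rows in each column, it helps to use a CSS attribute selector with the starts-with operator (^=), which matches the relevant id attribute by the leading substring of its value.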

library(rvest)
library(tidyverse)
library(stringr)

p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')

# Rows of interest have an id starting with "mw-customcollapsible-";
# td:nth-child(n) then picks one cell per row (1 = date, 3 = cases, 4 = deaths).
covid_info <- tibble(
  dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>% 
  mutate(
    # Strip thousands separators, then keep the text before the "(" of the percentage.
    # The parenthesis must be escaped as \\( inside an R string.
    case_numbers = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = replace_na(str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer(), NA_integer_)
  )

print(covid_info)
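
The question also asks about the numbers stored in the graphic, i.e. the title attributes of the div.bb-fl bars. A minimal sketch of one way to get them, run offline against the single row quoted in the question (inline styles dropped, and wrapped in a table element so the parser keeps the row structure). The helper name parse_row and the column names bar_1 to bar_3 are made up for illustration, and the live page layout may differ from the quoted snippet:

```r
library(rvest)

# The row from the question, wrapped in <table> so <tr>/<td> survive parsing.
html <- '<table><tbody>
<tr class="mw-collapsible mw-collapsed" id="mw-customcollapsible-apr">
<td colspan="2" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" class="bb-fl"></div>
<div title="14825" class="bb-fl"></div>
<div title="284692" class="bb-fl"></div>
</td>
<td class="bb-04em"><span class="cbs-ibr">307,876</span><span class="cbs-ibl">(+12%)</span></td>
<td class="bb-04em"><span class="cbs-ibr">8,359</span><span class="cbs-ibl">(+19%)</span></td>
</tr>
</tbody></table>'

page <- read_html(html)

# Same starts-with selector as in the answer, quoted for safety.
rows <- html_nodes(page, "[id^='mw-customcollapsible-']")

# Remove thousands separators and convert to integer.
to_int <- function(x) as.integer(gsub(",", "", x))

# Hypothetical helper: turn one <tr> into a one-row data frame.
parse_row <- function(row) {
  # The bar values live in the title attribute of each div.bb-fl.
  bars <- as.integer(html_attr(html_nodes(row, "div.bb-fl"), "title"))
  data.frame(
    date   = as.Date(html_text(html_node(row, "td:nth-child(1)"), trim = TRUE)),
    bar_1  = bars[1],
    bar_2  = bars[2],
    bar_3  = bars[3],
    # span.cbs-ibr holds the count; span.cbs-ibl holds the percentage.
    cases  = to_int(html_text(html_node(row, "td:nth-child(3) span.cbs-ibr"))),
    deaths = to_int(html_text(html_node(row, "td:nth-child(4) span.cbs-ibr")))
  )
}

# Process row by row so each row's bars stay together.
result <- do.call(rbind, lapply(rows, parse_row))
print(result)
```

Going row by row keeps the three bar values attached to their date, which a single page-wide html_nodes() call would not guarantee if some rows have fewer bars. In the quoted row the three bar values add up to the case total (8359 + 14825 + 284692 = 307876) and the first one equals the death count, which gives a quick sanity check when running this against the real page with read_html() on the URL.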

Tags: html, r, web-scraping, rvest