Find html table name and scrape in R

Question

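I am trying to scrape a table from a web page that contains several tables. I want the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right approach, but when I try the following I get an empty list and a warning: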

library(XML)

url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = TRUE, stringsAsFactors = FALSE)

named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'

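Of course, this is not surprising, since I have not given the function any indication of which table I want to read. I have dug around in "Inspect" for quite a while, but I am not connecting the dots on how to be more precise. The table does not seem to have a name or class like the ones in other examples I have found in the documentation or on SO. Thoughts?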

Answer 1

Another solution, using rvest instead of XML:

library(rvest)

# html_table() returns one data frame per <table>; [[1]] keeps the first
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>%
  .[[1]]
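If the table you want is not the first one, you can select it by position or by a CSS selector before converting it. Below is a minimal sketch, assuming rvest 1.0 or later (for html_elements(), html_element(), and the convert argument). The inline HTML, the id "states", and the class "areas" are made up for illustration and are not taken from the Census page. By default html_table() tries to convert column types, so codes such as 01 may come back as numbers; convert = FALSE keeps them as text.

```r
library(rvest)

# Hypothetical stand-in page with two tables (not the real Census markup).
page <- read_html('
<html><body>
  <table id="states">
    <tr><th>Name</th><th>FIPS</th><th>USPS</th></tr>
    <tr><td>Alabama</td><td>01</td><td>AL</td></tr>
    <tr><td>Alaska</td><td>02</td><td>AK</td></tr>
  </table>
  <table class="areas">
    <tr><th>Area</th><th>FIPS</th></tr>
    <tr><td>Guam</td><td>66</td></tr>
  </table>
</body></html>')

# Every <table> node, in document order.
tables <- html_elements(page, "table")

# Option 1: pick a table by its position on the page.
by_position <- html_table(tables[[1]], convert = FALSE)

# Option 2: pick a table by a CSS selector (here, the made-up id "states").
by_selector <- page %>%
  html_element("table#states") %>%
  html_table(convert = FALSE)
```

Selecting by a selector tends to survive page changes better than a positional index, but it only works when the table actually carries an id or class.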

Answer 2

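Consider using readLines() to fetch the HTML page content and passing the result to readHTMLTable(). The warning most likely comes from the XML package not being able to fetch https URLs by itself, not from a missing table name:

library(XML)

url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)   # download the page as a character vector

readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)         # LIST OF 3 TABLES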

# $`NULL`
#                    Name FIPS State Numeric Code Official USPS Code
# 1               Alabama                      01                 AL
# 2                Alaska                      02                 AK
# 3               Arizona                      04                 AZ
# 4              Arkansas                      05                 AR
# 5            California                      06                 CA
# 6              Colorado                      08                 CO
# 7           Connecticut                      09                 CT
# 8              Delaware                      10                 DE
# 9  District of Columbia                      11                 DC
# 10              Florida                      12                 FL
# 11              Georgia                      13                 GA
# 12               Hawaii                      15                 HI
# 13                Idaho                      16                 ID
# 14             Illinois                      17                 IL
# ...

To return one specific data frame, index into the list:

fipsdf <- readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)[[1]]
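Since the tables come back unnamed, indexing with [[1]] depends on the order of tables on the page. One alternative is to pick the table by its column headers. This is only a sketch: the list below is a hand-built stand-in for what readHTMLTable() returns, and pick_table() is a hypothetical helper, not part of the XML package.

```r
# Hypothetical stand-in for the list of data frames returned by readHTMLTable().
tables <- list(
  data.frame(`Area Name` = "Guam",
             `FIPS State Numeric Code` = "66",
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(Name = c("Alabama", "Alaska"),
             `FIPS State Numeric Code` = c("01", "02"),
             `Official USPS Code` = c("AL", "AK"),
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Return the first table whose header contains every wanted column name.
pick_table <- function(tables, wanted) {
  hit <- vapply(tables, function(tbl) all(wanted %in% names(tbl)), logical(1))
  if (!any(hit)) {
    stop("no table has columns: ", paste(wanted, collapse = ", "))
  }
  tables[[which(hit)[1]]]
}

fipsdf <- pick_table(tables, c("Name", "Official USPS Code"))
```

With the real page you would pass the result of readHTMLTable(webpage, ...) instead of the stand-in list.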

