Web scraping of tables generated using JavaScript

Question

I am trying to scrape the table (the large table filled with x's and .'s) from the codes tab of this website.

I thought one of the following would do the trick...

library(rvest)
library(tidyverse)
"https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section" %>%
  read_html() %>%
  html_table()

"https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section" %>%
  read_html() %>%
  html_nodes(".variablesList , #ui-id-1")

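...but neither returns anything useful. I looked at the page source, and I think the site uses some JavaScript to generate the table. Does that mean the table cannot be obtained?

Note: I cannot install RSelenium on my office PC.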

Answer 1

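I didn't see a robots.txt or any terms and conditions, but I did read through the (quite daunting) "APPLICATION TO USE RESTRICTED MICRODATA" (I had forgotten that I have an account with access to IPUMS, though I don't recall ever using it). I'm impressed by how much effort they put into conveying the potentially sensitive nature of their data up front, before any download.

Since there is no "microdata" in this metadata (the metadata appears to be provided to help people decide which data elements to select), and since obtaining and using it doesn't violate any of the stated restrictions, the following should be OK. If a representative of IPUMS sees this and disagrees, I'll gladly remove the answer and ask the SO admins to really delete it too (for those who aren't aware, users with high enough reputation can see deleted answers).

Now, you don't need Selenium or Splash for this, but you will need to do some post-processing of the data retrieved by the code below.

The data that builds the metadata tables sits in a JavaScript blob inside a <script> tag (use "View Source" to see it; you're going to need it later). We can use some string munging and the V8 package to get at it: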

library(V8)
library(rvest)
library(jsonlite)
library(stringi)

pg <- read_html("https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section")

html_nodes(pg, xpath=".//script[contains(., 'Less than')]") %>% 
  html_text() %>% 
  stri_split_lines() %>% 
  .[[1]] -> js_lines

idx <- which(stri_detect_fixed(js_lines, '$(document).ready(function() {')) - 1

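That finds the target <script> element, gets its contents, splits them into lines, and locates the first line that isn't data. We can only pull out the JavaScript that defines the data, because the V8 engine in R isn't a full browser and can't execute the jQuery code that follows it.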
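
The slicing step can be tried offline on a toy string. The sketch below uses only base R; the content of script_text is a made-up stand-in for the real <script> body, not what the IPUMS page actually contains, and renderTable is a hypothetical name.

```r
# Hypothetical stand-in for the text of the target <script> element
script_text <- paste(
  'var codeData = {"jsonPath": "/demo"};',
  '$(document).ready(function() {',
  '  renderTable(codeData);',
  '});',
  sep = "\n"
)

# Split into lines (base-R equivalent of stri_split_lines() for "\n" endings)
js_lines <- strsplit(script_text, "\n", fixed = TRUE)[[1]]

# Index of the last line before the jQuery ready handler starts
marker <- "$(document).ready(function() {"
idx <- which(grepl(marker, js_lines, fixed = TRUE))[1] - 1

# Keep only the data-defining lines; this is what would be handed to V8
data_js <- paste0(js_lines[seq_len(idx)], collapse = "\n")
data_js
```

Taking the first match with [1] guards against the marker appearing more than once. If the marker is missing, idx is NA, so it is worth checking for that before slicing.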

We now create a "V8 context", extract the code, execute it in that context, and retrieve the result:

ctx <- v8()

ctx$eval(paste0(js_lines[1:idx], collapse="\n"))

code_data <- ctx$get("codeData")

str(code_data)
## List of 14
##  $ jsonPath                  : chr "/international-action/frequencies/MIGYRSBR"
##  $ samples                   :'data.frame': 6 obs. of  2 variables:
##   ..$ name: chr [1:6] "br1960a" "br1970a" "br1980a" "br1991a" ...
##   ..$ id  : int [1:6] 2416 2417 2418 2419 2420 2651
##  $ categories                :'data.frame': 100 obs. of  5 variables:
##   ..$ id     : int [1:100] 4725113 4725114 4725115 4725116 4725117 4725118 4725119 4725120 4725121 4725122 ...
##   ..$ label  : chr [1:100] "Less than 1 year" "1" "2" "3" ...
##   ..$ indent : int [1:100] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ code   : chr [1:100] "00" "01" "02" "03" ...
##   ..$ general: logi [1:100] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longSamplesHeader         : chr "<tr class=\"fullHeader grayHeader\">\n\n          <th class=\"codesColumn\">Code</th>\n          <th class=\"la"| __truncated__
##  $ samplesHeader             : chr "\n<tr class=\"fullHeader grayHeader\">\n      <th class=\"codesColumn\">Code</th>\n      <th class=\"labelColum"| __truncated__
##  $ showCounts                : logi FALSE
##  $ generalWidth              : int 2
##  $ width                     : int 2
##  $ interval                  : int 25
##  $ isGeneral                 : logi FALSE
##  $ frequencyType             : NULL
##  $ project_uses_survey_groups: logi FALSE
##  $ variables_show_tab_1      : chr ""
##  $ header_type               : chr "short"
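
To see what ctx$eval() and ctx$get() do without touching the site, here is a small sketch with a made-up codeData object. The field values are invented; only the names jsonPath and samples mirror the output above.

```r
library(V8)

ctx <- v8()

# Define a JavaScript variable inside the context (toy data, not from IPUMS)
ctx$eval('var codeData = {jsonPath: "/demo/path", samples: [{name: "s1", id: 1}, {name: "s2", id: 2}]};')

# Pull the variable back into R; V8 converts it via JSON
toy <- ctx$get("codeData")

toy$jsonPath      # character scalar
toy$samples$name  # an array of objects arrives as a data frame
```

This is why samples and categories show up as data frames in the str() output: a JavaScript array of uniform objects is simplified into one row per object.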

The jsonPath component suggests that the page pulls in more data when building the codes and frequencies table, so we can fetch that too:

code_json <- fromJSON(sprintf("https://international.ipums.org%s", code_data$jsonPath))

str(code_json, 1)
## List of 6
##  $ 2416:List of 100
##  $ 2417:List of 100
##  $ 2418:List of 100
##  $ 2419:List of 100
##  $ 2420:List of 100
##  $ 2651:List of 100

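Each of those "Lists of 100" holds 100 numbers.

You'll need to look at the code in "View Source" (as suggested above) to work out how to use these two pieces of data to re-create the metadata table.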
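
As a rough starting point, one way the two pieces might fit together is sketched below on toy data. It rests on an unverified assumption: that the i-th element of each per-sample list lines up with the i-th row of categories, and that the list names are the sample ids. The numbers are invented, and what the real values mean should be confirmed against the page's own JavaScript.

```r
# Toy stand-ins shaped like code_data$categories, code_data$samples and code_json
categories <- data.frame(
  code  = c("00", "01", "02"),
  label = c("Less than 1 year", "1", "2"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  name = c("br1960a", "br1970a"),
  id   = c(2416L, 2417L),
  stringsAsFactors = FALSE
)
code_json <- list(`2416` = list(10, 0, 5), `2417` = list(0, 3, 7))  # made-up values

# One column per sample; vapply() fails loudly if a list does not have
# exactly one value per category row (ASSUMED alignment, not verified)
freq <- vapply(
  as.character(samples$id),
  function(id) as.numeric(unlist(code_json[[id]])),
  numeric(nrow(categories))
)
colnames(freq) <- samples$name

# Codes and labels on the left, one column per sample on the right
tbl <- cbind(categories[c("code", "label")], freq)
tbl
```

Note that unlist() silently drops NULL entries, so if the real JSON contains nulls the length check in vapply() will stop with an error rather than misalign the rows.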

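I do think you'd be better off following the path @alistaire started you on, but follow it all the way through. I didn't see any questions in the forum (http://answers.popdata.org/) about obtaining these "codes and frequencies" or this kind of "metadata", and I read in at least five places that IPUMS staff read and answer questions both in the forum and at their info email address: ipums@umn.edu.

They obviously have this metadata somewhere in electronic form and could likely give you a complete dump of it across all data products, which would avoid further scraping (my guess is that this is your goal, since I can't imagine a scenario where someone would go to this much effort for a single extract).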

Answer 2

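See the comments above about scraping, but if it helps, we just released the ipumsr package, which makes it a little easier to work with IPUMS metadata in R.

If you create an extract that includes MIGYRSBR and then download the DDI (which is available even before the full microdata is ready), you can get the codes table with the following: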

# install.packages("ipumsr")
library(ipumsr)
ddi <- read_ipums_ddi("ipumsi_00020.xml")

ipums_val_labels(ddi, "MIGYRSBR")
#> # A tibble: 7 x 2
#>     val                              lbl
#>   <dbl>                            <chr>
#> 1     0                 Less than 1 year
#> 2     6 6 (6 to 10 1960-70, 6 to 9 1980)
#> 3    10                    10 (10+ 1980)
#> 4    11                 11 (11+ 1960-70)
#> 5    97                              97+
#> 6    98                          Unknown
#> 7    99            NIU (not in universe)

Alternatively, you can load the full dataset, and the value labels will be attached as vectors of class labelled (from haven). See the value-labels vignette for details.

data <- read_ipums_micro(ddi, verbose = FALSE)
data$MIGYRSBR <- as_factor(data$MIGYRSBR)

table(data$MIGYRSBR)
#> 
#>                 Less than 1 year                                1 
#>                           123862                            65529 
#>                                2                                3 
#>                            77190                            59908 
#>                                4                                5 
#>                            44748                            49590 
#> 6 (6 to 10 1960-70, 6 to 9 1980)                    10 (10+ 1980) 
#>                           185220                                0 
#>                 11 (11+ 1960-70)                              97+ 
#>                           318097                                0 
#>                          Unknown            NIU (not in universe) 
#>                             6459                          2070836

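Note that the DDI alone doesn't provide the availability/frequency information shown on the website; you would need to calculate that from the data.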
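
A minimal sketch of that calculation, on toy data and in base R. It assumes the extract contains a sample identifier column, here called SAMPLE, and it uses "x" and "." only to mimic the markers described in the question; the counts are invented.

```r
# Toy stand-in for an extract: one row per person (invented values)
toy <- data.frame(
  SAMPLE   = c("br1960a", "br1960a", "br1970a", "br1970a", "br1970a"),
  MIGYRSBR = factor(c("1", "2", "1", "1", "Unknown"),
                    levels = c("1", "2", "Unknown"))
)

# Frequencies: categories in rows, samples in columns
freq <- table(toy$MIGYRSBR, toy$SAMPLE)

# Availability: "x" where a category occurs in a sample, "." where it does not
avail <- ifelse(unclass(freq) > 0, "x", ".")

freq
avail
```

With real data, toy would be replaced by the data frame returned from read_ipums_micro(), after converting MIGYRSBR with as_factor() as shown above.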
