How to pass multiple values as IDs to `rvest`

I want to extract a set of numbers from the website https://www.bcassessment.ca/ using PIDs (unique IDs). A sample list of PIDs is shown below:

PID <- c("012-215-023", "024-521-647", "025-891-669")

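For these values, I opened the website manually, selected PID from the list of available options in the site's search box, and searched for each number. The searches redirected me to the following URLs: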

URL <- c("https://www.bcassessment.ca//Property/Info/QTAwMDAwM1hIUA==",
         "https://www.bcassessment.ca//Property/Info/QTAwMDAwNEJKMA==",
         "https://www.bcassessment.ca//Property/Info/QTAwMDAwMUc5OA==")

Then, for each of these URLs, I run the code shown below to extract the total value of the property:
library(rvest)

out <- character(length(URL))
for (i in seq_along(URL)) {
  # Read each property page and pull the total assessed value
  out[i] <- URL[i] %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()
}

This gives me the final result:

[1] "3,000" "7,000" "7,000"

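The problem is that I have a list of more than 50,000 PIDs, and I cannot manually search the website for each one to find the actual link and then run the rvest scraper on it. How would you suggest automating this process, so that I only need to supply the PIDs and get the prices as output?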

Summary: given a known list of PIDs, I want to open https://www.bcassessment.ca/ and extract the most recent price for each of them, and I want this to be done automatically.

Test_PID

I have added a list of PIDs so that you can check whether the code works:

structure(list(P.I.D.. = c("004-050-541", "016-658-540", "016-657-861", 
"016-657-764", "019-048-386", "025-528-360", "800-058-036", "025-728-954", 
"028-445-783", "027-178-048", "028-445-571", "025-205-145", "015-752-798", 
"026-041-308", "024-521-698", "027-541-631", "024-360-651", "028-445-040", 
"025-851-411", "025-529-293", "024-138-436", "023-893-796", "018-496-768", 
"025-758-721", "024-219-665", "024-359-866", "018-511-015", "026-724-979", 
"023-894-253", "006-331-505", "025-961-012", "024-219-690", "027-309-878", 
"028-445-716", "025-759-060", "017-692-733", "025-728-237", "028-447-221", 
"023-894-202", "028-446-020", "026-827-611", "028-058-798", "017-574-412", 
"023-893-591", "018-511-457", "025-960-199", "027-178-714", "027-674-941", 
"027-874-826", "025-110-390", "028-071-336", "018-257-984", "023-923-393", 
"026-367-203", "027-601-854", "003-773-922", "025-902-989", "018-060-641", 
"025-530-003", "018-060-722", "025-960-423", "016-160-126", "009-301-461", 
"025-960-580", "019-090-315", "023-464-283", "028-445-503", "006-395-708", 
"028-446-674", "018-258-549", "023-247-398", "029-321-166", "024-519-871", 
"023-154-161", "003-904-547", "004-640-357", "006-314-864", "025-960-521", 
"013-326-783", "003-430-049", "027-490-084", "024-360-392", "028-054-474", 
"026-076-179", "005-309-689", "024-613-509", "025-978-551", "012-215-066", 
"024-034-002", "025-847-244", "024-222-038", "003-912-019", "024-845-264", 
"006-186-254", "026-826-691", "026-826-712", "024-575-569", "028-572-581", 
"026-197-774", "009-695-958", "016-089-120", "025-703-811", "024-576-671", 
"026-460-751", "026-460-149", "003-794-181", "018-378-684", "023-916-745", 
"003-497-721", "003-397-599", "024-982-211", "018-060-129", "018-061-231", 
"017-765-714", "027-303-799", "028-565-312", "018-061-010", "006-338-232", 
"023-680-024", "028-983-971", "028-092-490", "006-293-239", "018-061-257", 
"028-092-376", "018-060-137", "004-302-664", "016-988-060", "003-371-166", 
"027-325-342", "011-475-480", "018-060-200")), row.names = c(NA, 
-131L), class = c("tbl_df", "tbl", "data.frame"))

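P.S. The website I am referring to is a public website: anyone can open it and enter an address to find a property's assessed value. So I do not think scraping it is a problem, since it is a public database.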

When you submit a PID through the form, the following call is triggered:

GET https://www.bcassessment.ca/Property/Search/GetByPid/012215023?PID=012215023&_=1619713418473

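The call above has the following parameters:

  • 012215023 is the PID you entered, without the dashes (-). It is used both as a path segment and as a query parameter
  • 1619713418473 is the current Unix timestamp in milliseconds (time elapsed since 1970-01-01)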
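
The two parameters above can be assembled in R roughly as follows. This is a minimal sketch: `build_search_url` is a hypothetical helper name, not something the site or any package provides, and it only builds the URL string without sending a request.

```r
# Hypothetical helper: builds the GetByPid URL for one PID.
build_search_url <- function(pid, now = Sys.time()) {
  # Strip the dashes: "012-215-023" becomes "012215023"
  pid_num <- gsub("-", "", pid, fixed = TRUE)
  # Milliseconds since 1970-01-01, printed without scientific notation
  stamp <- format(floor(as.numeric(now) * 1000), scientific = FALSE)
  paste0(
    "https://www.bcassessment.ca/Property/Search/GetByPid/", pid_num,
    "?PID=", pid_num, "&_=", stamp
  )
}

print(build_search_url("012-215-023"))
```

The `now` argument exists only so the timestamp can be fixed when checking the function; in normal use the default is enough.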

The result of the call above is a JSON response like the following:

{
    "sEcho": 1,
    "aaData": [
        ["XXXXXXX", "XXXXXXXX", "XXXXXXXXXXXX", "200-027-615-115-48-0004", "QTAwMDAwM1hIUA=="]
    ]
}

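The call above returns its response with a text/plain content type rather than application/json, so we have to parse it ourselves with jsonlite. Then select the last item of the aaData array, in this case QTAwMDAwM1hIUA==, and build the result URL as in the post.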
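
As an offline illustration of that parsing step, the sketch below feeds the sample response shown above to jsonlite as a plain string. The variable names are illustrative, and it assumes jsonlite's default simplification, which turns aaData into a character matrix with one row per match.

```r
library(jsonlite)

# Sample response body from above, held as plain text
body <- '{"sEcho": 1, "aaData": [["XXXXXXX", "XXXXXXXX", "XXXXXXXXXXXX", "200-027-615-115-48-0004", "QTAwMDAwM1hIUA=="]]}'

parsed <- fromJSON(body)

# Take the first matching row and keep its last item, the property page ID
first_row <- parsed$aaData[1, ]
info_id <- first_row[length(first_row)]

info_url <- paste0("https://www.bcassessment.ca/Property/Info/", info_id)
print(info_url)
```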

The following code takes a list of PIDs and extracts the $ value for each PID:

library(rvest)

getValueForPID <- function(pid) {
  # Remove the dashes: "012-215-023" becomes "012215023"
  pidNum <- gsub("-", "", pid)

  # Current Unix timestamp in milliseconds
  time <- as.numeric(as.POSIXct(Sys.time())) * 1000

  # Look up the PID; the body is text/plain, so read it as text
  response <- httr::GET(
    paste0("https://www.bcassessment.ca/Property/Search/GetByPid/", pidNum),
    query = list("PID" = pidNum, "_" = format(time, digits = 13))
  )
  output <- httr::content(response, "text", encoding = "UTF-8")

  if (output == "found_no_results") {
    return("")
  }
  data <- jsonlite::fromJSON(output)

  # The last (fifth) item of aaData is the property page ID
  id <- data$aaData[5]

  text <- paste0("https://www.bcassessment.ca/Property/Info/", id) %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()

  return(text)
}

PID <- c("004-050-541", "016-658-540", "016-657-861", "016-657-764", "019-048-386", "025-528-360", "800-058-036")

out <- c()
count <- 1
for (i in PID) {
  print(i)
  out[count] <- getValueForPID(i)
  count <- count + 1
}
print(out)

Sample output:

[1] "3,000" "7,000" "7,000"
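
For a list as long as the 50,000 PIDs mentioned in the question, it may help to pause between requests and to keep going when a single lookup fails. The sketch below is one possible way to do that, not part of the original answer: `scrape_all` is a hypothetical wrapper, and the one-second default delay is an arbitrary assumption rather than a limit documented by the site.

```r
# Hypothetical wrapper: applies `fetch` to every PID, pausing between requests
# and recording NA instead of stopping when one lookup fails.
scrape_all <- function(pids, fetch, delay = 1) {
  out <- rep(NA_character_, length(pids))
  names(out) <- pids
  for (i in seq_along(pids)) {
    res <- tryCatch(fetch(pids[i]), error = function(e) NA_character_)
    # Keep the first value only; empty results stay NA
    if (length(res) > 0) out[i] <- as.character(res[1])
    Sys.sleep(delay)
  }
  out
}
```

With the function defined above, this could be called as `scrape_all(PID, getValueForPID)`, which returns a character vector named by PID.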

Kaggle link: https://www.kaggle.com/bertrandmartel/bcassesment-pid