如何从网页中提取特定数据以使用 R 将其添加到已抓取的 table?

How to extract specific data from web page to add it to a scraped table with R?

我已经构建了一个脚本,可以从网络上托管的 table 中提取数据,我已经可以看到 table 了,但是为了补充它,我需要添加提供者数据作为已经构建的 table 的一列,我想知道如何提取提供程序数据以附加到我的 table

纸条 R

library(rvest)

urls.colombia.compra.microsoft <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                               0:11, 
                               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")


base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~.x %>% read_html() %>% html_table())

base.colombia.compra.microsft

urls.colombia.compra.google <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                               0:11, 
                               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_=")

base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~.x %>% read_html() %>% html_table())

base.colombia.compra.google

urls.colombia.compra.nube <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                                      0:11, 
                                      "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_=")

base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~.x %>% read_html() %>% html_table())

base.colombia.compra.nube

base.consolidada.colombia.compra <- data.table::rbindlist(list(base.colombia.compra.microsft, 
                        base.colombia.compra.google, 
                        base.colombia.compra.nube), idcol = 'id')

base.consolidada.colombia.compra

all_urls <- paste0('https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/', base.consolidada.colombia.compra$`Orden de Compra`)

new_res <- purrr::map_df(all_urls, ~.x %>% read_html() %>% html_table() %>% .[[1]] %>% dplyr::mutate(order_number = basename(.x), .before = 1))

new_res

library(dplyr)

Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))

供应商数据如下所示:

Provider data image

由于存在重复的 order_numbers,具有不同的 artículos,对于您当前的结构,为请求的数据添加额外的列可能意味着在给定的所有行中重复相同的新数据order_number。为了满足您的要求,这似乎是最简单的接受条件。

如果没问题,那么您可以简单地添加到您的 mutate() 调用中,在匿名函数中,并添加那些额外的列。就个人而言,我选择添加到 mutate 但用显式​​函数替换匿名函数,如下所示。然后我将该函数传递给 map_df.

我还换掉了抓取所有 table,然后进行索引,以便使用返回 1 个节点的 class 选择器更有效地选择单个 table。

我最后添加了一个 tidy_node 函数,改编自 @hrbrmstr 给出的答案,其中我在单词之间保留 space,其中存在 br 样式在 html。该函数添加了一个 , 以保持可读性。这需要额外的库参考。

library(xml2)

tidy_node <- function(node){
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}

get_order_details <- function(url) {
  page <- url %>%
    read_html()
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  table <- page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url), .before = 1,
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = T),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text()
    )
}

new_res <- purrr::map_df(all_urls, get_order_details)

R:

library(rvest)
library(purrr)
library(dplyr)
library(xml2)

tidy_node <- function(node){
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}

get_order_details <- function(url) {
  page <- url %>%
    read_html()
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  table <- page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url), .before = 1,
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = T),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text()
    )
}


urls.colombia.compra.microsoft <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="
)

base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.microsft

urls.colombia.compra.google <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_="
)

base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.google

urls.colombia.compra.nube <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_="
)

base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.nube

base.consolidada.colombia.compra <- data.table::rbindlist(list(
  base.colombia.compra.microsft,
  base.colombia.compra.google,
  base.colombia.compra.nube
), idcol = "id")

base.consolidada.colombia.compra

all_urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/", base.consolidada.colombia.compra$`Orden de Compra`)

new_res <- purrr::map_df(all_urls, get_order_details)

new_res

Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))