RVEST for Dates in HTML files

Question

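I am trying to scrape the date from saved HTML files: the last-modified date or the published date, whichever the file has. What I have does not work, and I think that is because it is not reading the markup inside the HTML files but only looking at the directory where the files are stored.

I have a directory with hundreds of saved HTML files that need to be processed into a table with these columns: HTML name, date, and date saved to the directory (the last one is less important, since I have the original scrape date).

My current code: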

library(magrittr)
library(rvest)
library(readxl)
library(tidyverse)

setwd("D:/URLtoDateTest")

file_list <- list.files(path = 'D:/URLtoDateTest')
html_data <- data.frame(file_list)

for (i in 1:length(html_names)){
  rawHTML <- lapply(html_data, function(html){
    read_html(html)
}

html_data$date <- read_html(file_list) %>%
  html_nodes("div.review-content-header__dates") %>%
  html_attr("datetime")

Any suggestions?

I have updated the code to the following:

library(rvest)
library(tidyverse)

setwd("D:/URLtoDateTest")

#get list of html files
htmlfiles <- list.files(pattern= "html$")

#loop through the list of files
returneddates <- sapply(htmlfiles, function(file){
  #read file and retrieve the date time
  datetime <-read_html(file) %>%
    html_nodes("div.review-content-header__dates") %>%
    html_attr("datetime")
  datetime
})

#combine into a dataframe
answer <- data.frame(htmlfiles, returneddates)

I get this error message:

> answer <- data.frame(htmlfiles, returneddates)
Error in data.frame(htmlfiles, returneddates) : 
  arguments imply differing number of rows: 40, 0

And datetime ends up as:

character(0)

Example of a page to be scraped (modified to remove all references to the actual page/organization it belongs to).

</script>
    <title>Web Site Disclaimers | Other | ORG</title>
        <meta name="description" content="Page Title Example"/>     <meta name="keywords" content="ORG,  Full Organization Name,  other,  about,  about ORG.gov,  web site disclaimers,  flash disclaimers,  policies and regulations,  image reuse terms and conditions,  key matches,  linking,  linking to ORG.gov,  why link to ORG.gov,  how to link to ORG.gov,  graphic link to ORG.gov,  text link to describe ORG.gov,  questions or comments,  search tips,  site map,  web page badges and buttons,  other languages,  most spoken languages, Policies &amp; Guidelines"/>                       <meta name="robots" content="index, archive" />         <meta property="ORG:template_version" content="4.0"/>
    <meta property="ORG:last_updated" content="December 12, 2019"/>
    <meta property="ORG:last_reviewed" content="December 12, 2019"/>
    <meta property="ORG:content_source" content="Full Organizatio Name"/>
    <meta property="ORG:maintained_by" content="OFFICE OF COMMUNICATION; DIGITAL MEDIA BRANCH"/>
    <meta property="ORG:content_id" content="6318" />
            <link rel="canonical" href="https://www.ORG.gov/other/disclaimer.html"/>    <meta property="ORG:wcms_build" content="4.8.11 - b.2268" />
        <link rel="stylesheet" type="text/css" href="/other/wcms-inc/localrd.css"/>     <style>
.ui-tabs .ui-tabs-panel {height: 100%; max-height: 100%;} 
</style>        <!-- CSS Added Dynamically Here -->     <meta name="DC.date" content="2019-12-13T01:52:39Z" />
                <meta name="ORG:last_published" content="2019-12-13T15:31:13Z" />

Answer 1

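The function read_html() can only read a single value (one file or URL) at a time. In your attempt above, you were trying to pass it a data frame or the entire list.

Your script should look something like this: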

library(rvest)
library(tidyverse)

#get list of html files
htmlfiles <- list.files(pattern= "html$")

#loop through the list of files
returneddates <- sapply(htmlfiles, function(file){
   #read file and retrieve the date time
   datetime <-read_html(file) %>%
             html_node(xpath = ".//meta[@property = 'ORG:last_updated']") %>% 
             html_attr("content")
   datetime
})

#combine into a dataframe
answer <- data.frame(htmlfiles, returneddates)
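
A note on why the script above uses html_node() (singular) rather than html_nodes(): when a selector matches nothing, html_nodes() yields a zero-length result, whereas html_node() yields a single missing node whose attribute is NA. That difference is consistent with the "differing number of rows: 40, 0" error in the question. The following is only a minimal sketch on a made-up one-tag page (the inline HTML string is an assumption for illustration, not one of the real files):

```r
library(rvest)

# Hypothetical minimal page containing only the meta tag of interest
page <- read_html('<html><head><meta property="ORG:last_updated" content="December 12, 2019"/></head><body></body></html>')

# html_nodes() with a selector that matches nothing: zero-length result
none <- page %>%
  html_nodes("div.review-content-header__dates") %>%
  html_attr("datetime")
length(none)    # 0

# html_node() with the same selector: exactly one value, and it is NA
one_na <- page %>%
  html_node("div.review-content-header__dates") %>%
  html_attr("datetime")
length(one_na)  # 1

# html_node() with a selector that does match: the attribute value
found <- page %>%
  html_node(xpath = ".//meta[@property = 'ORG:last_updated']") %>%
  html_attr("content")
```

Because html_node() always gives one value per file, sapply() returns a plain character vector with one entry per file, which data.frame() can combine with htmlfiles even when some files lack the tag.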

Debugging help

htmlfiles <- htmlfiles[1:2] #reduce the file list down for debugging

returneddates <- sapply(htmlfiles, function(file){
   print(file)  #are you opening the correct file?
   #read file and retrieve the date time
   page <- read_html(file)
   divReview <- page %>% html_nodes("div.review-content-header__dates")
   print(divReview) #is a single node found?
   datetime <- divReview %>% html_attr("datetime")
   print(datetime)   #are you extracting the correct attribute?
   datetime
})

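**Update 2 - based on the HTML snippet**

The last-modified date is stored in the content attribute of a meta tag. Based on the HTML snippet, this should work to retrieve that information: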

page<-read_html('   <title>Web Site Disclaimers | Other | ORG</title>
        <meta name="description" content="Page Title Example"/>     <meta name="keywords" content="ORG,  Full Organization Name,  other,  about,  about ORG.gov,  web site disclaimers,  flash disclaimers,  policies and regulations,  image reuse terms and conditions,  key matches,  linking,  linking to ORG.gov,  why link to ORG.gov,  how to link to ORG.gov,  graphic link to ORG.gov,  text link to describe ORG.gov,  questions or comments,  search tips,  site map,  web page badges and buttons,  other languages,  most spoken languages, Policies &amp; Guidelines"/>                       <meta name="robots" content="index, archive" />         <meta property="ORG:template_version" content="4.0"/>
    <meta property="ORG:last_updated" content="December 12, 2019"/>
    <meta property="ORG:last_reviewed" content="December 12, 2019"/>
    <meta property="ORG:content_source" content="Full Organizatio Name"/>
    <meta property="ORG:maintained_by" content="OFFICE OF COMMUNICATION; DIGITAL MEDIA BRANCH"/>
    <meta property="ORG:content_id" content="6318" />
            <link rel="canonical" href="https://www.ORG.gov/other/disclaimer.html"/>    <meta property="ORG:wcms_build" content="4.8.11 - b.2268" />
        <link rel="stylesheet" type="text/css" href="/other/wcms-inc/localrd.css"/>     <style>
.ui-tabs .ui-tabs-panel {height: 100%; max-height: 100%;} 
</style>        <!-- CSS Added Dynamically Here -->     <meta name="DC.date" content="2019-12-13T01:52:39Z" />
                <meta name="ORG:last_published" content="2019-12-13T15:31:13Z" />')

#get the meta node where the property attribute = ORG:last_updated,
# then retrieve the data stored in the content attribute.
lastupdate <- page %>% html_node(xpath = ".//meta[@property = 'ORG:last_updated']") %>% html_attr("content")
lastreview <- page %>% html_node(xpath = ".//meta[@property = 'ORG:last_reviewed']") %>% html_attr("content")
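
Since the question asks for the last-modified date or the published date, whichever the file has, the same lookup can be wrapped in a small helper with a fallback. This is a sketch only: meta_content() is a hypothetical helper name, the inline page is a cut-down version of the snippet, and parsing "December 12, 2019" with "%B" assumes an English locale for month names.

```r
library(rvest)

# Hypothetical helper: content attribute of the first <meta> tag whose
# attribute `attr` equals `value`, or NA when no such tag exists.
meta_content <- function(page, attr, value) {
  xp <- sprintf(".//meta[@%s = '%s']", attr, value)
  page %>% html_node(xpath = xp) %>% html_attr("content")
}

# Cut-down stand-in for one saved file (assumption for illustration)
page <- read_html('<html><head>
  <meta property="ORG:last_updated" content="December 12, 2019"/>
  <meta name="ORG:last_published" content="2019-12-13T15:31:13Z"/>
</head><body></body></html>')

# Note that last_updated uses the "property" attribute, last_published uses "name"
updated   <- meta_content(page, "property", "ORG:last_updated")
published <- meta_content(page, "name", "ORG:last_published")

# "%B" is the full month name in the current locale (English assumed here)
updated_date <- as.Date(updated, format = "%B %d, %Y")

# Keep only the YYYY-MM-DD part of the ISO 8601 timestamp
published_date <- as.Date(substr(published, 1, 10), format = "%Y-%m-%d")

# Prefer the last-updated date; fall back to the published date if it is missing
page_date <- if (!is.na(updated_date)) updated_date else published_date
```

Inside the sapply() loop, the same helper could be called on read_html(file) for each file; returning the raw strings and converting them afterwards keeps the loop simple.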
