Extract dates from a vector of character strings

Question

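I have a vector with two elements. Each element contains a character string with two dates in it. I need to extract the latter of the two dates from each element and build a new vector or list from them.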

#webextract vector
webextract <- list("The Employment Situation, December 2006       January  5  \t 8:30 am\r","The Employment Situation, January 2007        \tFeb.  2, 2007\t 8:30 am            \r") 

#This is what the output of webextract looks like:
[[1]]
[1] The Employment Situation, December 2006       January  5  \t 8:30 am\r

[[2]]
[1] The Employment Situation, January 2007        \tFeb.  2, 2007\t 8:30 am            \r

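webextract is the result of scraping a URL that serves plain text, which is why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit but have not gotten anywhere, and going through all the related SO questions did not help. Thanks for your help.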

Answer 1

We can try gsub after unlisting 'webextract':

gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January  5" "Feb.  2"
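
To show how the pattern works, here is a minimal sketch that assumes the two strings from the question. The first alternative strips everything from the start of the string through the first number (the report title and its year) plus the whitespace after it. The second alternative strips an optional ", year" followed by everything up to and including the time. Note that backslashes must be doubled inside an R string literal. The last step, collapsing the double spaces, is an addition that is not part of the original answer; the name `dates` is just an illustrative choice.

```r
# Input as given in the question
webextract <- list(
  "The Employment Situation, December 2006       January  5  \t 8:30 am\r",
  "The Employment Situation, January 2007        \tFeb.  2, 2007\t 8:30 am            \r"
)

x <- unlist(webextract)

# ^\\D+\\d+\\s+            : leading title, first number (the year), trailing whitespace
# (,\\s+\\d+)*\\D+\\d+:.*$ : optional ", 2007", then everything from the time ("8:30") to the end
dates <- gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", x)

# Assumed extra step: collapse runs of whitespace to a single space
dates <- gsub("\\s+", " ", dates)

dates
#[1] "January 5" "Feb. 2"
```

The pattern relies on the first number in each string being the title's year and on the time containing a colon, so it would probably need adjusting if other lines of the scraped page are laid out differently.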
