Parsing Dates from Text in R

Question

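I keep running into the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. An example text is: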

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

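I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it into a date, e.g. 2015-07-01 UTC (step 2). Step 2 can be done, for example, with parse_date_time from the lubridate package (which is convenient because it handles several date formats):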

Case 1:

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

In some cases, parse_date_time also works on a longer string that contains the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I can tell, step 2 does not work directly on the full example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

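Apparently, some of the additional information in the text makes it difficult to parse the date directly from the full text. I can think of an approach in which step 1 uses a regular expression to extract a reduced string containing the date (similar to case 1 or case 2), to which parse_date_time can then be applied. However, combining regular expressions with dates always feels a bit dirty, because the regular expression does not know whether it has extracted a valid date.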

Is there a way to perform step 2 directly on unstructured text like the example above (case 3), i.e. without a regex-based workaround?

Any input is much appreciated!

Answer 1

Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it doesn't work in R... :(

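It does work once corrected. A character class such as [1-31] matches a single character (1, 2 or 3) rather than the numbers 1 to 31, and [th, st] matches one character rather than a suffix, so both have to be written as alternations. In an R string the backslashes must also be doubled.

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"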
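
The question's worry that a regex cannot tell whether it has extracted a valid date can be handled by treating the regex match only as a candidate and letting a date parser accept or reject it. The following is a minimal base R sketch of that idea, not a definitive implementation. The helper name extract_dates is hypothetical, and it assumes English full month names and the "Month 1st, YYYY" layout from the example text; other layouts would need further patterns.

```r
# Assumption: month names are English; the C locale keeps %B parsing predictable.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical helper (not part of base R or lubridate):
# step 1 extracts candidate strings, step 2 keeps only those that parse.
extract_dates <- function(text) {
  # Spell out the month names instead of matching any word starting with J, F, M, ...
  months  <- paste(month.name, collapse = "|")
  pattern <- paste0("\\b(", months, ") ([1-9]|[12][0-9]|3[01])(st|nd|rd|th), [12][0-9]{3}\\b")

  # Step 1: collect every candidate in the text, not only the first one.
  candidates <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

  # Drop the ordinal suffix so that the format "%B %d, %Y" applies.
  cleaned <- sub("([0-9]+)(st|nd|rd|th),", "\\1,", candidates)

  # Step 2: as.Date() returns NA for impossible dates, which are then discarded.
  parsed <- as.Date(cleaned, format = "%B %d, %Y")
  parsed[!is.na(parsed)]
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
extract_dates(x)
# [1] "2015-07-01"

# A string that looks like a date but is not one is rejected by the parsing step.
extract_dates("Report due February 30th, 2015.")
```

The result is a Date rather than a date-time in UTC. If the lubridate behaviour from the question is preferred, the extracted candidates could be passed to parse_date_time instead of as.Date.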
