R - Extract time along with timezone which is part of string

Question

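I have a large text database, read in as a data frame with a single text column. Each entry contains a few sentences that mention a time in formats such as the following:

Row 1. I tried to call you on xxx-xxx-xxxx but got your voicemail; I have scheduled our next follow up on 6/13/2018 between 12 PM and 2 PM PST.

Row 2. If I hear from them I will call you again today; if not, I will call you tomorrow between 4 - 6PM EST.

Row 3. We will wait for your reply; if we do not hear from you, we will call you tomorrow between 12:00PM to 2:00PM CST

Row 4. As discussed on the call, we have scheduled a callback tomorrow between 12 to 02 PM EST.

Row 5. As you suggested, our next follow up will be on 6/13/2018 between 12 PM TO 2 PM PST.

I only want to extract the time portion together with the time zone (EST/CST/PST).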

Expected Outputs:

6/13/2018 4 PM - 6 PM EST
tomorrow 12 PM TO 2 PM PST

I have tried the following:

library(stringr)

x <- text$string

sc1 <- str_match(x, " follow up on (.*?) T.")

which returns something like:

follow up on 6/13/2018 between 1 PM TO | 6/13/2018 between 1 PM

I then tried to cover the other format with the following code:

sc2 <- str_match(x, " will call you tomorrow between (.*?) T.")

and row-bound the two results to cover both formats ("follow up ..." and "will call you ..."):

sc1rb <- rbind(sc1,sc2)

That did not work.

Is there a way to extract only the time portion and the time zone from the sample strings above?

Thanks in advance!

Answer 1

Here is something that works for the examples. As @MrFlick mentioned, please try to share your data in a reproducible form.

Data

> dput(txt)
c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.", 
"will call you tomorrow between 4 - 6PM EST.", "will call you tomorrow between 12:00PM to 2:00PM CST", 
"will call you tomorrow between 11 AM to 12 PM EST", "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST."
)

Code

> regmatches(txt, regexec('[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})', txt))
[[1]]
[1] " 12 PM and 2 PM PST" "12 PM and 2 PM PST" 

[[2]]
[1] " 4 - 6PM EST" "4 - 6PM EST" 

[[3]]
character(0)

[[4]]
[1] " 11 AM to 12 PM EST" "11 AM to 12 PM EST" 

[[5]]
[1] " 12 PM TO 2 PM PST" "12 PM TO 2 PM PST"

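The output is a list in which each matched element is a character vector of length two: the full match followed by the captured group (see the help page for regmatches). You can simplify it further to get only the output indicated above: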

> unname(sapply(txt, function(z){
   pattern <- '[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})'
   k <- unlist(regmatches(z, regexec(pattern = pattern, z)))
   return(k[2])
 }))
[1] "12 PM and 2 PM PST"    "4 - 6PM EST"           "12:00PM to 2:00PM CST" "11 AM to 12 PM EST"   
[5] "12 PM TO 2 PM PST"

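This is based on the sample input. Of course, if the input is too irregular, it will be hard to get by with a single regular expression. In that case I suggest using several regex functions, called one after another, where each one runs only if the previous one returned NA. Hope this helps!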
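The fallback idea described above could be sketched as follows. This is only a minimal illustration: the helper name extract_first and the two patterns are assumptions made for this example, not part of the original answer, and the patterns would need tuning for real data.

```r
# Sample strings taken from the Data block of this answer.
txt <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
         "will call you tomorrow between 4 - 6PM EST.",
         "will call you tomorrow between 12:00PM to 2:00PM CST")

# Hypothetical helper: return the first capture group of `pattern` for each
# element of `x`, or NA where the pattern does not match.
extract_first <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(k) if (length(k) >= 2) k[2] else NA_character_,
         character(1))
}

# Patterns tried in order (illustrative choices only).
patterns <- c(
  "[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})",  # "12 PM ... PST"
  "[[:space:]]([[:digit:]]{1,2}:.*[[:upper:]]{3})"             # "12:00PM ... CST"
)

result <- rep(NA_character_, length(txt))
for (p in patterns) {
  todo <- is.na(result)                       # rows that are still unmatched
  result[todo] <- extract_first(txt[todo], p) # later patterns fill only the gaps
}
result
```

Rows that no pattern matches stay NA, which makes it easy to inspect them and decide whether another pattern is worth adding.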

Answer 2

This code works for almost all of your cases, except for the substring "4 - 6PM EST". I hope it is useful for your whole dataset.

  data <- c(
    "Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
    "will call you tomorrow between 4 - 6PM EST.",
    "will call you tomorrow between 12:00PM to 2:00PM CST",
    "will call you tomorrow between 11 AM to 12 PM EST",
    "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

  # remove dates such as 6/13/2018 with a regex
  data <- gsub("\\d{1,2}/\\d{1,2}/\\d{4}", "", data)

  # parameters for exclusion and substitution
  excluded_texts <- c("Next follow up on", "between", "will call you tomorrow", ":00", "\\.")
  replaced_input <- c("  ", "\'-", "and", "TO", " AM", " PM")
  replaced_output <- c("", "to", "to", "to", "AM", "PM")

  # delete every excluded text (each entry is treated as a regex)
  for (i in excluded_texts) {
    data <- gsub(i, "", data)
  }

  # apply the substitutions pairwise, in order
  for (j in seq_along(replaced_input)) {
    data <- gsub(replaced_input[j], replaced_output[j], data)
  }

  print(data)

Answer 3

sub(".*?(\\d+\\s*[PA:-].*)", "\\1", data)
[1] "12 PM and 2 PM PST."   "4 - 6PM EST."          "12:00PM to 2:00PM CST"
[4] "11 AM to 12 PM EST"    "12 PM TO 2 PM PST."
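Note that the result above keeps the trailing period on some elements. A possible variant, assuming that EST, CST and PST are the only time zones in the data (as stated in the question), ends the captured group at the time zone abbreviation. The pattern and the variable name cleaned are illustrative and not part of the original answer.

```r
data <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
          "will call you tomorrow between 4 - 6PM EST.",
          "will call you tomorrow between 12:00PM to 2:00PM CST")

# Assumption: only EST, CST and PST occur in the text.
# Group 1 starts at digits followed by optional whitespace and one of
# P, A, ":" or "-", and stops at the first time zone abbreviation.
# The trailing ".*" consumes whatever follows (for example a period).
pattern <- ".*?(\\d+\\s*[PA:-].*?(EST|CST|PST)).*"

# Strings that do not match the pattern are returned unchanged by sub().
cleaned <- sub(pattern, "\\1", data, perl = TRUE)
cleaned
```

Because sub() leaves non-matching strings untouched, it may be worth checking with grepl(pattern, data, perl = TRUE) which rows actually matched.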
