R - Extract time along with timezone which is part of a string
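I have a large database of text, read in as a data frame with a single text column. Each entry contains several sentences, and times are mentioned in formats like the following:
Row 1. I tried to call you on xxx-xxx-xxxx but got voicemail; I have scheduled our next follow up on 6/13/2018 between 12 PM and 2 PM PST.
Row 2. If I hear from them, I will call you again today; if not, I will call you tomorrow between 4 - 6PM EST.
Row 3. We will wait for your reply; if we do not hear back from you, we will call you tomorrow between 12:00PM to 2:00PM CST
Row 4. As discussed on the call, we have scheduled a callback for tomorrow between 12 PM to 02 PM EST.
Row 5. As per your suggestion, our next follow up will be on 6/13/2018 between 12 PM TO 2 PM PST.
I only want to extract the time part along with EST/CST/PST.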
Expected Outputs:
6/13/2018 4 PM - 6 PM EST
tomorrow 12 PM TO 2 PM PST
I have tried the following:
x <- text$string
sc1 <- str_match(x, " follow up on (.*?) T.")
which returns something like:
follow up on 6/13/2018 between 1 PM TO | 6/13/2018 between 1 PM
I then tried to cover the other format with the following code
sc2 <- str_match(x, " will call you tomorrow between (.*?) T.")
and row-bound the results to include both formats ("follow up *" and "will call you *")
sc1rb <- rbind(sc1,sc2)
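This did not work.
Is there a way to extract only the time part and the timezone from the sample strings above?
Thanks in advance!
Here is something that works for the sample. As @MrFlick mentioned, please try to share your data in a reproducible way.
Data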
> dput(txt)
c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
"will call you tomorrow between 4 - 6PM EST.", "will call you tomorrow between 12:00PM to 2:00PM CST",
"will call you tomorrow between 11 AM to 12 PM EST", "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST."
)
Code
> regmatches(txt, regexec('[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})', txt))
[[1]]
[1] " 12 PM and 2 PM PST" "12 PM and 2 PM PST"
[[2]]
[1] " 4 - 6PM EST" "4 - 6PM EST"
[[3]]
character(0)
[[4]]
[1] " 11 AM to 12 PM EST" "11 AM to 12 PM EST"
[[5]]
[1] " 12 PM TO 2 PM PST" "12 PM TO 2 PM PST"
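The output is a list in which each element with a match is a character vector of length two: the full match followed by the captured group (see the help page for regmatches). The third string has no match because this first pattern requires whitespace after the leading digits, and "12:00PM" has a colon there. You can simplify the result further to get only the output indicated above: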
> unname(sapply(txt, function(z){
pattern <- '[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})'
k <- unlist(regmatches(z, regexec(pattern = pattern, z)))
return(k[2])
}))
[1] "12 PM and 2 PM PST" "4 - 6PM EST" "12:00PM to 2:00PM CST" "11 AM to 12 PM EST"
[5] "12 PM TO 2 PM PST"
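This is based on the sample input. Of course, if the input is too irregular, it will be hard to cover it with a single regular expression. If you run into that, I suggest using several regex functions called one after another, each one applied only where the previous one returned NA. Hope this helps!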
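The "one regex after another" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names extract_first and extract_with_fallback are hypothetical (not from the answer above), and the two patterns are simply the space and colon variants of the pattern already shown, tried in order.

```r
# Sample strings, copied from the dput() output above.
txt <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
         "will call you tomorrow between 4 - 6PM EST.",
         "will call you tomorrow between 12:00PM to 2:00PM CST",
         "will call you tomorrow between 11 AM to 12 PM EST",
         "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

# Hypothetical helper: first capture group of `pattern` for each string,
# or NA where the pattern does not match.
extract_first <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(k) if (length(k) >= 2) k[2] else NA_character_,
         character(1))
}

# Illustrative patterns, tried in order: digits followed by a space,
# then digits followed by a colon.
patterns <- c(
  "[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})",
  "[[:space:]]([[:digit:]]{1,2}:.*[[:upper:]]{3})"
)

# Hypothetical helper: apply each pattern only to the strings that are
# still NA after the previous patterns.
extract_with_fallback <- function(x, patterns) {
  out <- rep(NA_character_, length(x))
  for (p in patterns) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- extract_first(p, x[todo])
  }
  out
}

res <- extract_with_fallback(txt, patterns)
print(res)
```

Strings that no pattern matches stay NA, so they are easy to find and inspect before adding another pattern to the list.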
This code works for almost all of your cases, except for the substring "4 - 6PM EST". I hope it is useful for your whole data set.
data=c(
"Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
"will call you tomorrow between 4 - 6PM EST.",
"will call you tomorrow between 12:00PM to 2:00PM CST",
"will call you tomorrow between 11 AM to 12 PM EST",
"Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")
# remove the m/d/yyyy date (backslashes must be doubled inside R strings)
data=gsub("\\d{1,2}/\\d{1,2}/\\d{4}", "", data)
# parameters for exclusion and substitution
excluded_texts=c("Next follow up on","between","will call you tomorrow",":00","\\.")
replaced_input=c(" ","\'-","and","TO"," AM"," PM")
replaced_output=c("","to","to","to","AM","PM")
# strip the surrounding phrases, the ":00" minutes and the trailing period
for (i in excluded_texts){
data=gsub(i, "", data)}
# replace each input token with the output token at the same position
for (j in seq_along(replaced_input)){
data=gsub(replaced_input[j],replaced_output[j],data)
}
print(data)
# the output below corresponds to the original, unmodified strings (txt from the dput above)
sub(".*?(\\d+\\s*[PA:-].*)","\\1",txt)
[1] "12 PM and 2 PM PST." "4 - 6PM EST." "12:00PM to 2:00PM CST"
[4] "11 AM to 12 PM EST" "12 PM TO 2 PM PST."
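The expected outputs in the question also keep the leading date or the word "tomorrow", which the patterns above drop. A rough sketch of one way to add it back, assuming the day is always either an m/d/yyyy date or the literal word "tomorrow" (the helper first_group and the names day_pat and time_pat are hypothetical, introduced only for this illustration):

```r
# Sample strings, copied from the dput() output above.
txt <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
         "will call you tomorrow between 4 - 6PM EST.",
         "will call you tomorrow between 12:00PM to 2:00PM CST",
         "will call you tomorrow between 11 AM to 12 PM EST",
         "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

# Assumption: the day is an m/d/yyyy date or the word "tomorrow".
day_pat  <- "([[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}|tomorrow)"
# Time range plus three-letter timezone, as in the answer above.
time_pat <- "[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})"

# Hypothetical helper: first capture group per string, NA if no match.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(k) if (length(k) >= 2) k[2] else NA_character_,
         character(1))
}

day  <- first_group(day_pat, txt)
time <- first_group(time_pat, txt)

# Combine the two parts; keep NA if either part is missing.
combined <- ifelse(is.na(day) | is.na(time), NA_character_, paste(day, time))
print(combined)
```

Extracting the day and the time separately keeps each pattern simple; other day words such as "today" would need to be added to day_pat.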