Split & Extract date (Free form writing) and hours mentioned as numbers into separate columns - R

Question

I have the following input:

            **ID**        **Input Text**
            1         08/18/2017 8 hours
            2         08/14/2017-10HRS
            3         8/28/17 through 9/1/17 8 hrs per day
            4         08/17/17-6hrs
            5         08/14/2017-8hrs 08/15/2017-8hrs 08/16/2017-8hrs
            6         7.27.2017 -8 hrs, 8.3.2017 8 hours, 8.14.2017 8hrs
            7         08/16/2017 7 hours 10 minutes
            8         8 hrs - 07/11/2017 and 8 hrs 07/12/2017
            9         08/14/17-8hrs // 08/15/17-8hrs
            10        08/14/2017- 7:45 hrs// 08/15/2017- 7:45 hrs//
            11        Wed,  8/16/17 …. Cx missed 6 hrs on 8/14/17… missed 8 hrs on 8/15/17
            12        08/10/2017     8 hrs 
            13        08/11/2017      2 hrs 
            14        08/16/2017      8 hrs
            15        08/07/2017- 4 hours missed- Doctors appt , 08/13/2017 8 hours - Incapacity ,      08/15/2017 -8 hours- Incapacity , 08/16/2017 -3 hours // Doctor
            16        Aug 1, 2017 – 7.75 hours
            17        Aug 2, 2017 – 1.75 hours
            18        Aug 3, 2017 – 3 hours
            19        Aug 4, 2017 – 4 hours
            20        Aug 7, 2017 – 7.75 hours

The expected output is:

So far I have tried splitting the input text, hoping to convert the column to dates with lubridate, but the conversion fails:

dt$Date_lubridate <- mdy(dt$Time)

Warning message:
All formats failed to parse. No formats found.

I wanted to split the column into a date and a number, then convert the date column with lubridate, but I am stuck because the date formats vary.

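x <- dt$Time

# Everything before the first "-" ("\-" is not a valid escape in an R string)
sc1 <- sub("-.*", "", x)

# Everything after the last "-"
sc2 <- sub(".*-", "", x)

# Everything before the first space
sc3 <- sub(" .*", "", x)

fstat <- cbind.data.frame("ID" = dt$ID, "Actual" = x, "Date" = sc1, "time" = sc2, "time2" = sc3)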

I then tried this on sc1:

library(lubridate)
parse_date_time(x = sc1,
                orders = c("d m y", "d B Y", "m/d/y"),
                locale = "eng")

But because of the variation in formats, I get parsing errors.

I think I am all over the place because I am missing some basic operation; any nudge/help in the right direction would be appreciated.

Answer 1

You can use a regular expression to extract the date part you want, and then convert it with mdy().

library(stringr)

# Backslashes must be doubled inside an R string literal
regDate <- "([A-Z][a-z]{2}|\\d{1,2})( |\\/|\\.)\\d{1,2}(,|\\/|\\.) ?\\d{2,4}"
str_extract(dt$Time, regDate) %>% unlist() %>% lubridate::mdy()

The dplyr pipe at the end is used just for convenience.
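Note that str_extract() returns only the first date in each row, while several rows in the question hold more than one date and the hours are still not extracted. Below is a minimal sketch of one way to get both into separate columns, using str_extract_all(). The sample rows are adapted from the question (with a plain hyphen in the last one), and the hours pattern regHours is my own assumption, not something from the question: it picks up a number directly followed by "hours"/"hrs". It only works for rows where every date has exactly one hours value.

```r
library(stringr)
library(lubridate)

# A few rows adapted from the question; `dt` and `Time` follow the names used there
dt <- data.frame(
  ID = c(1, 5, 8, 16),
  Time = c("08/18/2017 8 hours",
           "08/14/2017-8hrs 08/15/2017-8hrs 08/16/2017-8hrs",
           "8 hrs - 07/11/2017 and 8 hrs 07/12/2017",
           "Aug 1, 2017 - 7.75 hours"),
  stringsAsFactors = FALSE
)

# Same date pattern as above
regDate <- "([A-Z][a-z]{2}|\\d{1,2})( |\\/|\\.)\\d{1,2}(,|\\/|\\.) ?\\d{2,4}"

# Assumed hours pattern: an integer or decimal followed by "hour(s)" or "hr(s)"
regHours <- regex("\\d+(?:\\.\\d+)?(?=\\s*(?:hours?|hrs?))", ignore_case = TRUE)

# One list element per row, each holding all matches found in that row
dates <- str_extract_all(dt$Time, regDate)
hours <- str_extract_all(dt$Time, regHours)

# Keep only rows where dates and hours pair up one to one
ok <- lengths(dates) == lengths(hours) & lengths(dates) > 0

# Long format: one output row per (date, hours) pair
result <- data.frame(
  ID    = rep(dt$ID[ok], times = lengths(dates)[ok]),
  Date  = mdy(unlist(dates[ok])),
  Hours = as.numeric(unlist(hours[ok]))
)

print(result)
```

Rows such as "8/28/17 through 9/1/17 8 hrs per day" (a date range), "7:45 hrs" and "7 hours 10 minutes" do not fit this one-date-one-number assumption and would need their own handling; with the sketch above, rows whose counts do not match are simply dropped by the `ok` filter.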

