Converting a date column from Excel/CSV to R gives the wrong date

Question

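I received a CSV file that I want to analyze in R, but I've run into a problem with the date column that I have never encountered before.

When I open the file in Excel, a given date is displayed as 22.12.2020 00:00 in the cell and as 22.12.2020 00:00:00 in the formula bar. When read into R with readr::read_csv2, it comes in as "22.12.2020 00:00" with class character. When I try to convert the column to a date or datetime with lubridate::as_date or lubridate::as_datetime, I get 2022-12-20 and 2022-12-20 20:00:00, respectively. My guess is that this is caused by the missing seconds in the original string. I tried appending ":00" to the end of the string before converting, but that only results in NA. Can anyone tell me how to fix this?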

library(dplyr)
library(lubridate)

test4 <- structure(list(ORDER_STATUS_DATE = 20201222, DAY = "22.12.2020 00:00"),
                   row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

test4 %>% 
  mutate(DAY = as_datetime(DAY))

# Returns 2022-12-20 20:00:00 but should ideally have returned 2020-12-22 00:00:00

test4 %>% 
  mutate(DAY = as_date(DAY))

# Returns 2022-12-20

test4 %>% 
  mutate(DAY = DAY %>% paste0(":00:00"))
  
# Returns 22.12.2020 00:00:00:00 so converting to date or datetime leads to NAs

Answer 1

If the date format is ambiguous, you need to specify what it is; lubridate is great for this.

lubridate::dmy_hm("22.12.2020 00:00")
#> [1] "2020-12-22 UTC"
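
Applied to the column from the question, this might look like the following minimal sketch. It rebuilds the example data as a plain data frame (an assumption made only to keep the snippet self-contained) and stores the result in a hypothetical variable `fixed`:

```r
library(dplyr)
library(lubridate)

# Stand-in for the data from the question (plain data frame for simplicity)
test4 <- data.frame(ORDER_STATUS_DATE = 20201222, DAY = "22.12.2020 00:00")

fixed <- test4 %>%
  mutate(DAY = dmy_hm(DAY))  # read the string as day.month.year hour:minute

class(fixed$DAY)  # POSIXct/POSIXt instead of character
```

dmy_hm() uses UTC by default; pass `tz =` if the timestamps belong to a different time zone.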

Answer 2

You don't necessarily need lubridate here (although it is a great package):

as.POSIXct(test4$DAY, tz = "UTC",  format="%d.%m.%Y %H:%M")

Returns:

"2020-12-22 UTC"
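
If you would rather avoid the character column altogether, one option is to declare the format when reading the file. A sketch, assuming readr 2.0 or later (for `I()` literal input) and using an inline two-column string in place of your actual file, which I don't have:

```r
library(readr)

# Inline stand-in for the real CSV (semicolon-delimited, as read_csv2 expects)
csv_text <- "ORDER_STATUS_DATE;DAY\n20201222;22.12.2020 00:00\n"

parsed <- read_csv2(
  I(csv_text),
  col_types = cols(
    ORDER_STATUS_DATE = col_double(),
    # Same format string as in the as.POSIXct() call above
    DAY = col_datetime(format = "%d.%m.%Y %H:%M")
  )
)

parsed$DAY
```

With a real file you would pass its path instead of `I(csv_text)`; the column names here are taken from the example in the question.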


r

lubridate