Extracting dates from long filenames

Question

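I have read some of the other questions here about extracting dates (or other parts) from filenames, but I can't get any of those answers to work with my filenames. I have a list of more than 15,000 filenames in a directory, and I need to extract the date from each filename so I can work out which dates I am missing (I should have 15,706 in total, but in some directories I only have ~15,600).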

Here is an example:

maxTemps <- list.files("./Daily/Daily_TMax/", recursive = TRUE, pattern = ".asc$", full.names = FALSE)
length(maxTemps)
[1] 15697

head(maxTemps)
[1] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc"
[3] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700103.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700104.asc"
[5] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700105.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700106.asc"

tail(maxTemps)
[1] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121226.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121227.asc"
[3] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121228.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121229.asc"
[5] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121230.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121231.asc"

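I have been able to get the year (based on the folder) with the following code:

regmatches(maxTemps, regexpr("[0-9]{4}", maxTemps))

I thought I could use this with invert = TRUE to return the rest of the string, because if I try to include the constant part of the filename in the regexpr pattern, I get an error:

maxTempsFiles <- regmatches(maxTemps, regexpr("[0-9]{4}\/(eMAST_ANUClimate_day_tmax_v1m0_)", maxTemps), invert = TRUE)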
Error: '\/' is an unrecognized escape in character string starting ""[0-9]{4}\/"

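So I thought I could use the code that works, then strip out the constant part of the filename, leaving the date, and then just remove .asc with sub, but this returns some garbled text:

maxTempsFiles <- regmatches(maxTemps, regexpr("[0-9]{4}", maxTemps), invert = TRUE)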
maxTempsFiles <- sub(x = maxTempsFiles, pattern = "/eMAST_ANUClimate_day_tmax_v1m0_", replacement = "")
maxTempsFiles <- sub(x = maxTempsFiles, pattern = ".asc", replacement = "")
head(maxTempsFiles)
[1] "c(\"\", \"19700101\")" "c(\"\", \"19700102\")" "c(\"\", \"19700103\")" "c(\"\", \"19700104\")" "c(\"\", \"19700105\")"
[6] "c(\"\", \"19700106\")"

The filenames always contain /eMAST_ANUClimate_day_prec_v1m0_; only the leading folder changes, and the end of the filename runs from 19700101.asc to 20121231.asc.

If anyone could offer some code or advice on how best to do this, that would be great.

Answer 1

Here is a simple way to pull a partial match out of a string using groups: the replacement returns the group that holds the part you want.

gsub("(^.*_)(\\d+)\\.asc$", "\\2", x)

Regex explanation:

group 1:
  (^.*_) - match the beginning of the string (^) and then any characters up to the last _ before the digits (.* is greedy)
group 2:
  (\d+) - match one or more digits (+); written \\d inside an R string
no group:
  \.asc$ - finally, match a literal .asc, which must be the end of the string ($)
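
As a quick check of the pattern described above, here is a minimal sketch run on two filenames copied from the question. The vector name x and the result name dates_chr are placeholders chosen for illustration.

```r
# Two sample file names in the same shape as those in the question.
x <- c("1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc",
       "2012/eMAST_ANUClimate_day_tmax_v1m0_20121231.asc")

# Replace the whole match with group 2, i.e. the digits just before ".asc".
# Backslashes are doubled because the pattern is an R string literal.
dates_chr <- gsub("(^.*_)(\\d+)\\.asc$", "\\2", x)
dates_chr
```

Each element of dates_chr should then be the eight-digit date string from the end of the corresponding filename.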

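The replacement argument of gsub is what the matched part of the string is replaced with; here it is a backreference to the desired group. For group 2 you need "\\2" (the backslash is doubled inside an R string). The difference between sub and gsub is that sub replaces only the first match in each string, while gsub replaces every match; both are vectorized over x. Because this pattern is anchored with ^ and $, it can match at most once per string, so sub would give the same result here.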
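
Once the date strings are extracted, one possible way to find the missing days (the goal stated in the question) is to compare them with a complete daily sequence. This is a sketch under assumptions: dates_chr is a hypothetical, shortened stand-in for the extracted strings, and the date range is cut down to five days so the result is easy to read.

```r
# Hypothetical extracted date strings; 2 and 3 January are left out on purpose.
dates_chr <- c("19700101", "19700104", "19700105")

# Convert the yyyymmdd strings to Date objects.
have <- as.Date(dates_chr, format = "%Y%m%d")

# Every day in the period of interest (shortened here for illustration).
all_days <- seq(as.Date("1970-01-01"), as.Date("1970-01-05"), by = "day")

# Days in the full sequence that have no matching file.
missing_days <- all_days[!all_days %in% have]
missing_days
```

For the real data the sequence would run from as.Date("1970-01-01") to as.Date("2012-12-31"), which contains 15,706 days and so agrees with the expected file count given in the question.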
