Conditional subset of data from a list based on date in R

Question

I have several .csv files containing hourly data. Each file holds the data for one point in space. The start and end dates differ from file to file.

The data can be read into R using:

lstf1<- list.files(pattern=".csv")

lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))

head(lst2[[800]])
             datetime precip code
1 2003-12-30 00:00:00     NA    M
2 2003-12-30 01:00:00     NA    M
3 2003-12-30 02:00:00     NA    M
4 2003-12-30 03:00:00     NA    M
5 2003-12-30 04:00:00     NA    M
6 2003-12-30 05:00:00     NA    M

datetime is in YYYY-MM-DD HH:MM:SS format, precip is the data value, and code can be ignored.

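For each data frame (df) in lst2, I want to select the data for the period 2015-04-01 to 2015-11-30 according to the following conditions:

1) If precip in the df is NA for the whole of this period, drop the df (do not select it).
2) If precip is not all NA, select it.

The desired output (lst3) contains the data subset to the period 2015-04-01 to 2015-11-30.

All data frames in lst3 should have the same length in days and hours, with missing precip values marked as NA.

I can write the files in lst3 to my directory using something like: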

sapply(names(lst2),function (x)  write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))

The link to a sample file can be found here (~200 KB)

Answer 1

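From what you've written, it sounds like you just want to subset your list of files to those where data exists in the precip column for this particular date range.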

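# TRUE if precip has at least one non-NA value within [start, end].
# datetime is compared as text; this matches chronological order only because
# the timestamps are zero-padded ("2015-04-01 00:00:00", not "0:00:00").
valuesExist <- function(df, start="2015-04-01 00:00:00", end="2015-11-30 23:59:59"){
  sub.df <- df[df$datetime>=start & df$datetime<=end,]
  if(sum(is.na(sub.df$precip))==nrow(sub.df)){return(FALSE)}else{return(TRUE)}
}
lst2.bool <- sapply(lst2, valuesExist)  # logical vector; a list from lapply() cannot be used as an index
lst2 <- lst2[lst2.bool]
lst3 <- lapply(lst2, function(x) {x[x$datetime>="2015-04-01 00:00:00" & x$datetime<="2015-11-30 23:59:59",]})
# The write step assumes lst2 is named (e.g. names(lst2) <- lstf1 right after reading);
# lapply() over the output of list.files() returns an unnamed list.
sapply(names(lst2), function (x)  write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))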

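If you want dynamic start and end times, pass variables holding those values to the valuesExist function and replace the string timestamps in the lst3 assignment with the same variables.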
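A minimal sketch of that parameterised version, run on a small made-up list so it stands on its own. The names `toy`, `start`, `end`, and `keep` are illustrative and not from the question, and the datetimes are assumed to be zero-padded text as in the sample output.

```r
# Hypothetical toy data standing in for lst2 (not the asker's files)
toy <- list(
  a = data.frame(datetime = c("2015-03-31 23:00:00", "2015-04-01 00:00:00", "2015-04-01 01:00:00"),
                 precip   = c(0.2, NA, 1.5), stringsAsFactors = FALSE),
  b = data.frame(datetime = c("2015-04-01 00:00:00", "2015-04-01 01:00:00"),
                 precip   = c(NA, NA), stringsAsFactors = FALSE)
)

# The range is defined once and reused in both steps
start <- "2015-04-01 00:00:00"
end   <- "2015-11-30 23:59:59"

# TRUE when precip has at least one non-NA value between start and end
valuesExist <- function(df, start, end) {
  sub.df <- df[df$datetime >= start & df$datetime <= end, ]
  !all(is.na(sub.df$precip))
}

keep <- sapply(toy, valuesExist, start = start, end = end)
lst3 <- lapply(toy[keep], function(x) x[x$datetime >= start & x$datetime <= end, ])
```

Here `b` is dropped because its precip is NA throughout the range, while `a` is kept with only its in-range rows, including the one where precip is NA.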

If you'd like to combine the two loops into one, be my guest, but I prefer to use a boolean variable when subsetting.
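For comparison, a single-pass variant might look like the sketch below. It is one possible way to merge the two steps, not the approach this answer recommends; `subsetIfData` and `toy` are hypothetical names, and the toy data is made up.

```r
# Return the in-range rows, or NULL when precip is entirely NA in the range
subsetIfData <- function(df, start = "2015-04-01 00:00:00", end = "2015-11-30 23:59:59") {
  sub.df <- df[df$datetime >= start & df$datetime <= end, ]
  if (all(is.na(sub.df$precip))) NULL else sub.df
}

# Hypothetical toy data standing in for lst2
toy <- list(
  a = data.frame(datetime = c("2015-04-01 00:00:00", "2015-12-01 00:00:00"),
                 precip   = c(0.4, 0.9), stringsAsFactors = FALSE),
  b = data.frame(datetime = c("2015-04-01 00:00:00", "2015-04-01 01:00:00"),
                 precip   = c(NA, NA), stringsAsFactors = FALSE)
)

res <- lapply(toy, subsetIfData)
res <- Filter(Negate(is.null), res)  # drop the data frames that returned NULL
```

The trade-off is that the keep/drop decision is no longer available as a separate logical vector, which is what the two-step version gives you.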

Answer 2

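It's a little hard to understand exactly what you're trying to do, but this example on the file you provided (using dplyr, which has a nice filter syntax) should get you close: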

library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")

# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip), 
                     datetime >= as.POSIXct("2015-04-01"),
                     datetime < as.POSIXct("2015-12-01"))

# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
    df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"), 
                            datetime < as.POSIXct("2015-12-01"))
    # Then add df.result to your list of data frames...
} # else, don't add it to your list

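I think you're saying that you want to keep the NAs in a data frame if it also has valid precip values, and that you only want to discard a data frame when precip is NA for the entire period. If you just want to strip out all the NAs, the first filter statement alone will do. You obviously don't need the POSIXct conversion if you've already encoded your dates correctly some other way.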
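The difference between the two behaviours can be seen on a three-row example. This is a base R sketch on made-up data rather than the dplyr code above; `from`, `to`, `in_range`, `no_na`, and `with_na` are illustrative names, and the UTC time zone is an assumption.

```r
# Made-up data: one row before the range, one NA inside it, one valid value inside it
df <- data.frame(
  datetime = as.POSIXct(c("2015-03-31 23:00", "2015-04-01 00:00", "2015-04-01 01:00"), tz = "UTC"),
  precip   = c(0.1, NA, 2.0)
)

from <- as.POSIXct("2015-04-01", tz = "UTC")
to   <- as.POSIXct("2015-12-01", tz = "UTC")
in_range <- df$datetime >= from & df$datetime < to

# Strip every NA row: equivalent in spirit to the first filter statement
no_na <- df[in_range & !is.na(df$precip), ]

# Keep NA rows as long as the range has at least one valid value
with_na <- if (nrow(no_na) > 0) df[in_range, ] else NULL
```

`no_na` keeps only the row with a valid value, whereas `with_na` keeps both in-range rows, NA included.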

Edit: with a function wrapper, so you can use lapply:

library(dplyr)

# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA

# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)  

# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
    if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01"))) > 
        0) {
        return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
    }
}

# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x) 

# Filter the list
results <- compact(lapply(df.list, filterprecip))

# Check that you got a list of 2 data frames in the right date range
str(results)

Tags: r, date, subset