How to subset efficiently using a loop in R?
I have a csv file named "table_parameter". Please download it from here. The data look like this:
time avg.PM10 sill range nugget
1 2012030101 52.2692307692308 0.11054330 45574.072 0.0372612157
2 2012030102 55.3142857142857 0.20250974 87306.391 0.0483153769
3 2012030103 56.0380952380952 0.17711558 56806.827 0.0349567088
4 2012030104 55.9047619047619 0.16466350 104767.669 0.0307528346
.
.
.
25 2012030201 67.1047619047619 0.14349774 72755.326 0.0300378129
26 2012030202 71.6571428571429 0.11373430 72755.326 0.0320594776
27 2012030203 73.352380952381 0.13893530 72755.326 0.0311135434
28 2012030204 70.2095238095238 0.12642303 29594.037 0.0281416079
.
.
In my data frame there is a variable named time that contains hourly values from 1 March 2012 to 7 March 2012, stored as numbers. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
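To illustrate the encoding, here is a minimal base R sketch that splits such a code into a date and an hour. The vector `time_codes` is a hypothetical sample made up for the example, not values read from the real file.

```r
# Hypothetical sample of time codes in YYYYMMDDHH form (not read from the real file)
time_codes <- c(2012030101, 2012030102, 2012030124, 2012030701)

# Convert to character without scientific notation so substr() sees all ten digits
codes_chr <- format(time_codes, scientific = FALSE, trim = TRUE)

# First eight digits are the date, last two digits are the hour of the day
day  <- as.Date(substr(codes_chr, 1, 8), format = "%Y%m%d")
hour <- as.integer(substr(codes_chr, 9, 10))

decoded <- data.frame(time = codes_chr, day = day, hour = hour)
print(decoded)
```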
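From this dataset, I want (24*11) subset data frames, like the table below:
For example, for 1 a.m. (2012030101, 2012030201 ... 2012030701) and avg.PM10<10, I want 1 data frame. In this case, you may find that some data frames have no observations. That is fine, because I will be working with a very large dataset.
I can do this subsetting manually by writing (24*11) 264 lines of code like this!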
table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))
par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times ==24 & avg.PM10>100)
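But I know this code is very inefficient. Is there a way to do this efficiently with a loop?
FYI: later on, I want to draw some plots using these (24*11) datasets.
Update: After this subsetting, I want to draw boxplots of range for each dataset. The problem is that I want to show all (24*11) boxplots of range in one figure, laid out like a matrix [as in the figure above]! If you have any further questions, please let me know. Thanks in advance.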
How about a double loop like this:
table_par <- read.csv("table_parameter.csv")
times <- as.numeric(substr(table_par$time, 9, 10))
# create empty data frame for output
sub.df <- data.frame(name=NA, X=NA, time=NA, Avg.PM10=NA, sill=NA, range=NA, nugget=NA)[numeric(0), ]
t_list <- seq(1, 24, 1)
PM_list <- seq(0, 100, 10)
for (t in t_list) {
  for (PM in PM_list) {
    # upper bound of the interval; the last interval (> 100) is open-ended
    PM2 <- if (PM == 100) Inf else PM + 10
    sub <- subset(table_par, times == t & Avg.PM10 > PM & Avg.PM10 <= PM2)
    if (nrow(sub) != 0) { # skip empty subsets
      name <- paste("par_", t, "am_", PM, "to", PM2, sep="")
      sub$name <- name
      sub.df <- rbind(sub.df, sub)
    }
  }
}
sub.df # print data frame
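As an alternative to the explicit double loop, the same grouping can be sketched with base R's `cut()` and `split()`, which returns one data frame per hour/interval combination in a named list. This is only a sketch under assumptions: the small `table_par` below is a hypothetical stand-in for the real file, using the column names from the code above.

```r
# Hypothetical stand-in for table_parameter.csv (values made up for illustration)
table_par <- data.frame(
  time     = c(2012030101, 2012030102, 2012030201, 2012030202, 2012030301),
  Avg.PM10 = c(52.3, 55.3, 57.1, 71.7, 120.4),
  range    = c(45574.072, 87306.391, 72755.326, 72755.326, 29594.037)
)

# Hour of day from the last two digits of the time code
hour <- as.integer(substr(format(table_par$time, scientific = FALSE, trim = TRUE), 9, 10))

# Right-closed intervals (0,10], (10,20], ..., (90,100], (100,Inf], as in the question
pm_bin <- cut(table_par$Avg.PM10, breaks = c(seq(0, 100, 10), Inf))

# One data frame per hour/interval combination; drop = TRUE omits empty combinations
subsets <- split(table_par, list(hour, pm_bin), drop = TRUE)

print(names(subsets))
```

Each element can then be accessed by name, for example `subsets[["1.(50,60]"]]`, instead of creating a separate variable for every combination.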
You can do this with some plyr, dplyr and tidyr magic:
library(tidyr)
library(dplyr)
# plyr is not loaded here because it interferes with dplyr; it is only needed
# for the round_any function, which is called as plyr::round_any
# Read data
dfData <- read.csv("table_parameter.csv")
dfData %>%
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>%
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>%
  # Count the number of occurrences per hour and PM10 interval
  count(roundedPM.10, hour) %>%
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)
If you plan to use ggplot2, you can drop tidyr and the last line of the code so that the data frame stays in long format, which makes plotting easier.
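To make the long versus wide distinction concrete, here is a small base R sketch (not the dplyr/tidyr pipeline above). The `hour` and `pm_bin` vectors are hypothetical values chosen for illustration.

```r
# Hypothetical hours and already-binned PM10 values (lower bound of each interval)
hour   <- c(1, 1, 2, 2, 2)
pm_bin <- c(50, 50, 50, 70, 100)

# Long format: one row per (pm_bin, hour) combination with its count n
long <- as.data.frame(table(pm_bin = pm_bin, hour = hour), responseName = "n")
long <- long[long$n > 0, ]  # keep only combinations that actually occur

# Wide format: intervals in rows, hours in columns
wide <- xtabs(n ~ pm_bin + hour, data = long)

print(long)
print(wide)
```

The long table has one row per observed combination, which is the shape ggplot2 expects, while the wide table is easier to read as a summary.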
Edit: After reading your comment, I realized I had misunderstood your question. This will give you boxplots for every hour and Avg.PM10 interval:
library(tidyr)
library(dplyr)
library(ggplot2)
# plyr is not loaded here because it interferes with dplyr; it is only needed
# for the round_any function, which is called as plyr::round_any
# Read data
dfData <- read.csv("C:/Users/pformont/Desktop/table_parameter.csv")
dfDataPlot <- dfData %>%
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>%
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)
# Plot range as a function of hour (as a factor to have separate boxplots)
# and facet it by roundedPM.10, one row of panels per interval
ggplot(dfDataPlot, aes(factor(hour), range)) +
  geom_boxplot() +
  facet_grid(roundedPM.10~.)