Read an irregular text data file into R
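I am trying to import data from a text file that is not in data.frame shape; it contains multiple precipitation reports. The reports all share the same layout, and one example is shown below: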
I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
INFORMATION SYSTEM
PRECIPITATION TOTAL VALUES (mms) NATIONAL ENVIRONMENTAL
DATE OF PROCESS : 2015/09/15 YEAR 1980 STATION ID : 11010010 VUELTA LA
LAT 0527 N TIPO EST PM STATE CHOCO INSTALLATION DATE 1943-ENE
LON 7632 W ENT 01 IDEAM CITY LLORO FECHA-SUSPENSION
ELE 100 m.s.n.m REGIONAL 01 ANTIOQUIA CORRIENTE ANDAGUEDA
DAY JAN * FEB * MAR * APR * MAY * JUN * JUL * AGO * SEP * OCT * NOV * DEC *
01 30.0 .0 .0 3.0 80.0 .0 3.0 .0 35.0 88.0 1.0
02 .0 1.0 .0 1.0 100.0 .0 .0 6.0 1.0 65.0 69.0
03 35.0 100.0 .0 10.0 .0 .0 .0 70.0 40.0 42.0 16.0
04 .0 .0 80.0 3.0 140.0 8.0 .0 135.0 20.0 48.0 15.0
05 .0 .0 .0 8.0 3.0 20.0 4.0 19.0 80.0 .0 20.0
06 .0 .0 100.0 138.0 .0 6.0 .0 4.0 20.0 .0 10.0
07 31.0 10.0 .0 30.0 15.0 50.0 6.0 .0 4.0 .0 .0
08 .0 44.0 .0 10.0 40.0 .0 .0 .0 7.0 .0 4.0
09 35.0 3.0 23.0 .0 20.0 140.0 .0 6.0 .0 32.0 16.0
10 .0 75.0 .0 .0 60.0 .0 .0 23.0 3.0 1.0 5.0
11 .0 17.0 .0 15.0 80.0 .0 .0 80.0 .0 .0 3.0
12 .0 75.0 .0 8.0 .0 63.0 10.0 .0 .0 17.0 10.0
13 .0 20.0 .0 60.0 .0 .0 .0 110.0 50.0 3.0 25.0
14 55.0 .0 26.0 12.0 .0 3.0 140.0 4.0 74.0 .0 38.0
15 .0 .0 3.0 7.0 10.0 .0 6.0 .0 35.0 12.0 27.0
16 .0 4.0 89.0 20.0 3.0 .0 .0 10.0 .0 .0 .0
17 45.0 .0 9.0 .0 30.0 .0 2.0 .0 60.0 103.0 .0
18 30.0 .0 .0 .0 21.0 .0 20.0 15.0 .0 .0 .0
19 .0 130.0 .0 10.0 12.0 8.0 .0 3.0 20.0 49.0 40.0
20 45.0 .0 25.0 190.0 .0 38.0 8.0 .0 8.0 3.0 1.0
21 1.0 .0 45.0 50.0 .0 35.0 .0 2.0 13.0 1.0 4.0
22 .0 .0 20.0 .0 .0 .0 .0 16.0 10.0 12.0 50.0
23 40.0 .0 40.0 16.0 .0 30.0 .0 13.0 2.0 106.0 10.0
24 .0 .0 45.0 60.0 .0 3.0 .0 25.0 .0 16.0 .0
25 .0 .0 .0 .0 18.0 10.0 .0 3.0 .0 50.0 20.0
26 10.0 .0 .0 .0 9.0 6.0 20.0 20.0 6.0 15.0 3.0
27 .0 135.0 60.0 40.0 80.0 15.0 .0 18.0 10.0 77.0 .0
28 10.0 .0 9.0 15.0 .0 .0 .0 6.0 72.0 102.0 .0
29 23.0 6.0 .0 .0 .0 .0 .0 23.0 .0 34.0 .0
30 .0 10.0 .0 20.0 3.0 .0 64.0 14.0 111.0 .0
31 .0 31.0 10.0 .0 .0 .0
*** ANNUAL VALUES ***
TOTAL 6954.0
No DE RAIN DAYS 210
MAX 24 Hrs 190.0
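The text file contains one report after another, and every report starts with the same header, "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have already read the text file with the readLines() function, and I would like to build a data frame holding the information from every report, like this: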
DATE STATION_ID LAT LON ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010 0527 N 7632 W 100 LLORO CHOCO 0
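I have been trying to split out each report and then parse it line by line. Unfortunately, this is a slow process. I have looked at the questions on this site about delimited files, but I am somewhat stuck.
Thanks in advance.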
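Here is one approach.
- Use readLines() to read a whole page, 56 lines.
- Identify the header information from the known line numbers and character positions of latitude, longitude, elevation, city, state and year. Use substr().
- With the year obtained there, write out all the dates of that year, and cbind the header information onto them.
- Use a function that takes the day of the month and the month number and locates the corresponding precipitation value on the page. The line number is 14 + dayOfMonth, and the horizontal offset can come from a vector of 12 numbers, one per month. Add that column to your page.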
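The positional lookup in the last step works because substr() is vectorized over its start and stop arguments. The following is only a minimal sketch on a made-up miniature page: the two rows, the three 6-character month fields and the names rows, field_end and cells are illustrative assumptions, not the real IDEAM layout.

```r
# Hypothetical miniature "page": a day label in columns 1-2, then three
# right-aligned 6-character month fields ending at columns 8, 14 and 20.
rows <- c(
  sprintf("%02d%6s%6s%6s", 1L, "30.0", ".0", "3.0"),
  sprintf("%02d%6s%6s%6s", 2L, ".0", "", "100.0")  # blank field = no value
)

# Column of the last character of each month field (assumed layout).
field_end <- c(8, 14, 20)

# One (day, month) pair per value to extract.
cells <- expand.grid(day = 1:2, month = 1:3)

# substr() recycles over x, start and stop, so one call reads every cell.
txt <- substr(rows[cells$day],
              field_end[cells$month] - 5,
              field_end[cells$month])

# A blank field becomes NA rather than picking up the neighbouring column.
cells$value <- suppressWarnings(as.numeric(txt))
cells
```

Reading by column position rather than splitting on whitespace matters when a field is blank (for example, 30 February): a whitespace split would shift every later value one month to the left.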
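If you rbind each page as you go, you end up with one long (!) tidy data set. [edit] If your data set is large, you will also spend an eternity on memory management. Instead, you can build a list of data frames and bind them once at the end. For details, see this question and this question.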
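The list-then-bind pattern can be sketched on its own as follows. make_page() is a hypothetical stand-in for a per-page parser, and base R's do.call(rbind, ...) is used here only to keep the sketch free of extra packages.

```r
# Hypothetical stand-in for a per-page parser: returns one small data frame.
make_page <- function(i) {
  data.frame(page = i, day = 1:3, rainfall = c(0, 1.5, 30) * i)
}

n_pages <- 4

# Preallocate the list and fill it inside the loop ...
pages <- vector(mode = "list", length = n_pages)
for (i in seq_len(n_pages)) {
  pages[[i]] <- make_page(i)
}

# ... then bind once at the end instead of calling rbind() on every pass.
combined <- do.call(rbind, pages)
```

Growing a data frame with rbind() inside the loop copies everything accumulated so far on each iteration, which is where the time goes on large inputs; binding once avoids that repeated copying.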
Here is some code I came up with; you can test it on a short excerpt first.
library("lubridate")  # mday(), month()

raw2page <- function(rawdata) {
  # Takes a character vector holding one page of data, returns a tidy data frame

  # Template for the page header: line within the page, first and last column
  yearbound      <- c(5, 60, 63)
  stationbound   <- c(5, 105, 112)
  latbound       <- c(7, 16, 19)
  longbound      <- c(8, 16, 19)
  deptobound     <- c(7, 81, 101)
  municipiobound <- c(8, 81, 101)
  framebounds <- rbind(yearbound, stationbound, latbound, longbound, deptobound, municipiobound)
  colnames(framebounds) <- c("line", "start", "end")
  framebounds <- as.data.frame(framebounds)

  # Cut each header field out of its line; the result is a one-row data frame
  framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line], start, end))),
                             stringsAsFactors = FALSE)
  colnames(framedata) <- c("year", "station", "latitude", "longitude", "depto", "municipio")

  # Strip leading and trailing whitespace (the backslashes must be doubled in an R string)
  trim <- function(x) gsub("^\\s+|\\s+$", "", x)
  framedata$depto <- trim(framedata$depto)
  framedata$municipio <- trim(framedata$municipio)

  # Make a column listing all dates of the year
  st <- as.Date(paste(framedata$year[1], "-01-01", sep = ""))
  en <- as.Date(paste(framedata$year[1], "-12-31", sep = ""))
  date <- seq(st, en, by = "day")
  pagedata <- cbind(framedata, date)

  # Horizontal offsets for the last digit of each month (the last digit is aligned)
  mboundaries <- c(25, 34, 43, 52, 61, 70, 79, 88, 97, 106, 115, 124)

  # Take the dates generated above and use these coordinates to read the rainfall amounts into a vector
  rainfall <- as.numeric(substr(rawdata[14 + mday(pagedata$date)],
                                mboundaries[month(pagedata$date)] - 6,
                                mboundaries[month(pagedata$date)]))

  # Bind the vector to the page data to make a tidy data set
  page <- cbind(pagedata, rainfall)
  page
}

raw <- readLines("area1.txt")  # read in all the data

# Get all the page header line numbers
headers <- grep("HIDROLOGIA", raw)
listOfDataFrames <- vector(mode = "list", length = length(headers))

# Page by page, store each parsed page in the list
for (i in seq_along(headers)) {
  start <- headers[i]
  end <- start + 55  # a page is 56 lines
  listOfDataFrames[[i]] <- raw2page(raw[start:end])
}

library("plyr")  # rbind.fill()
output <- rbind.fill(listOfDataFrames)
print(summary(output))
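The loop above relies on every page having the same fixed length. If that cannot be guaranteed, one alternative (a sketch, not part of the code above) is to derive page membership from the header marker itself. The seven-line raw vector below is a made-up miniature input, and page_id is a hypothetical helper name.

```r
# Hypothetical miniature input: two "pages", each starting with a header line.
raw <- c(
  "I D E A M - INSTITUTO DE HIDROLOGIA",
  "YEAR 1980",
  "01 30.0",
  "I D E A M - INSTITUTO DE HIDROLOGIA",
  "YEAR 1981",
  "01 .0",
  "02 5.0"
)

# TRUE on each header line; cumsum() turns that into a running page number.
is_header <- grepl("HIDROLOGIA", raw, fixed = TRUE)
page_id <- cumsum(is_header)

# split() returns one character vector per page, whatever its length.
pages <- split(raw, page_id)
lengths(pages)
```

Each element of pages can then be handed to a per-page parser such as raw2page(). Any lines that appear before the first header would end up in a group numbered 0 and would need to be dropped first.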