Read an irregular text data file into R

Question

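I am trying to import data from a text file that is not shaped like a data frame. It contains multiple precipitation-rate reports. All the reports have the same layout; one example follows: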

  I D E A M  -  INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
                                                                                                          INFORMATION SYSTEM
                                  PRECIPITATION TOTAL VALUES (mms)                              NATIONAL ENVIRONMENTAL 

DATE OF PROCESS :  2015/09/15                    YEAR  1980                              STATION ID : 11010010  VUELTA LA

LAT    0527 N               TIPO EST    PM                   STATE      CHOCO                   INSTALLATION DATE   1943-ENE
LON   7632 W               ENT     01  IDEAM            CITY  LLORO                   FECHA-SUSPENSION
ELE   100 m.s.n.m         REGIONAL    01  ANTIOQUIA        CORRIENTE  ANDAGUEDA


      DAY       JAN *  FEB *  MAR *  APR *  MAY  *  JUN *  JUL *  AGO *  SEP *  OCT *  NOV *  DEC *


       01                 30.0       .0       .0      3.0     80.0       .0      3.0       .0     35.0     88.0      1.0
       02                   .0      1.0       .0      1.0    100.0       .0       .0      6.0      1.0     65.0     69.0
       03                 35.0    100.0       .0     10.0       .0       .0       .0     70.0     40.0     42.0     16.0
       04                   .0       .0     80.0      3.0    140.0      8.0       .0    135.0     20.0     48.0     15.0
       05                   .0       .0       .0      8.0      3.0     20.0      4.0     19.0     80.0       .0     20.0
       06                   .0       .0    100.0    138.0       .0      6.0       .0      4.0     20.0       .0     10.0
       07                 31.0     10.0       .0     30.0     15.0     50.0      6.0       .0      4.0       .0       .0
       08                   .0     44.0       .0     10.0     40.0       .0       .0       .0      7.0       .0      4.0
       09                 35.0      3.0     23.0       .0     20.0    140.0       .0      6.0       .0     32.0     16.0
       10                   .0     75.0       .0       .0     60.0       .0       .0     23.0      3.0      1.0      5.0
       11                   .0     17.0       .0     15.0     80.0       .0       .0     80.0       .0       .0      3.0
       12                   .0     75.0       .0      8.0       .0     63.0     10.0       .0       .0     17.0     10.0
       13                   .0     20.0       .0     60.0       .0       .0       .0    110.0     50.0      3.0     25.0
       14                 55.0       .0     26.0     12.0       .0      3.0    140.0      4.0     74.0       .0     38.0
       15                   .0       .0      3.0      7.0     10.0       .0      6.0       .0     35.0     12.0     27.0
       16                   .0      4.0     89.0     20.0      3.0       .0       .0     10.0       .0       .0       .0
       17                 45.0       .0      9.0       .0     30.0       .0      2.0       .0     60.0    103.0       .0
       18                 30.0       .0       .0       .0     21.0       .0     20.0     15.0       .0       .0       .0
       19                   .0    130.0       .0     10.0     12.0      8.0       .0      3.0     20.0     49.0     40.0
       20                 45.0       .0     25.0    190.0       .0     38.0      8.0       .0      8.0      3.0      1.0
       21                  1.0       .0     45.0     50.0       .0     35.0       .0      2.0     13.0      1.0      4.0
       22                   .0       .0     20.0       .0       .0       .0       .0     16.0     10.0     12.0     50.0
       23                 40.0       .0     40.0     16.0       .0     30.0       .0     13.0      2.0    106.0     10.0
       24                   .0       .0     45.0     60.0       .0      3.0       .0     25.0       .0     16.0       .0
       25                   .0       .0       .0       .0     18.0     10.0       .0      3.0       .0     50.0     20.0
       26                 10.0       .0       .0       .0      9.0      6.0     20.0     20.0      6.0     15.0      3.0
       27                   .0    135.0     60.0     40.0     80.0     15.0       .0     18.0     10.0     77.0       .0
       28                 10.0       .0      9.0     15.0       .0       .0       .0      6.0     72.0    102.0       .0
       29                 23.0      6.0       .0       .0       .0       .0       .0     23.0       .0     34.0       .0
       30                            .0     10.0       .0     20.0      3.0       .0     64.0     14.0    111.0       .0
       31                            .0              31.0              10.0       .0                .0                .0


                                  ***  ANNUAL VALUES  ***

                                 TOTAL                  6954.0
                                 No DE RAIN DAYS         210
                                 MAX 24 Hrs        190.0

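The text file contains one report after another, all with the same header "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have read the text file with readLines(), and I would like to create a data frame containing the information from every report, like this: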

DATE        STATION_ID  LAT    LON    ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010    0527 N 7632 W 100 LLORO CHOCO 0

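I have been trying to split out each report and then parse it line by line. Unfortunately, this is a slow process. I know this site looks for narrowly scoped questions, but I am somewhat stuck.

Thanks in advance.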

Answer 1

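Here is one approach.

Use readLines() to read in a whole page, 56 lines.
Identify the header information (latitude, longitude, elevation, city, state, and year) by knowing each field's line number and its position within the line. Use substr().
With the year obtained there, write out all the dates of that year. cbind the header information onto them.
Use a function that takes the day of the month and the month number and locates the corresponding precipitation amount on the page. The line number is 14 + dayOfMonth, and the horizontal offset can come from a vector of 12 numbers, one per month. Add that column to your page.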
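The lookup described in these steps can be sketched on a tiny synthetic page. The layout below (year in columns 11-14 of line 5, a 5-character day label followed by 7-character month fields, only two months and two days) is invented for illustration and does not match the real report offsets; `get_rain` and `month_end` are hypothetical names. Only base R is used.

```r
# Build a synthetic page; every position here is an assumption of this sketch.
page <- character(16)
page[1] <- "HEADER"
page[5] <- sprintf("%-10s%4d", "YEAR", 1980L)        # year sits in columns 11-14
# Data rows start on line 15, so day d is found on line 14 + d.
page[15] <- paste0(sprintf("%5s", "01"),
                   paste0(sprintf("%7.1f", c(30, 0)), collapse = ""))
page[16] <- paste0(sprintf("%5s", "02"),
                   paste0(sprintf("%7.1f", c(1.5, 12)), collapse = ""))

# Header field: known line number plus known start and end columns.
year <- as.integer(substr(page[5], 11, 14))

# Last column of each month field (two months in this toy layout).
month_end <- 5 + 7 * (1:2)

# Vectorised lookup: the row comes from the day, the columns from the month.
get_rain <- function(page, day, month) {
  as.numeric(substr(page[14 + day], month_end[month] - 6, month_end[month]))
}

rain <- get_rain(page, day = c(1, 1, 2, 2), month = c(1, 2, 1, 2))
print(year)
print(rain)
```

Because substr() is vectorised over the string, the start, and the end, one call reads a whole year of values once the day and month vectors are known.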

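If you rbind as you go through each page, you end up with one long (!) tidy data set. [edit] If your data set is large, you will also spend an eternity on memory management. Instead, you can build a list of data frames and bind them once at the end. For details, see this question and this question.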
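A minimal sketch of the list-then-bind pattern follows. `parse_page` is a hypothetical stand-in for a real per-page parser, and the values are made up. It uses base R's do.call(rbind, ...), which assumes every data frame has the same columns.

```r
# Hypothetical per-page parser: returns one small data frame per page.
parse_page <- function(i) {
  data.frame(page = i, day = 1:3, rainfall = c(0, 1.5, 30) * i)
}

n_pages <- 4
pages <- vector(mode = "list", length = n_pages)   # preallocate the list

for (i in seq_len(n_pages)) {
  pages[[i]] <- parse_page(i)    # store each result; nothing is re-copied
}

output <- do.call(rbind, pages)  # bind once, at the end
print(dim(output))
```

Calling rbind inside the loop would copy the accumulated data frame on every iteration, which is where the time goes with many pages.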

Here is some code I came up with; you can test it on a short excerpt first.

library("lubridate")  # provides mday() and month()

raw2page <- function(rawdata) {
  # Takes a character vector holding one page of data; returns a tidy data frame.

  # Template for the page header: line number, first column, last column
  yearbound      <- c(5, 60, 63)
  stationbound   <- c(5, 105, 112)
  latbound       <- c(7, 16, 19)
  longbound      <- c(8, 16, 19)
  deptobound     <- c(7, 81, 101)
  municipiobound <- c(8, 81, 101)

  framebounds <- rbind(yearbound, stationbound, latbound, longbound,
                       deptobound, municipiobound)
  colnames(framebounds) <- c("line", "start", "end")
  framebounds <- as.data.frame(framebounds)

  # Cut each header field out of its line; the result is a one-row data frame
  framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line], start, end))),
                             stringsAsFactors = FALSE)
  colnames(framedata) <- c("year", "station", "latitude", "longitude", "depto", "municipio")
  # Strip leading and trailing whitespace (note the doubled backslashes in an R string)
  trim <- function(x) gsub("^\\s+|\\s+$", "", x)
  framedata$depto <- trim(framedata$depto)
  framedata$municipio <- trim(framedata$municipio)

  # Make a column listing all dates of the year
  st <- as.Date(paste0(framedata$year, "-01-01"))
  en <- as.Date(paste0(framedata$year, "-12-31"))
  date <- seq(st, en, by = 1)
  pagedata <- cbind(framedata, date)

  # Horizontal offsets for the last digit of each month (the last digit is aligned)
  mboundaries <- c(25, 34, 43, 52, 61, 70, 79, 88, 97, 106, 115, 124)
  # Use the generated dates and these coordinates to read the rainfall amounts
  # into a vector; blank cells become NA
  rainfall <- as.numeric(substr(rawdata[14 + mday(pagedata$date)],
                                mboundaries[month(pagedata$date)] - 6,
                                mboundaries[month(pagedata$date)]))
  # Bind the vector to the page data to make a tidy data set
  page <- cbind(pagedata, rainfall)
  page
}

library("plyr")  # provides rbind.fill()

raw <- readLines("area1.txt")  # read in all the data

# Line number of every page header
headers <- grep("HIDROLOGIA", raw)

listOfDataFrames <- vector(mode = "list", length = length(headers))

# Parse page by page, storing each result in the list
for (i in seq_along(headers)) {
  start <- headers[i]
  end <- start + 55  # a page is 56 lines long
  listOfDataFrames[[i]] <- raw2page(raw[start:end])
}

# Bind all pages once, at the end
output <- rbind.fill(listOfDataFrames)
print(summary(output))
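
The question asks for dates written like 01/JAN/1980. One way to derive such a column from the `date` column is sketched below, assuming a data frame with the column names produced above; the rows are made-up toy values and `DATE` is a name chosen to match the question. It uses base R's `month.abb` constant so the month names do not depend on the session locale.

```r
# Toy rows with made-up values; only the column names follow the code above.
output <- data.frame(
  station  = "11010010",
  date     = as.Date(c("1980-01-01", "1980-02-29", "1980-12-31")),
  rainfall = c(NA, 1, 0),
  stringsAsFactors = FALSE
)

d <- as.POSIXlt(output$date)  # exposes day, month (0-11) and year (since 1900)

# month.abb holds the English abbreviations "Jan", "Feb", ...
output$DATE <- sprintf("%02d/%s/%d",
                       d$mday,
                       toupper(month.abb[d$mon + 1L]),
                       d$year + 1900L)
print(output$DATE)
```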
