Reading a text file in R and storing what is read, conditioned on the next line

I have a .txt file formatted as follows:

--------------------------------------------------------------------------------------------------------------
m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

--------------------------------------------------------------------------------------------------------------
m5a3                                                    A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (int)
                 label:  NUMERIC, but 44 nonmissing values are not labeled

                 range:  [-9,120]                     units:  1
         unique values:  47                       missing .:  0/4898

              examples:  -9    -9 Not in wave
                         -6    -6 Skip
                         -6    -6 Skip
                         -6    -6 Skip

--------------------------------------------------------------------------------------------------------------

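What matters to me is the code name m5a2, the description "A2. Confirm how much time child lives with respondent", and finally the values of the response: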

tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

I need to read these three items into a list for further processing.

I tried the following, which works for retrieving the code name and the description.

fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
L = list()
for (i in 1:length(linn)){
  if((linn[i]=="--------------------------------------------------------------------------------------------------------------") & (linn[i+1]!=""))
  {
    L[i] = linn[i+1]
  }

  else
  {
    # read until hit the next dashed line
  }
}
close(conn)

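There are a few things I am confused about:
1. I don't know how to make it read lines until it hits the next dashed line.
2. Is my way of storing the read data in a list correct if I want to be able to visualize, search, and easily retrieve the data?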

Thanks.

How about this?

df <- read.table("file.txt", 
             header = FALSE)
df

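This is going to be a bit problematic because the format is so irregular from item to item. Here is a run at the first item's codebook text: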

txt <- "m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends
"
Lines <- readLines( textConnection(txt))
 # isolate lines with letter in first column
 Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:

scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
     sep=",", what="")
#----
Read 2 items
[1] "m5a2"                                                 
[2] "A2. Confirm how much time child lives with respondent"
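
As an alternative to the sub() plus scan() combination, the header line can also be split directly on a run of two or more spaces. This is only a minimal sketch: `parse_header` is a hypothetical helper name, and it assumes the code name never contains two consecutive spaces.

```r
# Hypothetical helper: split a header line into code name and description
# at the first run of two or more spaces.
parse_header <- function(line) {
  parts <- strsplit(trimws(line), " {2,}")[[1]]
  list(name = parts[1],
       description = paste(parts[-1], collapse = " "))
}

header <- "m5a2                                                     A2. Confirm how much time child lives with respondent"
hdr <- parse_header(header)
hdr$name
hdr$description
```

Returning a named list keeps the two pieces addressable by name rather than by position.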

The "tabulation" line can be used to create the column labels.

colnames <- scan(text=sub(".*tabulation[:]", "",
                     Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items

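The comma-substitution strategy needs to be a bit more involved for the later lines. First isolate the rows where a digit is the first non-space character: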

dataRows <- Lines[grep("^[ ]*\\d", Lines)]

Then substitute a comma for the pattern "digit followed by two or more spaces" and read the result with read.csv:

 myDat <- read.csv(text=  
                      gsub("(\\d)[ ]{2,}", "\\1,", dataRows ), 
                   header=FALSE ,col.names=colnames)

#------------
 myDat
  Freq. Numeric                     Label
1  1383      -9            -9 Not in wave
2     4      -2             -2 Don't know
3     2      -1                 -1 Refuse
4  3272       1 1 all or most of the time
5    29       2  2 about half of the time
6    76       3        3 some of the time
7    80       4        4 none of the time
8    52       7        7 only on weekends
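
The same rows could also be pulled apart with capture groups instead of comma substitution. The following is a rough sketch under the assumption that every data row has the shape "frequency, numeric code, label"; the variable names (`rows`, `pattern`, `tab`) are made up for illustration, and dropping the repeated code from the label is an optional extra step, not something the codebook requires.

```r
# Three sample rows copied from the tabulation block above.
rows <- c(
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know",
  "                          3272         1  1 all or most of the time"
)

# Capture groups: frequency, numeric code, and the remaining label text.
pattern <- "^[[:space:]]*([0-9]+)[[:space:]]+(-?[0-9]+)[[:space:]]+(.*)$"
m <- regmatches(rows, regexec(pattern, rows))
fields <- do.call(rbind, m)   # column 1 holds the full match

tab <- data.frame(
  Freq    = as.integer(fields[, 2]),
  Numeric = as.integer(fields[, 3]),
  Label   = fields[, 4],
  stringsAsFactors = FALSE
)

# Optional: drop the numeric code that is repeated at the start of each label.
tab$Label <- sub("^-?[0-9]+[[:space:]]+", "", tab$Label)
tab
```

This version does not depend on how many spaces separate the columns, only on there being at least one.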

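If the Lines object were the entire file, for instance the one at the URL read in below, then you could loop over multiple items using a counter generated from cumsum(grepl("^-------", Lines)):
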
 Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
  input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
  input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
  input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
  input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
  input string 7353 is invalid in this locale

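My "hand-held scan()-er" suggests to me that there are only two types of codebook records: "tabulations" (presumably items with fewer than 10 instances) and "examples" (items with more instances). They have different structures (as seen in the codebook snippet above), so it may only be necessary to build and deploy two types of parsing logic. So I think the tools for doing this are illustrated above.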
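
To make the cumsum() counter idea concrete, here is a small sketch that groups lines into one block per item and stores the results in a named list. It assumes that every item is bracketed by exactly two dashed rules (one above and one below its header line) and that nothing precedes the first rule; the sample lines are shortened, and `is_rule`, `item_id`, `blocks`, and `items` are illustrative names only.

```r
rule <- strrep("-", 20)   # shortened stand-in for the long dashed line
Lines <- c(
  rule,
  "m5a2          A2. Confirm how much time child lives with respondent",
  rule,
  "",
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know",
  "",
  rule,
  "m5a3          A3. Number of months ago child stopped living with you",
  rule,
  "",
  "              examples:  -9    -9 Not in wave",
  "                         -6    -6 Skip",
  "",
  rule
)

is_rule <- grepl("^-------", Lines)
# Two rules per item, so halving the running count gives one id per item.
item_id <- ceiling(cumsum(is_rule) / 2)
blocks  <- split(Lines[!is_rule], item_id[!is_rule])

items <- lapply(blocks, function(b) {
  hdr <- strsplit(trimws(b[1]), " {2,}")[[1]]
  list(
    name        = hdr[1],
    description = hdr[2],
    kind        = if (any(grepl("tabulation:", b))) "tabulation" else "examples",
    # Rows whose first non-space character is a digit (tabulation rows only).
    rows        = grep("^[ ]*[0-9]", b, value = TRUE)
  )
})
names(items) <- unname(vapply(items, function(x) x$name, character(1)))

items[["m5a2"]]$description
```

Naming the list by code name lets a record be looked up directly, for example items[["m5a3"]], and the `kind` field indicates which of the two parsing routines a block would need.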

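All of the warnings relate to the character "\x92" being used as an apostrophe. Regex and R share the escape character "\", so you need to escape the escape. The lines can be corrected with: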

Lines <- gsub("\\x92", "'", Lines )
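
If the regex route still complains about strings that are invalid in the current locale, a byte-wise literal replacement is another option to try. This is a sketch on a made-up sample string, assuming the offending byte really is 0x92; `x` and `cleaned` are illustrative names.

```r
# Sample string containing the raw byte 0x92 where an apostrophe belongs.
x <- "Don\x92t know"

# fixed = TRUE treats the pattern as a literal string; useBytes = TRUE
# makes the comparison byte by byte, so the string is never decoded.
cleaned <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
cleaned
```

The same call should work unchanged on the whole Lines vector, since gsub() is vectorized over its input.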