Reading a text file in R and storing what is read conditioned on the next line
I have a .txt file in the following format:
--------------------------------------------------------------------------------------------------------------
m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
--------------------------------------------------------------------------------------------------------------
m5a3 A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------
type: numeric (int)
label: NUMERIC, but 44 nonmissing values are not labeled
range: [-9,120] units: 1
unique values: 47 missing .: 0/4898
examples: -9 -9 Not in wave
-6 -6 Skip
-6 -6 Skip
-6 -6 Skip
--------------------------------------------------------------------------------------------------------------
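What matters to me is the code name m5a2, the description A2. Confirm how much time child lives with respondent, and finally the response values: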
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
I need to read these three items into a list for further processing.
I tried the following, which retrieves the code name and the description.
fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName, open = "r")
linn <- readLines(conn)
L <- list()
# stop one line early so that linn[i + 1] always exists
for (i in seq_len(length(linn) - 1)) {
  if (linn[i] == "--------------------------------------------------------------------------------------------------------------" && linn[i + 1] != "") {
    # the line after a dashed rule holds the code name and description
    L[i] <- linn[i + 1]
  } else {
    # read until hit the next dashed line
  }
}
close(conn)
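A few things confuse me:
1. I don't know how to make it keep reading lines until it hits the next dashed line.
2. If I want to be able to browse and easily retrieve the data, is storing what I read in a list the right approach?
Thanks.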
What about this?
df <- read.table("file.txt",
header = FALSE)
df
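That will be a bit of a problem, because the format of each item is highly irregular. Here is a run at the first item of the codebook text: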
# The header line separates code name and description with a long run of spaces,
# and the detail lines are indented; the parsing below relies on both.
txt <- "m5a2                                       A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  BM_101F
                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898
            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends
"
Lines <- readLines( textConnection(txt))
# isolate lines with letter in first column
Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:
scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
sep=",", what="")
#----
Read 2 items
[1] "m5a2"
[2] "A2. Confirm how much time child lives with respondent"
The "tabulation" line can be used to create the column labels.
colnames <- scan(text=sub(".*tabulation[:]", "",
Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items
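The comma-substitution strategy needs to be a bit more involved for the later lines. First isolate the rows where a digit is the first non-space character:
dataRows <- Lines[grep("^[ ]*\\d", Lines)]
Then substitute a comma for the pattern "digit followed by 2+ spaces" and read the result with read.csv:
myDat <- read.csv(text =
    gsub("(\\d)[ ]{2,}", "\\1,", dataRows),
    header = FALSE, col.names = colnames)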
#------------
myDat
  Freq. Numeric                     Label
1  1383      -9            -9 Not in wave
2     4      -2             -2 Don't know
3     2      -1                 -1 Refuse
4  3272       1 1 all or most of the time
5    29       2  2 about half of the time
6    76       3        3 some of the time
7    80       4        4 none of the time
8    52       7        7 only on weekends
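If the Lines object were the entire file, for example the file read from the URL below, you could loop over multiple items using a counter generated from cumsum( grepl("^-------", Lines) ):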
Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
input string 7353 is invalid in this locale
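As a minimal sketch of that counter idea, assuming every record is framed by one dashed rule above and one below its header line: the counter from cumsum() steps up at each rule, so consecutive pairs of counter values belong to the same record. The sample lines, their padding, and the names sample_lines, record_id and records are illustrative and not taken from the codebook file.

```r
# Small inline sample standing in for the full codebook (padding is illustrative).
sample_lines <- c(
  strrep("-", 30),
  "m5a2                A2. Confirm how much time child lives with respondent",
  strrep("-", 30),
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know",
  strrep("-", 30),
  "m5a3                A3. Number of months ago child stopped living with you",
  strrep("-", 30),
  "              examples:  -9    -9 Not in wave",
  "                         -6    -6 Skip",
  strrep("-", 30)
)

is_rule <- grepl("^-------", sample_lines)
# The counter increases by one at every dashed rule. Each record uses two rules
# (above and below its header), so counter values 1-2 map to record 1, 3-4 to record 2.
counter <- cumsum(is_rule)
record_id <- (counter + 1) %/% 2

# Drop the rules themselves, then split the remaining lines by record.
records <- split(sample_lines[!is_rule], record_id[!is_rule])

# The first line of each record is the header; the code name is its first token.
codes <- vapply(records, function(rec) sub("[ ]+.*$", "", rec[1]), character(1))
print(unname(codes))
```

On the real file, blank lines would stay inside each record and could be filtered out before parsing.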
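My "hand-held scan()-er" suggests to me that there are only two types of codebook records: "tabulations" (presumably items with fewer than 10 instances) and "examples" (items with more instances). They have different structures (as seen in the codebook fragment above), so it may only be necessary to build and deploy two types of parsing logic. So I think the tools for doing this are the ones described above.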
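One way the two kinds of parsing logic might be combined, sketched under assumptions: parse_record is a hypothetical helper, it expects the lines of a single record with the header line first and the dashed rules already removed, and the sample padding is illustrative. It returns a list holding the code name, the description and a data frame of values, so the parsed records can be kept in a named list and looked up by code name.

```r
# Hypothetical helper: parse one record (header line first, dashed rules removed).
parse_record <- function(rec) {
  # Header: code name, a run of 2+ spaces, then the description.
  header <- regmatches(rec[1], regexec("^(\\S+)[ ]{2,}(.*)$", rec[1]))[[1]]
  tag <- grep("(tabulation|examples):", rec)
  values <- NULL
  if (length(tag) == 1) {
    body <- rec[tag:length(rec)]
    if (grepl("tabulation:", rec[tag], fixed = TRUE)) {
      # Tabulation rows: frequency, numeric code, label.
      rows <- body[grepl("^[ ]*\\d", body)]
      m <- regmatches(rows, regexec("^[ ]*(\\d+)[ ]+(-?\\d+)[ ]+(.*)$", rows))
      values <- data.frame(
        Freq = as.integer(vapply(m, `[`, character(1), 2)),
        Numeric = as.integer(vapply(m, `[`, character(1), 3)),
        Label = vapply(m, `[`, character(1), 4),
        stringsAsFactors = FALSE
      )
    } else {
      # Example rows: numeric code and (possibly empty) label; the first row
      # shares its line with the "examples:" tag, so strip the tag first.
      rows <- sub("^.*examples:", "", body)
      rows <- rows[grepl("^[ ]*-?\\d", rows)]
      m <- regmatches(rows, regexec("^[ ]*(-?\\d+)[ ]*(.*)$", rows))
      values <- data.frame(
        Numeric = as.integer(vapply(m, `[`, character(1), 2)),
        Label = vapply(m, `[`, character(1), 3),
        stringsAsFactors = FALSE
      )
    }
  }
  list(code = header[2], description = header[3], values = values)
}

rec_tab <- c(
  "m5a2                A2. Confirm how much time child lives with respondent",
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know"
)
rec_ex <- c(
  "m5a3                A3. Number of months ago child stopped living with you",
  "              examples:  -9    -9 Not in wave",
  "                         -6    -6 Skip"
)

parsed <- lapply(list(rec_tab, rec_ex), parse_record)
names(parsed) <- vapply(parsed, `[[`, character(1), "code")
str(parsed$m5a2)
```

Keeping the result as a named list means an item can be retrieved with parsed$m5a2 or parsed[["m5a3"]], which addresses the question about list storage.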
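All of the warnings relate to the character "\x92" being used as an apostrophe. Regular expressions and R share the escape character "\", so you need to escape the escape. They can be corrected with: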
Lines <- gsub("\\x92", "'", Lines )
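If the gsub() call above still complains about invalid input in a multibyte locale, a byte-wise replacement is one possible workaround. This sketch assumes the offending byte is 0x92 (the Windows-1252 right single quotation mark); raw_line is a made-up stand-in for one line of the file.

```r
# Stand-in for a line read from the file: "\x92" is the single byte 0x92.
raw_line <- "Don\x92t know"

# fixed = TRUE treats the pattern as a literal string, and useBytes = TRUE
# compares byte by byte, so the input does not have to be valid in the locale.
fixed_line <- gsub("\x92", "'", raw_line, fixed = TRUE, useBytes = TRUE)
print(fixed_line)
```

Re-encoding the whole vector with iconv() would be another route, assuming the file really is Windows-1252.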