Reading text file with varying column width but fixed delimiter in R
I have multiple .txt files that look like this:
header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox
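The last 2 columns vary in width, but there are always 3 spaces between columns (in this example, column 3 is empty).
I am using this code to read the example .txt:
read.fwf("filename.txt", skip = 5, widths = c(12, 16, 19, 76, 83), fill = TRUE, fileEncoding = "UTF-16")
But this code does not work correctly on this .txt: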
header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox
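Is there a way to read .txt files with a fixed delimiter (3 spaces) without defining the width of each column, since the column widths differ between files?
The files also have some encoding issues, so here is the sample file I am using.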
You can read the file, skipping the header lines, and then use the gsub function to replace each run of 3 spaces with a convenient separator (a pipe is used here). Backslashes are doubled only because they sit inside an R string:
> mytext = "01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox"
> ddf = read.table(text=gsub("   ", "|", mytext), header=FALSE, sep="|")
> ddf
V1 V2 V3 V4 V5 V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
V7
1 Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3 Gray Fox
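Edit: As @r2evans suggested in the comments below, the text must first be trimmed of trailing spaces with gsub(" *$", "", ...). Alternatively, the following function from "How to trim leading and trailing whitespace in R?" can be used:
trim.trailing <- function (x) sub("\\s+$", "", x)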
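To see why the trimming matters, here is a small sketch on two made-up rows (not the real data). The helper name n_fields is hypothetical; it only counts the pipe-separated fields that read.table would see on each row.

```r
# Count "|"-separated fields per row, closing the text connection afterwards
n_fields <- function(rows) {
  con <- textConnection(gsub("   ", "|", rows, fixed = TRUE))
  on.exit(close(con))
  count.fields(con, sep = "|")
}

# Two made-up rows; only the first carries trailing padding
rows <- c("a   b   c         ", "a   b   c")

before <- n_fields(rows)                    # padded row yields extra empty fields
after  <- n_fields(sub("\\s+$", "", rows))  # both rows now report the same count

print(before)
print(after)
```

When the counts differ between rows, read.table (with its default fill = FALSE) stops with a "did not have N elements" error, which is why the trailing spaces have to go first.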
For a text file, the lines can be read with readLines:
> mytext = readLines('testfile.txt')                  # read file text
> mytext = mytext[-c(1:5)]                            # remove first 5 rows ('header')
> mytext = gsub("\\s+$", "", mytext)                  # remove trailing spaces
> mytext = gsub("   ", "|", mytext)                   # change separator (3 spaces to a pipe)
> ddf = read.table(text=mytext, header=FALSE, sep='|')  # read columns from text
> ddf
V1 V2 V3 V4 V5 V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
V7
1 Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3 Gray Fox
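The steps above could be wrapped into one function that also handles the UTF-16 encoding mentioned in the question. This is only a sketch: the name read_3space and its arguments are made up for illustration, and it assumes the data never contains a literal "|".

```r
# Hypothetical helper: read a file whose columns are separated by 3 spaces
read_3space <- function(path, skip = 5, encoding = "UTF-16") {
  con <- file(path, encoding = encoding)
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)

  # Drop the header lines, trailing padding, and any blank lines
  if (skip > 0) lines <- lines[-seq_len(skip)]
  lines <- sub("\\s+$", "", lines)
  lines <- lines[nzchar(lines)]

  # Turn every run of exactly 3 spaces into a single-character separator
  piped <- gsub("   ", "|", lines, fixed = TRUE)
  read.table(text = piped, sep = "|", header = FALSE,
             quote = "", comment.char = "", stringsAsFactors = FALSE)
}
```

A call such as read_3space("testfile.txt") would then return the same seven columns as above; pass encoding = "native.enc" for files that are not UTF-16.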
Alternatively, the lines can first be read into a one-variable data.frame and then manipulated to get the desired result:
> ddf1 = read.table(file='testfile.txt', sep = '\n', skip=5)
> mytext = gsub("\\s+$", "", unlist(ddf1$V1))
> ddf2 = read.table(text=gsub("   ", "|", mytext), header=FALSE, sep='|')
> ddf2
V1 V2 V3 V4 V5 V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
V7
1 Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3 Gray Fox
I don't know of a good tool that looks for multi-char delimiters, and you are not the first to ask about this. Most (including read.table, read.delim, and readr::read_delim) require a single-byte delimiter.
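As a quick illustration of that limitation for base R, the sketch below passes a 3-space separator to read.table on a made-up one-line input and captures the resulting error instead of stopping.

```r
# read.table rejects a separator that is longer than one byte
res <- tryCatch(
  read.table(text = "a   b   c", sep = "   "),
  error = function(e) conditionMessage(e)
)

# res now holds the error message rather than a data.frame
print(res)
```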
One approach, though certainly not efficient for large files, is to load them line-wise and do the splitting yourself.
(Consumable data is at the bottom.)
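x <- readLines(textConnection(file1))
x <- x[x != 'header'] # or x <- x[-(1:5)]
(I'm guessing it is not always the literal word header, so I'm assuming it is either a fixed count or that you can easily "know" which is which.)
spl <- strsplit(x, '   ') # split on exactly 3 spaces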
str(spl)
# List of 3
# $ : chr [1:31] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:20] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:7] "01130009.JPG" "JPEG" "" "" ...
This looks fine, except that in your example there is a lot of blank space on the right ...
spl[[1]]
# [1] "01130009.JPG"
# [2] "JPEG"
# [3] ""
# [4] ""
# [5] "2/5/2018 3:53:44 PM"
# [6] "G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg"
# [7] "Gray Fox"
# [8] ""
# [9] ""
# [10] ""
# [11] ""
# [12] ""
# [13] ""
# [14] ""
# [15] ""
# [16] ""
# [17] ""
# [18] ""
# [19] ""
# [20] ""
# [21] ""
# [22] ""
# [23] ""
# [24] ""
# [25] ""
# [26] ""
# [27] ""
# [28] ""
# [29] ""
# [30] ""
# [31] ""
So if you know how many columns there are, you can easily drop the extras:
spl <- lapply(spl, `[`, 1:7)
Then check the output:
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
# 2 G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
# 3 G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox
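If the column count is not known up front, one possible variation is to trim the trailing padding before splitting and then pad every row to the longest one. This is a sketch only; split_rows is a made-up name, and it assumes trailing whitespace never carries data.

```r
# Hypothetical helper: split rows on a 3-space delimiter without
# hard-coding the number of columns
split_rows <- function(x, delim = "   ") {
  x <- sub("\\s+$", "", x)                      # drop trailing padding first
  parts <- strsplit(x, delim, fixed = TRUE)
  n <- max(lengths(parts))                      # widest row sets the column count
  parts <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))
  as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
}

# Made-up rows in the same shape as the question's data
demo <- c("a   b         c   d      ", "a   b         c   dd")
out <- split_rows(demo)
print(out)
```

Rows that end up shorter than the widest one are filled with NA on the right rather than being recycled.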
This works just as well with your second example:
x <- readLines(textConnection(file2))
x <- x[x != 'header'] # or x <- x[-(1:5)]
spl <- lapply(strsplit(x, '   '), `[`, 1:7)
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB
# 2 G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB
# 3 G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox
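Since the question mentions multiple .txt files, the same line-wise approach could be applied to each file and the results stacked. The sketch below is an assumption-laden illustration: read_one and read_many are made-up names, the header is taken to be a fixed 5 lines, and every file is taken to have 7 columns.

```r
# Hypothetical helper: read one file line-wise and split on 3 spaces
read_one <- function(path, skip = 5, ncols = 7) {
  x <- readLines(path, warn = FALSE)
  if (skip > 0) x <- x[-seq_len(skip)]
  spl <- lapply(strsplit(x, "   ", fixed = TRUE), `[`, seq_len(ncols))
  out <- as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
  out$source <- basename(path)   # remember which file each row came from
  out
}

# Hypothetical helper: stack the results for several files
read_many <- function(paths, ...) {
  do.call(rbind, lapply(paths, read_one, ...))
}
```

For example, read_many(list.files(pattern = "\\.txt$")) would combine every .txt file in the working directory; files in UTF-16 would additionally need to be opened through file(path, encoding = "UTF-16").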
Consumable data:
# note: replaced single '\' with double '\\' for R string-handling only
# note: the long runs of trailing spaces on the first two rows are not reproduced here
file1 <- 'header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox'
file2 <- 'header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox'