Reading a binary map file in R
I'm trying to read a binary file in R that contains a simple 2D array of 360x180 values. For reference, the binary file can be found here:
http://transcom.project.asu.edu/download/transcom03/smoothmap.fix.2.bin
Here are the contents of the readme for this .bin:
The file 'smoothmap.fix.2.bin' contains a single real, binary array
dimensioned 360 x 180. The array contains the numbers 1 through 22,
denoting each of the 22 basis functions in the TransCom 3 experiment.
This file was written on an SGI Origin 2000 hosting UNIX.
My code:
to.read <- file("smoothmap.fix.2.bin", "rb")
raw.transcom <- readBin(to.read, integer(), n = 360*180, size = 4, endian = "big")
transcom <- matrix(raw.transcom, 180, 360, byrow = F)
Now raw.transcom contains only garbage values:
unique(raw.transcom)
[1] 259200 0 1101004800 1082130432 1092616192 1097859072 1100480512 1102053376 1086324736
[10] 1077936128 1101529088 1095761920 1096810496 1099956224 1091567616 1084227584 1090519040 1094713344
[19] 1099431936 1073741824 1093664768 1088421888 1065353216 1098907648
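Why is this happening?
I've been staring at this for an hour and I'm stumped. I tried changing the endian setting and the 'size' argument of readBin, but that didn't help.
How can I read this file in correctly?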
Well, I haven't had time to work out the "R" way yet, but I do have access to GDL and found this, so I cobbled together:
Data = read_binary('smoothmap.fix.2.bin',DATA_TYPE=4,ENDIAN='big');
Data = Data[1:64800]
Data = reform(Data,[360,180])
openw,unit,'testfile.dat',/get_lun
printf,unit,Data
free_lun,unit
which successfully produced: http://rud.is/dl/testfile.dat.gz
If you grab that and then do:
x <- as.numeric(scan("testfile.dat.gz", "numeric"))
length(x)
## [1] 64800
table(x)
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 7951 1643 1189 796 868 1688 864 2345 2487 509 733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950
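That looks like it gives the right values for the definition you specified, and you can turn it into a matrix.
Check back, though, since I now need to figure out how to do this in R :-)
UPDATE
Got it!
I'm glad I found the IDL code to validate the R result.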
x <- readBin("smoothmap.fix.2.bin", "raw", file.size("smoothmap.fix.2.bin"))
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]
table(readBin(rawConnection(x), "numeric", 360*180, 4, endian="big"))
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 7951 1643 1189 796 868 1688 864 2345 2487 509 733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950
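Ideally, we'd check that the first 4 bytes and the last 4 bytes are equal, but this hack should get you through.
Putting it all together
Added validation bits to the code…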
#' Read in a binary array, likely written with IDL
#'
#' @param x path to file (auto-expanded & tested for existence)
#' @param n number of `float` elements to read in
#' @param endian endian-ness (default `big`)
#' @return numeric vector of length `n`
read_binary_float <- function(x, n, endian="big") {
x <- normalizePath(path.expand(x))
x <- readBin(con = x, what = "raw", n = file.size(x))
first4 <- x[1:4] # extract front bits
last4 <- x[(length(x)-3):length(x)] # extract back bits
# convert both to long ints
f4c <- rawConnection(first4)
on.exit(close(f4c), add=TRUE)
f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)
l4c <- rawConnection(last4)
on.exit(close(l4c), add=TRUE)
l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)
# validation
stopifnot(f4 == l4) # check front/back are equal
stopifnot(f4 == n*4) # check if `n` matches expected record count
# strip off front and back bits
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]
# slurp it all in
rc <- rawConnection(x)
on.exit(close(rc), add=TRUE)
readBin(con = rc, what = "numeric", n = n, size = 4L, endian=endian)
}
Quick example:
library(magrittr)
read_binary_float("smoothmap.fix.2.bin", 360*180) %>%
matrix(nrow = 360, ncol = 180) %>%
image()
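A note on the matrix dimensions used above, as a minimal sketch: matrix() fills column by column, so the first index varies fastest. Assuming the file was written by Fortran with the array dimensioned 360 x 180 (an inference from the readme, not something it states outright), that matches nrow = 360, ncol = 180. The toy 3 x 2 array below only stands in for the real map.

```r
# Toy 3 x 2 array standing in for the 360 x 180 map.
v <- 1:6
m <- matrix(v, nrow = 3, ncol = 2)

# matrix() fills down the first column before moving to the second,
# so the first index varies fastest.
m[2, 1]  # 2: the second element of the vector
m[1, 2]  # 4: the second column starts after the 3 rows of the first
```

If the picture from image() looks transposed or flipped, that is a plotting-orientation matter rather than a read error.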
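This file appears to conform to the Fortran "unformatted I/O" spec: https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc4/index.html, which confirms the
"# records" | record | record | … | record | "# records"
layout that we're seeing. (In this file the leading and trailing marker is 259200, i.e. 360*180*4, the length of the data in bytes.) So the function can be generalized to support more than just float conversions: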
read_binary_array <- function(x, type=c("byte", "integer", "float"), endian="big") {
type <- match.arg(trimws(tolower(type)), c("byte", "integer", "float"))
type_size <- unname(c("byte"=1, "integer"=4, "float"=4)[type])
x <- normalizePath(path.expand(x))
x <- readBin(con = x, what = "raw", n = file.size(x))
first4 <- x[1:4]
last4 <- x[(length(x)-3):length(x)]
f4c <- rawConnection(first4)
on.exit(close(f4c), add=TRUE)
f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)
l4c <- rawConnection(last4)
on.exit(close(l4c), add=TRUE)
l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)
stopifnot(f4 == l4) # check front/back are equal
stopifnot((f4 %% type_size == 0)) # shld have nothing left over
n_rec <- f4 / type_size
message(sprintf("Reading in %s records...", scales::comma(n_rec)))
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]
rc <- rawConnection(x)
on.exit(close(rc), add=TRUE)
what <- switch(type, byte="raw", integer="integer", float="numeric")
dat <- readBin(con = rc, what = what, n = n_rec, size = type_size, endian=endian)
dat
}
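To try the idea without downloading the real file, here is a minimal round-trip sketch. It builds a tiny record with the same marker | payload | marker layout in a temp file and reads it back. The values and the temp file are made up for illustration, and it assumes a single record with 4-byte big-endian markers, as in the file above.

```r
# Build a tiny Fortran-style unformatted record:
# 4-byte length marker | payload | 4-byte length marker
vals    <- c(0, 1, 2, 22)   # made-up stand-ins for map values
payload <- writeBin(vals, raw(), size = 4L, endian = "big")            # 4-byte floats
marker  <- writeBin(length(payload), raw(), size = 4L, endian = "big") # byte count

tmp <- tempfile(fileext = ".bin")
writeBin(c(marker, payload, marker), tmp)

# Read it back: check both markers, then decode what lies between them.
bytes   <- readBin(tmp, "raw", n = file.size(tmp))
n_bytes <- readBin(bytes[1:4], "integer", n = 1, size = 4L, endian = "big")
tail_marker <- readBin(bytes[(length(bytes) - 3):length(bytes)],
                       "integer", n = 1, size = 4L, endian = "big")
stopifnot(n_bytes == tail_marker)

body <- bytes[5:(4 + n_bytes)]
out  <- readBin(body, "numeric", n = n_bytes / 4, size = 4L, endian = "big")
unlink(tmp)
out
```

readBin() accepts a raw vector directly as its con argument, so the rawConnection() bookkeeping in the functions above is optional.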
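This answer is incomplete; I'm posting it to show progress.
There may be an undocumented "feature" in the data file, in that eight of its bytes are not part of the data. (The file is 259208 bytes, but 360*180*4 == 259200.) However, I did find something else interesting: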
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180, size = 4, endian = "big")
head(d)
# [1] 259200 0 0 0 0 0
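I infer that the first 4-byte integer (259200) gives the size of the data, so I suggest we can discard it. You might argue that your vector has the right length, but that is only because you forced readBin to stop loading data. From ?readBin: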
n: integer. The (maximal) number of records to be read. You
can use an over-estimate here, but not too large as storage
is reserved for 'n' items.
So it should be safe to ask for more than you expect the file to hold; readBin handles EOF on its own. I arbitrarily added 10:
length(d)
# [1] 64800
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180+10, size = 4, endian = "big")
length(d)
# [1] 64802
tail(d)
# [1] 1098907648 1098907648 1098907648 1098907648 1098907648 259200
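(Note that although I asked for 10 more values, only two were available. As far as I can tell, n is just an upper bound used to preallocate storage, nothing more.) 259200 shows up again, which I take to mark the end of the data, so we should be able to safely discard both the first and the last number.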
d <- d[-c(1, length(d))]
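The over-estimate behaviour of n is easy to check on a small in-memory example. This is only a sketch with made-up data: three 4-byte integers are written to a raw vector, and ten are requested back.

```r
# Three 4-byte integers, 12 bytes in total.
raw_bytes <- writeBin(1:3, raw(), size = 4L, endian = "big")

# Ask for up to 10 values; readBin stops when the input runs out.
got <- readBin(raw_bytes, "integer", n = 10, size = 4L, endian = "big")
length(got)  # 3
```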
The first nonzero numbers are:
head(which(d>0))
# [1] 4321 4322 4323 4324 4325 4326
d[4321]
# [1] 1101004800
And looking at the bits:
intToBits(d[4321])
# [1] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 01
# [26] 00 00 00 00 00 01 00
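So, if you infer a direct binary interpretation, the value is 2820, which does not match the description of the available values in smoothmap.readme. Besides, what we expected to see is: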
intToBits(22)
# [1] 00 01 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
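It looks like your bits are ... in the wrong order, or something like that. If you run intToBits on all the unique values, you'll notice that bits 1-19 (the least significant ones) are all zero.
From here, I'm at a loss...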
sapply(unique(d), function(a) packBits(rev(intToBits(a)), type="integer"))
# [1] 0 1410 258 1154 3714 6530 3458 770 514 5506 2690 1666 2434 2178 1282 130 642 4482 2 3202 1794 508 386
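One possible way past this dead end, in line with the readme's "single real" wording and with the other answer: the odd integers may simply be IEEE 754 single-precision bit patterns read as integers. The sketch below takes four of the "garbage" values from the question, writes them back out as 4-byte big-endian integers, and re-reads the same bytes as 4-byte floats.

```r
# Four of the "garbage" integers from unique(raw.transcom).
ints <- c(1065353216L, 1073741824L, 1101004800L, 1098907648L)

# Same bytes, different interpretation: integer -> raw -> 4-byte float.
as_bytes <- writeBin(ints, raw(), size = 4L, endian = "big")
floats   <- readBin(as_bytes, "numeric", n = length(ints),
                    size = 4L, endian = "big")
floats
# [1]  1  2 20 16
```

That would also explain why the low bits are all zero: small whole numbers stored as floats use only the top few mantissa bits.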