How to decode a PostgreSQL bytea column (hex format) to int16/uint16 in R?
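I have some image data stored as bytea in a column of a PostgreSQL table. I also have metadata describing how to interpret the data; the relevant parts are the image dimensions and class. The classes include int16 and uint16. I can't find any information on correctly interpreting signed/unsigned integers in R.
I'm using RPostgreSQL to pull the data into R, and I'd like to view the image in R.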
MWE:
# fakeDataQuery <- dbGetQuery(conn,
# 'select byteArray, ImageSize, ImageClass from table where id = 1')
# Example 1 (no negative numbers)
# the actual byte array shown in octal sequences in pgadmin (1.22.2) Query Output is:
# "\001\000\002\000\003\000\004\000\005\000\006\000\007\000\010\000\011\000"
# but RPostgreSQL returns the hex-encoded version:
byteArray <- "\\x010002000300040005000600070008000900"
ImageSize <- c(3, 3, 1)
ImageClass <- 'int16'
# expected result
> array(c(1,2,3,4,5,6,7,8,9), dim=c(3,3,1))
# , , 1
#
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
# Example 2: (with negative numbers)
byteArray <- "\\xffff00000100020003000400050006000700080009000a00"
ImageSize <- c(3, 4, 1)
ImageClass <- 'int16'
# expectedResult
> array(c(-1,0,1,2,3,4,5,6,7,8,9,10), dim=c(3,4,1))
#, , 1
#
# [,1] [,2] [,3] [,4]
#[1,] -1 2 5 8
#[2,] 0 3 6 9
#[3,] 1 4 7 10
What I've tried:
The bytea data from PostgreSQL is one long string encoded as "hex", which you can tell from the \x at the front (I believe there's an extra \ escaping the existing one?): https://www.postgresql.org/docs/9.1/static/datatype-binary.html (see Section 8.4.1, 'bytea Hex format')
Decoding the 'hex' back to the original type ('int16', based on ImageClass)
Per the same URL above, hex encoding uses '2 hexadecimal digits per byte'. So I need to split the encoded byteArray into substrings of the appropriate length, see: this link
# remove the \x hex encoding indicator added by PostgreSQL
byteArray <- gsub("\\x", "", x = byteArray, fixed = TRUE)
l <- 2 # hex digits per byte (substring length)
# insert a space after every l characters, then split on the spaces
byteArray <- strsplit(trimws(gsub(pattern = paste0("(.{", l, "})"),
                                  replacement = "\\1 ",
                                  x = byteArray)),
                      " ")[[1]]
# for some reason these appear to be in the opposite order from what I expect
# Ex: 1 is stored as '0100' rather than '0001'
# so swap each pair of bytes (int16 specific)
byteArray <- paste0(byteArray[c(FALSE, TRUE)], byteArray[c(TRUE, FALSE)])
# strtoi() converts a vector of hex strings to integers, given base 16
byteArray <- strtoi(byteArray, 16L)
# now make it into an n x m x s array,
# e.g., 512 x 512 x (# slices)
V <- array(byteArray, dim = ImageSize)
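This solution has two problems:
- It doesn't work for signed types, so negative integer values are interpreted as unsigned values (e.g., 'ffff' is -1 as int16 but 65535 as uint16, and strtoi() always returns 65535).
- It's currently coded only for int16 and needs extra code to work with other types (e.g., int32, int64).
Does anyone have a solution that works for signed types?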
You can start from this conversion function, swap in a faster strsplit, and use readBin on the result:
byteArray <- "\\xffff00000100020003000400050006000700080009000a00"
## Split a long string into a vector of character pairs
Rcpp::cppFunction( code = '
CharacterVector strsplit2(const std::string& hex) {
unsigned int length = hex.length()/2;
CharacterVector res(length);
for (unsigned int i = 0; i < length; ++i) {
res(i) = hex.substr(2*i, 2);
}
return res;
}')
## A function to convert one string to an array of raw
f <- function(x) {
## Split a long string into a vector of character pairs
x <- strsplit2(x)
## Remove the first element, "\x"
x <- x[-1]
## Complete the conversion
as.raw(as.hexmode(x))
}
raw <- f(byteArray)
# int16
readBin(con = raw,
what = "integer",
n = length(raw) / 2,
size = 2,
signed = TRUE,
endian = "little")
# -1 0 1 2 3 4 5 6 7 8 9 10
# uint16
readBin(con = raw,
what = "integer",
n = length(raw) / 2,
size = 2,
signed = FALSE,
endian = "little")
# 65535 0 1 2 3 4 5 6 7 8 9 10
# int32
readBin(con = raw,
what = "integer",
n = length(raw) / 4,
size = 4,
signed = TRUE,
endian = "little")
# 65535 131073 262147 393221 524295 655369
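To go from the decoded vector to the array the question asks for, the pieces above could be wrapped up roughly as follows. This is only a sketch: `decode_bytea` and its argument names are hypothetical, it assumes little-endian data as in the examples, and it splits the hex string with base R `substring()` rather than the Rcpp helper, which may be slower on large images.

```r
# Hypothetical helper (not part of the original answer): decode a hex-format
# bytea string into an R array, using base R only.
decode_bytea <- function(hex, image_class, image_size) {
  # Element size in bytes and signedness per class; only classes that
  # readBin() can return directly as R integers are listed.
  spec <- switch(image_class,
    int8   = list(size = 1L, signed = TRUE),
    uint8  = list(size = 1L, signed = FALSE),
    int16  = list(size = 2L, signed = TRUE),
    uint16 = list(size = 2L, signed = FALSE),
    int32  = list(size = 4L, signed = TRUE),
    stop("unsupported class: ", image_class)
  )
  # Drop the leading backslash-x marker, then cut into 2-digit byte strings.
  hex <- sub("\\x", "", hex, fixed = TRUE)
  starts <- seq(1L, nchar(hex), by = 2L)
  bytes <- as.raw(strtoi(substring(hex, starts, starts + 1L), base = 16L))
  # Assumes little-endian byte order, as in the examples above.
  values <- readBin(bytes, what = "integer", n = length(bytes) / spec$size,
                    size = spec$size, signed = spec$signed, endian = "little")
  array(values, dim = image_size)
}

V <- decode_bytea("\\xffff00000100020003000400050006000700080009000a00",
                  "int16", c(3, 4, 1))
V[, , 1]
```

With the second example from the question, `V[, , 1]` should be the 3 x 4 matrix whose first column is -1, 0, 1.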
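However, this doesn't work for uint32 and (u)int64, because R uses int32 internally. R can, however, use numerics (doubles) to store integers exactly up to 2^53. So for uint32 we can use this: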
# uint32
byteArray <- "\\xffffffff0100020003000400050006000700080009000a00"
raw <- f(byteArray)
int32 <- readBin(con = raw,
                 what = "integer",
                 n = length(raw) / 4,
                 size = 4,
                 signed = TRUE,
                 endian = "little")
# negative int32 values are uint32 values above 2^31 - 1; shift them back up
ifelse(int32 < 0, int32 + 2^32, int32)
# 4294967295 131073 262147 393221 524295 655369
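One caveat with the signed-read-then-shift approach: as far as I know, R uses the int32 bit pattern 0x80000000 for `NA_integer_`, so a uint32 value of exactly 2^31 would come back as `NA`. A possible workaround, sketched below, is to read unsigned 16-bit halves and combine them as doubles. The helper name `read_uint32_le` is hypothetical and the sketch assumes little-endian data.

```r
# Hypothetical helper: read little-endian uint32 values from a raw vector
# by combining unsigned 16-bit halves; the result is a double vector.
read_uint32_le <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) %% 4 == 0)
  halves <- readBin(bytes, what = "integer", n = length(bytes) / 2,
                    size = 2, signed = FALSE, endian = "little")
  lo <- halves[c(TRUE, FALSE)]   # low 16 bits come first in little-endian
  hi <- halves[c(FALSE, TRUE)]   # high 16 bits follow
  lo + hi * 65536
}

bytes <- as.raw(c(0xff, 0xff, 0xff, 0xff,   # 4294967295
                  0x00, 0x00, 0x00, 0x80,   # 2147483648 (2^31)
                  0x01, 0x00, 0x02, 0x00))  # 131073
read_uint32_le(bytes)
```

The same idea extends to uint64 in principle, but doubles only hold integers exactly up to 2^53, so larger values would lose precision.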
And for gzip-compressed data:
# gzip
byteArray <- "\\x1f8b080000000000000005c1870100200800209a56faffbd41d30dd3b285e37a52f9d033018818000000"
con <- gzcon(rawConnection(f(byteArray)))
readBin(con = con,
what = "integer",
n = length(raw) / 2,
size = 2,
signed = TRUE,
endian = "little")
close(con = con)
Since this is a real connection, we have to make sure to close it.
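One way to make sure the connection is always closed, even if `readBin` fails, is to do the read inside a function and register the `close()` with `on.exit()`. The sketch below is an assumption-laden illustration: `read_gz_int16` is a hypothetical name, and the test payload is produced by writing twelve int16 values through `gzfile()` to a temporary file rather than taken from a database.

```r
# Hypothetical helper: decompress gzip bytes held in a raw vector and read
# up to n little-endian int16 values, closing the connection on exit.
read_gz_int16 <- function(bytes, n) {
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))  # runs even if readBin() throws an error
  readBin(con, what = "integer", n = n, size = 2,
          signed = TRUE, endian = "little")
}

# Build a small gzip payload for illustration: 12 int16 values.
tmp <- tempfile()
out <- gzfile(tmp, "wb")
writeBin(c(-1L, 0:10), out, size = 2, endian = "little")
close(out)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

read_gz_int16(gz_bytes, 12)
```

Here `n` is the maximum number of values to read; with the metadata from the question it could be taken as `prod(ImageSize)`.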