Java / Postgres / Mybatis - invalid byte sequence for encoding "UTF8": 0xe3 0xa1 0x54

Question

On our application server, the DevOps team uses the SQL_ASCII encoding in the Postgres (9.4) database.

A third-party application is inserting surnames with accented characters into the employee table, e.g. Núñez.

My Java (8) application is a Spring (4.3.15) web app using Mybatis (3.2.4).

When my application reads such a surname from the SQL_ASCII database, I get:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xe3 0xa1 0x54
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:616)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:466)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:459)
    at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93)
    at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93)
    at jdk.internal.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:55)
    at com.sun.proxy.$Proxy98.execute(Unknown Source)

If I try to change client_encoding with:

SET client_encoding = 'SQL_ASCII';

then I get this error:

org.postgresql.util.PSQLException: The server's client_encoding parameter was changed to LATIN1. The JDBC driver requires client_encoding to be UTF8 for correct operation.
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1950)

How can I "safely" read these characters from the database?

Answer 1

allowEncodingChanges=true

Can you try setting allowEncodingChanges=true in the JDBC connection URL? (And characterEncoding as well.)

allowEncodingChanges = boolean

When using the V3 protocol the driver monitors changes in certain server configuration parameters that should not be touched by end users. The client_encoding setting is set by the driver and should not be altered. If the driver detects a change it will abort the connection. There is one legitimate exception to this behaviour though, using the COPY command on a file residing on the server's filesystem. The only means of specifying the encoding of this file is by altering the client_encoding setting. The JDBC team considers this a failing of the COPY command and hopes to provide an alternate means of specifying the encoding in the future, but for now there is this URL parameter. Enable this only if you need to override the client encoding when doing a copy.

Reference: Chapter 3. Initializing the Driver
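As a minimal sketch, the parameter could be appended to the connection URL as shown below. The host, port, and database name are placeholders, and `PgUrlExample` and `buildUrl` are hypothetical names used only for illustration. Opening the connection needs the PostgreSQL JDBC driver on the classpath, so that call is left as a comment.

```java
class PgUrlExample {
    // Builds a PostgreSQL JDBC URL with allowEncodingChanges enabled.
    // Host, port, and database are placeholders supplied by the caller.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?allowEncodingChanges=true";
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 5432, "employees");
        System.out.println(url);
        // With the driver on the classpath (credentials are placeholders):
        // Connection conn = java.sql.DriverManager.getConnection(url, "user", "password");
    }
}
```

Per the quoted documentation, this only stops the driver from aborting the connection when client_encoding changes; it is intended for COPY and may not by itself make the stored bytes decode correctly.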

The bytes in the error message are 1110 0011, 1010 0001, 0101 0100.

If the stored data were encoded in ISO-8859-1, these would be ã, ¡, T.

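When this byte stream is read as UTF-8, the leading bits 1110 (MSB first) of the first byte announce a 3-byte UTF-8 character (counting the first byte itself).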

So the next 2 bytes should each start with 10 (MSB first). But the 3rd byte starts with 01.

By default, the JDBC driver decodes this stream as UTF-8 and fails because the byte sequence is invalid.
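The bit-level reasoning above can be checked with plain JDK classes. This is only an illustrative sketch: `InvalidSequenceDemo` and its methods are hypothetical names, and the strict decoder here stands in for whatever component rejects the bytes in the real setup.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class InvalidSequenceDemo {
    // The three bytes reported in the error message.
    static final byte[] BYTES = {(byte) 0xE3, (byte) 0xA1, 0x54};

    // True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Strict UTF-8 decoding: malformed input throws instead of being
    // silently replaced with U+FFFD.
    static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        System.out.println("2nd byte is a continuation byte: " + isContinuation(BYTES[1]));
        System.out.println("3rd byte is a continuation byte: " + isContinuation(BYTES[2]));
        System.out.println("As ISO-8859-1: " + new String(BYTES, StandardCharsets.ISO_8859_1));
        try {
            decodeStrictUtf8(BYTES);
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```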

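Had the third byte started with `10` (MSB first), the code would have run without error, but it might have wrongly mapped all `3` bytes to a `single Unicode code point` (assuming the original encoding is not `UTF-8` and the resulting Unicode code point has a valid character representation).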
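To illustrate that silent failure mode, the sketch below swaps the third byte for a made-up value, 0x94 (1001 0100), which does start with `10`. The byte value and the class name `MisdecodeDemo` are assumptions chosen for the example, not data from the question.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Hypothetical bytes: the first two come from the error message,
    // the third is replaced by 0x94 so that it is a valid continuation byte.
    static final byte[] BYTES = {(byte) 0xE3, (byte) 0xA1, (byte) 0x94};

    static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utf8 = asUtf8();
        // Decoded as UTF-8, the three bytes collapse into one code point.
        System.out.println("UTF-8 code points: " + utf8.codePointCount(0, utf8.length()));
        System.out.println("UTF-8 code point:  U+" + Integer.toHexString(utf8.codePointAt(0)).toUpperCase());
        // Decoded as ISO-8859-1, each byte stays a separate character.
        System.out.println("ISO-8859-1 chars:  " + asLatin1().length());
    }
}
```

No exception is raised in this case, so three ISO-8859-1 characters would be read as one unrelated character without any warning.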

Answer 2

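You are out of luck. An SQL_ASCII database does not support encodings; it treats all bytes alike (except the 0 byte). There will be no encoding conversion in the database.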

So unless the data happen to be encoded in UTF-8 (and according to the error message they are not), you cannot use them with the JDBC driver.

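You will have to dump the database and restore it (with the appropriate -E option) into a different database (v13) that has a proper encoding. Any encoding inconsistencies will have to be fixed manually in the process.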
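The manual repair step could look something like the following sketch. It assumes the raw bytes of a value are already available (for example, extracted from a dump) and that anything that is not valid UTF-8 was written as ISO-8859-1. Both are assumptions, and `EncodingRepair` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingRepair {
    // Decodes raw bytes as strict UTF-8 if they are valid UTF-8;
    // otherwise falls back to ISO-8859-1 (an assumption about the source data).
    static String decodeBestEffort(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    // Re-encodes the value as UTF-8, ready for a UTF8 database.
    static byte[] toUtf8(byte[] raw) {
        return decodeBestEffort(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Nuñez" as ISO-8859-1 bytes: ñ is the single byte 0xF1.
        byte[] latin1 = {0x4E, 0x75, (byte) 0xF1, 0x65, 0x7A};
        System.out.println(decodeBestEffort(latin1));
        System.out.println(toUtf8(latin1).length + " bytes after re-encoding");
    }
}
```

A heuristic like this cannot catch ISO-8859-1 sequences that happen to form valid UTF-8, so the results would still need a manual review.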

