Should I save ASCII-only varchar in UTF-8 or ASCII?

I have a varchar column that contains only ASCII symbols. I don't need to sort by this field, but I do need to search it by exact equality.

The default locale is en.UTF8. Do I gain anything if I create this column with COLLATE "C"?

Yes, it makes a difference.

Even if you do not sort on purpose, various operations require internal sort steps (some aggregate functions, DISTINCT, nested loop joins, etc.).

Also, any index on the field has to sort values internally, and it observes the collation rules unless COLLATE "C" applies (no collation rules).
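To make "no collation rules" concrete, here is a minimal sketch comparing sort orders. The sample values are made up for illustration, and the "en_US" collation name is an assumption: it only exists if the operating system provided that locale when the cluster was initialized.

```sql
-- COLLATE "C" compares strings byte by byte, so uppercase ASCII
-- letters (65-90) sort before lowercase ones (97-122).
SELECT x
FROM  (VALUES ('a'), ('B'), ('c'), ('A')) AS t(x)
ORDER  BY x COLLATE "C";
-- returns: A, B, a, c

-- A linguistic collation applies locale rules instead.
-- "en_US" is assumed to be available; the result depends on the
-- locale definitions shipped with the operating system.
SELECT x
FROM  (VALUES ('a'), ('B'), ('c'), ('A')) AS t(x)
ORDER  BY x COLLATE "en_US";
```

The byte-wise comparison is cheaper than applying locale rules, and that is where the speed difference comes from.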

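For searches by exact equality you want an index. It works either way (for equality), but it is generally faster without collation rules. Depending on the details of your use case, the effect may be negligible or substantial. The impact grows with the length of your strings. I ran a benchmark for a related case some time ago: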

  • Slow query ordering by a column in a joined table
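For the case in the question, a setup along these lines might be used. This is a sketch only: the table name tbl, the column name code, and the index name are hypothetical.

```sql
-- Declare the column with COLLATE "C" so comparisons and index
-- ordering are byte-wise, independent of the database default locale.
CREATE TABLE tbl (
   id   serial PRIMARY KEY
 , code varchar COLLATE "C" NOT NULL
);

-- A plain B-tree index inherits the column's collation.
CREATE INDEX tbl_code_idx ON tbl (code);

-- Equality lookups can use the index.
SELECT * FROM tbl WHERE code = 'ABC123';

-- To check which collation the current database uses by default:
SELECT datname, datcollate, datctype
FROM   pg_database
WHERE  datname = current_database();
```

Whether the gain is measurable depends on data volume and string length, so it is worth testing with your own data.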

Also, there are more pattern matching options with locale "C". The alternative is to create an index with the special varchar_pattern_ops operator class.
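A minimal sketch of that alternative, assuming a hypothetical table tbl whose varchar column code uses a non-C default collation:

```sql
-- With a non-C collation, a plain B-tree index cannot support
-- left-anchored LIKE patterns. The varchar_pattern_ops operator
-- class compares character by character, so this index can.
CREATE INDEX tbl_code_pattern_idx ON tbl (code varchar_pattern_ops);

-- Left-anchored pattern: can use the index above.
SELECT * FROM tbl WHERE code LIKE 'ABC%';

-- Equality can use the same index as well.
SELECT * FROM tbl WHERE code = 'ABC123';
```

If the column is declared with COLLATE "C" in the first place, a plain index covers both cases and the extra operator class is not needed.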

Related:

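Postgres 9.5 introduced performance improvements with a technique called "abbreviated keys", which ran into problems with some locales. So it was deactivated, except for the C locale. Quoting the release notes of Postgres 9.5.2: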

  • Disable abbreviated keys for string sorting in non-C locales (Robert Haas)

PostgreSQL 9.5 introduced logic for speeding up comparisons of string data types by using the standard C library function strxfrm() as a substitute for strcoll(). It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(). Until this problem can be better characterized, disable the optimization in all non-C locales. (C locale is safe since it uses neither strcoll() nor strxfrm().)

Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on text, varchar, or char columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.

It is not possible at this time to give an exhaustive list of known-affected locales. C locale is known safe, and there is no evidence of trouble in English-based locales such as en_US, but some other popular locales such as de_DE are affected in most glibc versions.
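The REINDEX step recommended in the release notes could look like the following. The index and table names are hypothetical; which indexes actually need rebuilding depends on your locale and on whether they were built or modified under 9.5.0 or 9.5.1.

```sql
-- Rebuild a single index on a text, varchar, or char column.
REINDEX INDEX tbl_code_idx;

-- Or rebuild every index of one table.
REINDEX TABLE tbl;
```

REINDEX blocks writes to the table while it runs, so schedule it accordingly.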

The problem also illustrates what collation rules do in general.