PostgreSQL DBA(42) - locale

2019-06-25 17:51:06

When initializing a database cluster with initdb, PostgreSQL accepts a localization option, --locale. If the option is not given, initdb falls back to the locale settings of the operating system environment.
Locale settings affect the following SQL features:
1. Sorting and comparison: sort order in queries using ORDER BY or the standard comparison operators on textual data
2. Built-in functions: the upper, lower, and initcap functions
3. Pattern matching: pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case-insensitive matching and the classification of characters by character-class regular expressions
4. to_char functions: the to_char family of functions
5. Index use with LIKE: the ability to use indexes with LIKE clauses
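
Before looking at each effect, it can help to check which locale a database actually uses. The following is a minimal sketch: the first query reads the per-database settings from the pg_database catalog, and the second creates a database with an explicit locale. The database name demo_c is a hypothetical name chosen for illustration.

```sql
-- Show the collation (LC_COLLATE) and character classification (LC_CTYPE)
-- locale recorded for every database in the cluster.
SELECT datname, datcollate, datctype
FROM pg_database;

-- Create a database with an explicit locale instead of the cluster default.
-- demo_c is a hypothetical name; template0 is used because the locale
-- differs from the one the default template was created with.
CREATE DATABASE demo_c
    TEMPLATE = template0
    LC_COLLATE = 'C'
    LC_CTYPE = 'C';
```

The locale of an existing database cannot simply be switched afterwards, so it is worth checking these values before loading data.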

Sorting
The same data sorted under different LC_COLLATE settings produces different output:


postgres=# SELECT name FROM unnest(ARRAY['MYNAME', ' my_name', 'my-image.jpg', 'my-third-image.jpg']) name ORDER BY name collate "C";
        name        
--------------------
  my_name
 MYNAME
 my-image.jpg
 my-third-image.jpg
(4 rows)
postgres=# SELECT name FROM unnest(ARRAY['MYNAME', ' my_name', 'my-image.jpg', 'my-third-image.jpg']) name ORDER BY name collate "zh_CN";
        name        
--------------------
 my-image.jpg
  my_name
 MYNAME
 my-third-image.jpg
(4 rows)

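With COLLATE "C", strings are compared byte by byte, which for these strings means by ASCII code value. With zh_CN, the locale-specific collation rules of the operating system are applied instead.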
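
The byte-by-byte behavior of the "C" collation can be made visible by sorting on the raw bytes of each string. This is only a sketch and assumes a database whose encoding is UTF8; the column aliases are illustrative.

```sql
-- Sort the sample strings by their raw bytes. bytea values compare byte by
-- byte, so this ordering should match the COLLATE "C" result.
SELECT name,
       ascii(name)              AS first_char_code,  -- code of the first character
       convert_to(name, 'UTF8') AS raw_bytes         -- the bytes being compared
FROM unnest(ARRAY['MYNAME', ' my_name', 'my-image.jpg', 'my-third-image.jpg']) AS name
ORDER BY convert_to(name, 'UTF8');
```

The space (code 32) sorts before uppercase M (code 77), which in turn sorts before lowercase m (code 109). That explains why ' my_name' comes first and 'MYNAME' second under "C".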

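With zh_CN, the ordering behaves much like a case-insensitive comparison: case, whitespace, and punctuation carry little weight compared with the letters themselves.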


postgres=# SELECT name FROM unnest(ARRAY['MYNAME1', ' my_name2', 'my-image.jpg', 'my-third-image.jpg']) name ORDER BY name collate "zh_CN";
        name        
--------------------
 my-image.jpg
 MYNAME1
  my_name2
 my-third-image.jpg
(4 rows)
postgres=# SELECT name FROM unnest(ARRAY['myname1', ' myname2', 'myimage.jpg', 'mythirdimage.jpg']) name ORDER BY name collate "zh_CN";
       name       
------------------
 myimage.jpg
 myname1
  myname2
 mythirdimage.jpg
(4 rows)

The explanation given on the mailing list is as follows:

The behavior of each collation comes from the operating system’s own
libc, except for the C collation, which is based on the ordering
implied by strcmp() comparisons. Generally, most implementations have
the behavior you describe, in that they assign least weight of all to
caseness and whitespace, and somewhat more weight to punctuation. I
don’t think that there is much that can be done about it in practice,
though in principal there could be a collation that has all the
properties you want.
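
Since the behavior of each collation comes from the operating system, it may be useful to see which collations the server knows about and which provider backs them. A minimal sketch follows, assuming PostgreSQL 10 or later (where the collprovider column exists). Rows for zh_CN and fr_FR appear only if the operating system offered those locales when the cluster was initialized.

```sql
-- List the collations used in this article together with their OS locale names.
-- collprovider: 'c' = libc, 'i' = ICU, 'd' = database default.
SELECT collname, collprovider, collcollate, collctype
FROM pg_collation
WHERE collname IN ('C', 'POSIX', 'zh_CN', 'fr_FR')
ORDER BY collname;
```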

Built-in functions
For example, initcap behaves differently under the French locale and under C:


postgres=#  select initcap('élysée' collate "C");
 initcap 
---------
 éLyséE
(1 row)
postgres=#  select initcap('élysée' collate "fr_FR");
 initcap 
---------
 Élysée
(1 row)

Under a Chinese locale, a full-width lowercase letter is converted to the corresponding full-width uppercase letter:


postgres=# select initcap('ａ' collate "zh_CN");
 initcap 
---------
 Ａ
(1 row)
postgres=# select initcap('ａ' collate "C");
 initcap 
---------
 ａ
(1 row)

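Under the "C" locale, case conversion and character classification apply only to ASCII characters (code points up to 0x7F); other characters, such as é or the full-width ａ, are left unchanged. This is also why initcap returns éLyséE under C: é is not recognized as a letter, so the letter that follows it is treated as the start of a new word.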
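
The same ASCII-only behavior can be checked with upper and lower. This is a sketch under the assumption that the fr_FR collation is available on the server, as in the examples above.

```sql
-- Compare case conversion under "C" and under a French locale.
-- Under "C" the accented letters would be expected to stay as they are,
-- while the plain ASCII letters are still converted.
SELECT upper('élysée' COLLATE "C")     AS upper_c,
       lower('ÉLYSÉE' COLLATE "C")     AS lower_c,
       upper('élysée' COLLATE "fr_FR") AS upper_fr,
       lower('ÉLYSÉE' COLLATE "fr_FR") AS lower_fr;
```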

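Pattern matching
Under fr_FR the regular-expression class \w matches é, while under C it matches only ASCII word characters: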


postgres=#  select 'élysée' ~ '^\w+$' collate "fr_FR";
 ?column? 
----------
 t
(1 row)
postgres=#  select 'élysée' COLLATE "C" ~ '^\w+$';
 ?column? 
----------
 f
(1 row)
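
The locale also influences POSIX character classes and case-insensitive matching, the two aspects named in the feature list at the top. The queries below are a sketch for trying this out, again assuming the fr_FR collation exists. No output is shown because the results depend on the server's locale data.

```sql
-- Character classification: is é a letter according to [[:alpha:]]?
SELECT 'é' ~ '^[[:alpha:]]$' COLLATE "C"     AS alpha_c,
       'é' ~ '^[[:alpha:]]$' COLLATE "fr_FR" AS alpha_fr;

-- Case-insensitive matching: does É match é with the ~* operator?
SELECT 'É' ~* '^é$' COLLATE "C"     AS icase_c,
       'É' ~* '^é$' COLLATE "fr_FR" AS icase_fr;
```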

Whether LIKE can use an index


postgres=# CREATE TABLE t_sort (
postgres(#     a text COLLATE "zh_CN",
postgres(#     b text COLLATE "C");
CREATE TABLE
postgres=# 
postgres=# INSERT INTO t_sort SELECT md5(n::text), md5(n::text)
postgres-#     FROM generate_series(1, 1000000) n; 
INSERT 0 1000000
postgres=# CREATE INDEX ON t_sort USING btree (a);
CREATE INDEX
postgres=# CREATE INDEX ON t_sort USING btree (b);
CREATE INDEX
postgres=# ANALYZE t_sort;
ANALYZE
postgres=# SELECT * FROM t_sort LIMIT 2;
                a                 |                b                 
----------------------------------+----------------------------------
 c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
 c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
(2 rows)
postgres=# explain SELECT * FROM t_sort WHERE a LIKE 'c4ca4238a0%';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Gather  (cost=1000.00..18564.33 rows=100 width=66)
   Workers Planned: 2
   ->  Parallel Seq Scan on t_sort  (cost=0.00..17554.33 rows=42 width=66)
         Filter: (a ~~ 'c4ca4238a0%'::text)
(4 rows)
postgres=# explain SELECT * FROM t_sort WHERE b LIKE 'c4ca4238a0%';
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Index Scan using t_sort_b_idx on t_sort  (cost=0.42..8.45 rows=100 width=66)
   Index Cond: ((b >= 'c4ca4238a0'::text) AND (b < 'c4ca4238a1'::text))
   Filter: (b ~~ 'c4ca4238a0%'::text)
(3 rows)

The column that uses zh_CN cannot use its index for the LIKE prefix search, whereas the column that uses C can.
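
A common way to make prefix searches indexable on a column with a non-C collation is an additional index built with the text_pattern_ops operator class, which compares values strictly character by character instead of following the locale rules. The sketch below reuses the t_sort table from above; the index name t_sort_a_pattern_idx is a hypothetical name.

```sql
-- Extra index on the zh_CN column using text_pattern_ops, which ignores
-- locale-specific collation rules when comparing values.
CREATE INDEX t_sort_a_pattern_idx ON t_sort USING btree (a text_pattern_ops);

-- Re-check the plan for the prefix search on column a.
EXPLAIN SELECT * FROM t_sort WHERE a LIKE 'c4ca4238a0%';
```

With this index in place the planner would be expected to consider an index scan for the prefix search, though the plan it finally chooses depends on statistics and cost settings. The default index on a is still needed for ordinary comparisons such as < and ORDER BY under zh_CN.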

References
Locale Support
One more time about collation in PostgreSQL
What is the impact of LC_CTYPE on a PostgreSQL database?
Re: Problem with PostgreSQL string sorting
