直方图原理剖析测试
表结构两个字段id1,id2,其中 id1 的NDV是67181,id2的NDV是10
SQL> desc t1;
Name
Null? Type
-----------------------------------------
-------- ----------------------------
ID1
NUMBER
ID2
NUMBER
SQL> select count(distinct id1) "id1-ndv" from t1;
id1-ndv
----------
67181
SQL> select count(*) "rows-of-t1" from t1;
rows-of-t1
----------
67181
SQL> select count(distinct id2)
"id2-ndv" from t1;
id2-ndv
----------
10
SQL> select distinct id2 ,count(*) from
t1 group by id2 order by id2;
ID2 COUNT(*)
---------- ----------
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
10000 67172
10 rows selected.
测试一: 我们创建索引,不收集直方图。
SQL> create index idx_t1_id2 on t1(id2);
Index created.
收集列id1,和id2上的统计信息,但是不创建直方图
09:21:34 SQL> exec
dbms_stats.gather_table_stats(user,'T1',method_opt=>'for columns id2 size
1');
PL/SQL procedure successfully completed.
Elapsed: 00:00:02.53
09:28:12 SQL> exec
dbms_stats.gather_table_stats(user,'T1',method_opt=>'for columns id1 size
1');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.35
09:28:13 SQL> select table_name,column_name,num_distinct,low_value,high_value,density,num_nulls,num_buckets,histogram
from user_tab_columns where table_name='T1';
TABLE_NAME
COLUMN_NAME NUM_DISTINCT LOW_VALUE HIGH_VALUE
DENSITY NUM_NULLS NUM_BUCKETS HISTOGRAM
----------
-------------------- ------------ -------------------- --------------------
---------- ---------- ----------- ----------
T1 ID1 67181 C103 C3082533 .000014885
1 NONE
T1 ID2 10 C103 C302 .1 1 NONE
2 rows selected.
这里统计信息没有直方图信息,但是NUM_DISTINCT,DENSITY都有了,这个DENSITY优化器会用,计算表返回的行数,从而选择合适的执行计划 。
09:39:25 SQL> set autotr on;
09:39:29 SQL> select * from t1 where
id2=6;
ID1 ID2
---------- ----------
6 6
1 row selected.
Elapsed: 00:00:00.04
Execution Plan
----------------------------------------------------------
Plan hash value: 2490154277
-------------------------------------------------------------------------------
| Id | Operation |
Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6718 | 53744 | 28
(0)| 00:00:01 |
| 1 |
TABLE ACCESS BY INDEX ROWID BATCHED| T1 | 6718 | 53744 | 28
(0)| 00:00:01 |
|* 2 |
INDEX RANGE SCAN |
IDX_T1_ID2 | 6718 | |
14 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by
operation id):
---------------------------------------------------
2
- access("ID2"=6)
Statistics
----------------------------------------------------------
1 recursive calls
db block gets
4 consistent gets
physical reads
redo size
625 bytes sent via SQL*Net to
client
390 bytes received via SQL*Net
from client
2 SQL*Net roundtrips to/from
client
sorts (memory)
sorts (disk)
1 rows processed
10:10:23 SQL>select * from t1 where
id2=10000
。。。。。省略大量输出
62647
10000
67172 rows selected.
Elapsed: 00:00:18.25
Execution Plan
----------------------------------------------------------
Plan hash value:
2490154277
-------------------------------------------------------------------------------------------
| Id | Operation |
Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT |
| 6718 | 53744 | 28
(0)| 00:00:01 |
| 1 |
TABLE ACCESS BY INDEX ROWID BATCHED | T1 | 6718 | 53744 | 28
(0)| 00:00:01 |
|* 2 | INDEX
RANGE SCAN | IDX_T1_ID2 | 6718 |
| 14
(0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by
operation id):
---------------------------------------------------
2
- access("ID2"=10000)
Statistics
----------------------------------------------------------
1 recursive calls
db block gets
9210 consistent gets
physical reads
redo size
1732580 bytes sent via SQL*Net to
client
49652 bytes received via SQL*Net
from client
4480 SQL*Net roundtrips to/from
client
sorts (memory)
sorts (disk)
67172 rows processed
这两个字段的数据分布显然不同,id2=10000有67172行,显然这里的统计信息给予优化器的信息不准确,其实优化器使用了Density这个值,计算每个字段的数据,目前没有直方图,这个统计信息默认将10个不同的值平均分布,即平均。导致对于id2=10000的统计信息不准确,从而走了错误的执行计划,应该走全表扫描。下面我们收集直方图
10:00:06 SQL> select
COLUMN_NAME,ENDPOINT_NUMBER,ENDPOINT_VALUE,ENDPOINT_REPEAT_COUNT from
user_tab_histograms where table_name='T1' and column_name='ID2';
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
ENDPOINT_REPEAT_COUNT
-------------------- ---------------
-------------- ---------------------
ID2 1 2
ID2 2 3
ID2 3 4
ID2 4 5
ID2 5 6
ID2 6 7
ID2 7 8
ID2 8 9
ID2 9 10
ID2 67181 10000
10 rows selected.
对于频率直方图而言,endpoint_number的差是对应ndv值数量,比如3有2-1=1个,1000有67181-9=67172个
再次看执行计划是否改变
10:16:32 SQL> explain plan for
10:17:05
2 select * from t1 where
id2=10000;
Explained.
Elapsed: 00:00:00.00
10:17:17 SQL> select * from
table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
Plan hash value: 3617692013
--------------------------------------------------------------------------
| Id
| Operation |
Name | Rows | Bytes | Cost (%CPU)|
Time |
--------------------------------------------------------------------------
| 0
| SELECT STATEMENT | |
67172 |
524K| 69 (2)| 00:00:01 |
|* 1
| TABLE ACCESS FULL | T1
| 67172 | 524K|
69 (2)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by
operation id):
---------------------------------------------------
1
- filter("ID2"=10000)
13 rows selected.
Elapsed: 00:00:00.06
这里统计信息变了,通过直方图优化器知道返回67192行,此时选择了全表扫描,注意此时的COst值为69比走索引时的值高,但是此时是一个合理的执行计划,
索引对于DBA,看SQL是否有基本不看Cost值,它是优化器做选择的依据,但是这个选择有时是错误的。
有了直方图后,优化器计算通过索引的成本就会增加,我们使用hint强制走索引看下此时的成本值
67172 rows selected.
Elapsed: 00:00:16.90
Execution Plan
----------------------------------------------------------
Plan hash value: 2490154277
-------------------------------------------------------------------------------
| Id | Operation |
Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 67172 |
524K| 271 (1)| 00:00:01 |
| 1 |
TABLE ACCESS BY INDEX ROWID BATCHED| T1 | 67172 | 524K|
271 (1)| 00:00:01 |
|* 2 |
INDEX RANGE SCAN
| IDX_T1_ID2 | 67172 | |
132 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by
operation id):
---------------------------------------------------
2
- access("ID2"=10000)
Statistics
----------------------------------------------------------
1 recursive calls
db block gets
9210 consistent gets
physical reads
redo size
1732580 bytes sent via SQL*Net to
client
49902 bytes received via SQL*Net
from client
4480 SQL*Net roundtrips to/from
client
sorts (memory)
sorts (disk)
67172 rows processed
发现此时的cost值为271显然成本很高,所以在有直方图的情况下,Oracle选择使用全表扫描,而不是通过索引找id=10000的数据。