
A Brief Analysis of How MySQL 8.0 Histograms Work

Source: compiled by Haote | Date: 2024-05-27 09:45:47

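This article introduces the concept of histograms, walks through examples of how to use them, gives a brief analysis of how creating and dropping histograms works internally, and illustrates a typical use case.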

This article is shared from the Huawei Cloud Community post "[MySQL Technical Column] Introduction to MySQL 8.0 Histograms", by GaussDB Database.

Background

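A database query optimizer is responsible for turning a SQL query into the most efficient execution plan it can find. Because the data keeps changing, however, the optimizer may not know enough about the data being queried, and it can fail to produce the optimal plan, which hurts query performance. MySQL 8.0 introduced histograms to address this problem.

A histogram records the distribution of values in a column and supplies that information to the optimizer as statistics. With a histogram on a column, the optimizer can estimate the selectivity of the filter conditions in a WHERE clause, estimate row counts more accurately, and choose a more efficient query plan.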


Introduction to MySQL 8.0 Histograms

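The quality of the execution plan the optimizer generates largely determines how long a query takes. If the optimizer does not know how the data in a table is distributed, it may fail to produce the optimal plan and waste time during execution.

Suppose a SQL statement queries the number of travelers in two different time periods of equal length. If the optimizer does not know the count in each period, it assumes that travelers are evenly distributed across the two. If the actual counts differ widely, the optimizer's estimates are badly off and it may choose the wrong execution plan. How, then, can the optimizer get a clearer picture of the data so that it produces a good plan?

One solution is to build a histogram on the column, which gives an approximation of the column's data distribution. Used well, histograms bring several benefits: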

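(1) Query optimization: statistics about the data distribution help the optimizer choose a better plan and suitable indexes, and help users tune their queries, which improves query performance;

(2) Index design: analyzing the data distribution helps determine which columns are worth indexing, which improves query efficiency;

(3) Data analysis: the distribution helps users understand the characteristics and trends of the data.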
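To make the travel example concrete, the following Python sketch compares the estimate produced by an even-distribution assumption with the real counts. The trip counts and the names `trips_per_period` and `uniform_estimate` are hypothetical and chosen only for illustration; they are not taken from MySQL.

```python
# Hypothetical trip counts for two equally long time periods (illustrative data).
trips_per_period = {"08:00-09:00": 9000, "14:00-15:00": 1000}

total = sum(trips_per_period.values())


def uniform_estimate(num_periods: int, total_rows: int) -> float:
    """Row estimate per period if rows are assumed to be evenly spread."""
    return total_rows / num_periods


estimate = uniform_estimate(len(trips_per_period), total)

for period, actual in trips_per_period.items():
    # Ratio of estimated to actual rows; 1.00 would be a perfect estimate.
    ratio = estimate / actual
    print(f"{period}: estimated={estimate:.0f} actual={actual} ratio={ratio:.2f}")
```

With these made-up numbers the even-distribution estimate is too low for the busy period and five times too high for the quiet one, which is the kind of gap a histogram is meant to close.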

MySQL histograms come in two types: singleton and equi-height. A singleton histogram stores one value per bucket together with the cumulative frequency of that value:

SCHEMA_NAME: xxx  // schema (database) name
TABLE_NAME: xxx   // table name
COLUMN_NAME: xxx  // column name
HISTOGRAM: {
  "buckets": [
    [
      xxx,  // value stored in the bucket
      xxx   // cumulative frequency of the value
    ],
    ......
  ],
  "data-type": "xxx",  // data type
  "null-values": xxx,  // fraction of values that are NULL
  "collation-id": xxx,
  "last-updated": "xxxx-xx-xx xx:xx:xx.xxxxxx",  // time of the last update
  "sampling-rate": xxx,  // sampling rate; 1 means all rows were read
  "histogram-type": "singleton",  // histogram type: singleton
  "number-of-buckets-specified": xxx  // number of buckets requested
}
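As a rough illustration of what the singleton bucket list contains, the Python sketch below builds `[value, cumulative frequency]` pairs from a list of column values. It is a simplified model, not MySQL's implementation: the function name `build_singleton` is made up, and the sketch assumes that frequencies are fractions of all rows, with NULLs reported separately.

```python
from collections import Counter


def build_singleton(values):
    """Return ([[value, cumulative_frequency], ...], null_fraction).

    Buckets are sorted by value; None stands in for SQL NULL and is
    reported as a separate fraction instead of getting a bucket.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)

    buckets = []
    running = 0  # rows seen so far, in value order
    for value in sorted(counts):
        running += counts[value]
        buckets.append([value, running / total])

    null_fraction = (total - len(non_null)) / total if total else 0.0
    return buckets, null_fraction


buckets, null_fraction = build_singleton([10, 20, 20, 30, 30, 30, None, 10])
print(buckets)        # one bucket per distinct value
print(null_fraction)  # share of NULLs among all rows
```

Because the frequencies are cumulative, the frequency of a single value is the difference between its bucket and the previous one.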

An equi-height histogram stores, for each bucket, the lower and upper bounds, the cumulative frequency, and the number of distinct values:

SCHEMA_NAME: xxx
TABLE_NAME: xxx
COLUMN_NAME: xxx
HISTOGRAM: {
  "buckets": [
    [
      xxx,  // lower bound (minimum value in the bucket)
      xxx,  // upper bound (maximum value in the bucket)
      xxx,  // cumulative frequency up to and including this bucket
      xxx   // number of distinct values in the bucket
    ],
    ......
  ],
  "data-type": "xxx",
  "null-values": xxx,
  "collation-id": xxx,
  "last-updated": "xxxx-xx-xx xx:xx:xx.xxxxxx",
  "sampling-rate": xxx,
  "histogram-type": "equi-height",  // histogram type: equi-height
  "number-of-buckets-specified": xxx
}
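The sketch below shows one simple way to produce buckets of the form `[lower, upper, cumulative frequency, distinct values]` in Python. It uses a plain greedy split that closes a bucket once it has received its share of the rows. This is an assumption made for illustration: the function name `build_equi_height` is hypothetical, and MySQL's own Equi_height::build_histogram may split the values differently.

```python
from collections import Counter


def build_equi_height(values, num_buckets):
    """Greedy sketch: [[lower, upper, cumulative_freq, num_distinct], ...]."""
    counts = Counter(values)
    total = len(values)
    distinct = sorted(counts)

    buckets = []
    running = 0      # rows consumed so far
    lower = None     # lower bound of the bucket being filled
    n_distinct = 0   # distinct values in the bucket being filled

    for i, value in enumerate(distinct):
        if lower is None:
            lower = value
        running += counts[value]
        n_distinct += 1

        # Bucket k (1-based) is closed once k/num_buckets of the rows are covered.
        threshold = total * (len(buckets) + 1) / num_buckets
        is_last_value = i == len(distinct) - 1
        if running >= threshold or is_last_value:
            buckets.append([lower, value, running / total, n_distinct])
            lower, n_distinct = None, 0

    return buckets


result = build_equi_height([1, 1, 1, 2, 3, 4, 5, 5, 6, 7], num_buckets=3)
for bucket in result:
    print(bucket)
```

All rows of a given value always land in the same bucket, so bucket heights are only approximately equal when some values are very frequent.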

Using MySQL 8.0 Histograms

Histograms are created and dropped with the ANALYZE TABLE statement. The commonly used syntax is:

Create a histogram:

ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] ... [WITH N BUCKETS]

Drop a histogram:

ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name] ...
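For scripts that maintain histograms on many columns, the two statements can be generated from a small helper. The Python sketch below only builds the statement text; the function names are hypothetical, and identifiers are inserted as-is without quoting, so it should only be used with trusted table and column names. The bucket check reflects the range MySQL accepts for N (1 to 1024, with 100 used when the clause is omitted).

```python
def update_histogram_sql(table, columns, buckets=None):
    """Build an ANALYZE TABLE ... UPDATE HISTOGRAM statement as a string."""
    if not columns:
        raise ValueError("at least one column is required")
    sql = f"ANALYZE TABLE {table} UPDATE HISTOGRAM ON {', '.join(columns)}"
    if buckets is not None:
        # MySQL accepts between 1 and 1024 buckets.
        if not 1 <= buckets <= 1024:
            raise ValueError("buckets must be between 1 and 1024")
        sql += f" WITH {buckets} BUCKETS"
    return sql


def drop_histogram_sql(table, columns):
    """Build an ANALYZE TABLE ... DROP HISTOGRAM statement as a string."""
    if not columns:
        raise ValueError("at least one column is required")
    return f"ANALYZE TABLE {table} DROP HISTOGRAM ON {', '.join(columns)}"


print(update_histogram_sql("t1", ["c13"], buckets=16))
print(drop_histogram_sql("t1", ["c13"]))
```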

A concrete example:

mysql> create table t1(c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int,c10 int,c11 int,c12 int,c13 datetime,c14 int,c15 int,c16 int,primary key(c1));

Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values(1,2,3,4,5,6,7,8,9,10,11,12,'0000-01-01',14,15,16),(2,2,3,4,5,6,7,8,9,10,11,12,'0500-01-01',14,15,16),(3,2,3,4,5,6,7,8,9,10,11,12,'1000-01-01',14,15,16),(4,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16),(5,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16);

Query OK, 5 rows affected (0.00 sec)

Records: 5 Duplicates: 0 Warnings: 0

Create the histogram:

mysql> analyze table t1 update histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                       |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics created for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.01 sec)

View the histogram information:

mysql> select json_pretty(histogram)result from information_schema.column_statistics where table_name = 't1' and column_name = 'c13'\G
*************************** 1. row ***************************
result: {
  "buckets": [
    [
      "0000-01-01 00:00:00.000000",  // column value
      0.2  // cumulative frequency of the value (same for the buckets below)
    ],
    [
      "0500-01-01 00:00:00.000000",
      0.4
    ],
    [
      "1000-01-01 00:00:00.000000",
      0.6
    ],
    [
      "1500-01-01 00:00:00.000000",
      1.0
    ]
  ],
  "data-type": "datetime",  // data type of the column
  "null-values": 0.0,  // fraction of values that are NULL
  "collation-id": 8,  // collation ID of the histogram data
  "last-updated": "2023-09-30 16:05:28.533732",  // time the histogram was last updated
  "sampling-rate": 1.0,  // sampling rate used to build the histogram
  "histogram-type": "singleton",  // histogram type: singleton
  "number-of-buckets-specified": 100  // number of buckets requested
}
1 row in set (0.00 sec)
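The cumulative frequencies in this output are enough to estimate the selectivity of simple predicates on c13. The Python sketch below copies the four buckets from the output above and derives estimates for equality and less-than predicates. The helper names are hypothetical, and the logic is a simplified model of how a singleton histogram can be read, not MySQL's own selectivity code. The bucket values are compared as strings, which works here because the timestamps have a fixed width.

```python
import bisect

# Buckets copied from the output above: [value, cumulative frequency].
buckets = [
    ["0000-01-01 00:00:00.000000", 0.2],
    ["0500-01-01 00:00:00.000000", 0.4],
    ["1000-01-01 00:00:00.000000", 0.6],
    ["1500-01-01 00:00:00.000000", 1.0],
]
values = [b[0] for b in buckets]
cumulative = [b[1] for b in buckets]


def eq_selectivity(value):
    """Estimated fraction of rows where the column equals `value`."""
    i = bisect.bisect_left(values, value)
    if i == len(values) or values[i] != value:
        return 0.0  # value does not appear in the histogram
    previous = cumulative[i - 1] if i > 0 else 0.0
    return cumulative[i] - previous


def lt_selectivity(value):
    """Estimated fraction of rows where the column is less than `value`."""
    i = bisect.bisect_left(values, value)
    return cumulative[i - 1] if i > 0 else 0.0


print(eq_selectivity("1500-01-01 00:00:00.000000"))  # two of the five rows
print(lt_selectivity("1000-01-01 00:00:00.000000"))  # rows before year 1000
```

The value 1500-01-01 occurs in two of the five inserted rows, which is why its bucket jumps from 0.6 to 1.0.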

Drop the histogram:

mysql> analyze table t1 drop histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                       |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics removed for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.00 sec)

A Look at How MySQL 8.0 Histograms Work

The overall framework of the histogram implementation can be summarized as shown in the figure below:

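The histogram code lives mainly under sql/histograms. Files with the equi_height prefix implement equi-height histograms, files with the singleton prefix implement singleton histograms, files with the value_map prefix implement the structure that holds the collected values, and histogram.h/histogram.cc contain the histogram interfaces that callers use.

Sql_cmd_analyze_table::handle_histogram_command is the overall entry point for histogram operations; at present, a histogram statement can operate on only one table at a time. The main call stack for creating a histogram is shown below, with update_histogram as the entry point for creation.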

mysql_execute_command

->Sql_cmd_analyze_table::execute

->Sql_cmd_analyze_table::handle_histogram_command

->Sql_cmd_analyze_table::update_histogram

->histograms::update_histogram

->prepare_value_maps

->fill_value_maps

->build_histogram

->store_histogram

->dd::cache::Dictionary_client::update

->dd::cache::Storage_adapter::store

->dd::Column_statistics_impl::store_attributes

->histograms::Singleton::histogram_to_json

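In more detail, the creation flow works as follows. prepare_value_maps creates a value_map that matches the type of each histogram column. The sampling rate is then derived from the value of histogram_generation_max_mem_size (which limits the maximum amount of memory that histogram generation may use) and the size of a single row. fill_value_maps reads rows repeatedly and fills the value_map of the matching type, where the key is the actual column value and the value is the number of times it occurs.

build_histogram then builds the histogram. If the number of buckets (num_buckets) is greater than or equal to the number of distinct values (value_map.size()), a singleton histogram is created automatically; otherwise an equi-height histogram is created. The two construction routines are Singleton::build_histogram and Equi_height::build_histogram.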
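The two decisions described above, how much of the table to sample and which histogram type to build, can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names are hypothetical, the memory budget and row size in the usage lines are example values, and the formula treats the budget as a simple "rows that fit in memory" limit, which may differ in detail from what MySQL computes. The type choice follows the comparison in MySQL's build_histogram, which picks a singleton histogram when the bucket count is at least the number of distinct values.

```python
def sampling_rate(max_mem_bytes, row_size_bytes, row_count):
    """Fraction of rows to read so that the sampled data fits in the budget."""
    if row_count == 0 or row_size_bytes == 0:
        return 1.0
    rows_that_fit = max_mem_bytes / row_size_bytes
    return min(1.0, rows_that_fit / row_count)


def choose_histogram_type(num_buckets, num_distinct_values):
    """Singleton when every distinct value can have a bucket of its own."""
    if num_buckets >= num_distinct_values:
        return "singleton"
    return "equi-height"


# Example budget of 20,000,000 bytes with 100-byte rows (illustrative numbers).
print(sampling_rate(20_000_000, 100, 1_000_000))  # only part of the table fits
print(sampling_rate(20_000_000, 100, 50_000))     # everything fits
print(choose_histogram_type(100, 4))              # few distinct values
print(choose_histogram_type(100, 5_000))          # many distinct values
```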

Once the histogram is built, store_histogram stores the result as JSON in a system table, and it is exposed to users through INFORMATION_SCHEMA.COLUMN_STATISTICS. histogram_to_json converts the histogram into a Json_object: for example, last-updated is stored as a Json_datetime, histogram-type as a Json_string, and sampling-rate as a Json_double. Each JSON field is then saved by calling json_object->add_clone in turn.
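The shape of the stored document can be mimicked with Python's standard json module. The sketch below assembles the same top-level fields that appear in INFORMATION_SCHEMA.COLUMN_STATISTICS. The function name `histogram_to_json_sketch` and its parameters are hypothetical; the real conversion is done in C++ with MySQL's own JSON classes.

```python
import json
from datetime import datetime, timezone


def histogram_to_json_sketch(buckets, data_type, null_fraction, collation_id,
                             sampling_rate, histogram_type, buckets_specified):
    """Assemble a histogram document with the field names MySQL exposes."""
    doc = {
        "buckets": buckets,
        "data-type": data_type,
        "null-values": null_fraction,
        "collation-id": collation_id,
        # Timestamp with microseconds, matching the layout of last-updated.
        "last-updated": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f"),
        "sampling-rate": sampling_rate,
        "histogram-type": histogram_type,
        "number-of-buckets-specified": buckets_specified,
    }
    return json.dumps(doc, indent=2)


text = histogram_to_json_sketch(
    buckets=[[1, 0.5], [2, 1.0]],
    data_type="int",
    null_fraction=0.0,
    collation_id=8,
    sampling_rate=1.0,
    histogram_type="singleton",
    buckets_specified=100,
)
print(text)
```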

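The main call stack for dropping a histogram is shown below. Before dropping, drop_histograms first tries to fetch the histogram to check that it really exists. If it does not, the logic stops early; if it does, the histogram is removed.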

mysql_execute_command

->Sql_cmd_analyze_table::execute

->Sql_cmd_analyze_table::handle_histogram_command

->Sql_cmd_analyze_table::drop_histogram

->histograms::drop_histograms

Optimization Scenarios for MySQL 8.0 Histograms

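On the optimization side, as described earlier, histogram statistics are used to estimate the selectivity of each predicate in the WHERE clause, which helps the optimizer choose the best execution plan. For example, consider a table with the skewed data shown below.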

mysql> select sys_id,order_status,count(*) from my_table_1 group by sys_id,order_status order by 1,2,3;

+--------+--------------+----------+
| sys_id | order_status | count(*) |
+--------+--------------+----------+
|      3 |            1 |        1 |
|      3 |            2 |   200766 |
|      3 |            3 |     3353 |
|      3 |            4 |     1325 |
|      5 |            1 |       13 |
|      5 |            2 |  2478373 |
|      5 |            3 |    43243 |
|      5 |            4 |    13529 |
|      6 |            2 |   171388 |
|      6 |            3 |      254 |
|      6 |            4 |      716 |
+--------+--------------+----------+
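The degree of skew can be quantified directly from this result. The Python sketch below copies the counts from the table above and compares the actual selectivity of each order_status value with what an even split across the four status values would suggest. The even split is used here only as a point of comparison; it is not a claim about the exact default MySQL applies when no histogram exists.

```python
# (sys_id, order_status) -> row count, copied from the query result above.
counts = {
    (3, 1): 1, (3, 2): 200766, (3, 3): 3353, (3, 4): 1325,
    (5, 1): 13, (5, 2): 2478373, (5, 3): 43243, (5, 4): 13529,
    (6, 2): 171388, (6, 3): 254, (6, 4): 716,
}
total_rows = sum(counts.values())


def status_selectivity(status):
    """Actual fraction of rows with the given order_status."""
    matching = sum(n for (_, s), n in counts.items() if s == status)
    return matching / total_rows


statuses = sorted({s for _, s in counts})
uniform = 1 / len(statuses)  # what an even-distribution assumption would give

for s in statuses:
    print(f"order_status={s}: actual={status_selectivity(s):.6f} uniform={uniform:.2f}")
```

Status 2 covers almost the whole table, while status 1 matches only 14 rows out of roughly 2.9 million, so a single fixed estimate cannot be right for both.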

When the following SQL statement is executed, the data skew prevents the optimizer from estimating accurately, so it chooses the wrong execution plan; the statement takes about 1.35 s.

mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G

*************************** 1. row ***************************

EXPLAIN: -> Limit: 20 row(s) (cost=4163.10 rows=20) (actual time=1350.825..1350.825 rows=0 loops=1)

-> Nested loop left join (cost=4163.10 rows=49) (actual time=1350.825..1350.825 rows=0 loops=1)

-> Filter: ((t1.order_status = 1) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=215.79 rows=49) (actual time=1350.823..1350.823 rows=0 loops=1)

-> Index scan on t1 using PRIMARY (reverse) (cost=215.79 rows=8828) (actual time=0.088..1209.201 rows=2910194 loops=1)

-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.63 rows=1) (never executed)

After creating histograms with ANALYZE table my_table_1 UPDATE HISTOGRAM ON order_status, sys_id, create_time and running the same SQL statement again, the index used in the execution plan changes and the execution time drops to about 0.11 s. This shows that the optimizer used the more accurate distribution information to choose a better plan.

mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G

*************************** 1. row ***************************

EXPLAIN: -> Limit: 20 row(s) (cost=38385.46 rows=20) (actual time=114.217..114.217 rows=0 loops=1)

-> Nested loop left join (cost=38385.46 rows=62764) (actual time=114.216..114.216 rows=0 loops=1)

-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=28200.86 rows=62668) (actual time=114.215..114.215 rows=0 loops=1)

-> Filter: (t1.order_status = 1) (cost=28200.86 rows=62668) (actual time=114.207..114.207 rows=0 loops=1)

-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=28200.86 rows=62668) (actual time=0.326..112.912 rows=31142 loops=1)

-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.62 rows=1) (never executed)

In addition, when the values in the WHERE conditions differ, the optimizer picks an appropriate plan for each case based on the data distribution, which improves execution efficiency.

mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (2) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G

*************************** 1. row ***************************

EXPLAIN: -> Limit: 20 row(s) (cost=13541.27 rows=20)

-> Nested loop left join (cost=13541.27 rows=44)

-> Filter: ((t1.order_status = 2) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=15.79 rows=44)

-> Index scan on t1 using PRIMARY (reverse) (cost=15.79 rows=338)

-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)

1 row in set (0.00 sec)

mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (4) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G

*************************** 1. row ***************************

EXPLAIN: -> Limit: 20 row(s) (cost=30559.31 rows=20)

-> Nested loop left join (cost=30559.31 rows=55852)

-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=24966.26 rows=55480)

-> Filter: (t1.order_status = 4) (cost=24966.26 rows=55480)

-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=24966.26 rows=55480)

-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)

1 row in set (0.00 sec)
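Why the plan flips between the two queries can be illustrated with a rough model in Python. For an ORDER BY on the primary key with LIMIT 20, walking the primary key backwards is cheap when matching rows are common, because 20 matches are found quickly, and expensive when they are rare. The sketch assumes that matching rows are spread evenly along the primary key and compares only row counts. The function names and the selectivity values are hypothetical, and MySQL's real cost model considers more factors than this.

```python
def rows_scanned_backward(limit, selectivity, table_rows):
    """Expected rows read walking the primary key until `limit` rows match."""
    if selectivity <= 0:
        return table_rows  # no match expected: the whole index is read
    return min(table_rows, limit / selectivity)


def cheaper_plan(limit, selectivity, table_rows, range_scan_rows):
    """Pick the plan that is expected to read fewer rows."""
    backward = rows_scanned_backward(limit, selectivity, table_rows)
    if backward < range_scan_rows:
        return "primary-key scan"
    return "range scan + sort"


TABLE_ROWS = 2_912_961   # total rows in my_table_1 (sum of the counts shown earlier)
RANGE_ROWS = 55_480      # rows estimated for the range scan in the plan above

# Hypothetical combined selectivities for a common and a rare filter value.
print(cheaper_plan(20, 0.05, TABLE_ROWS, RANGE_ROWS))    # common value
print(cheaper_plan(20, 0.0001, TABLE_ROWS, RANGE_ROWS))  # rare value
```

With accurate selectivities from the histogram, the optimizer can make this kind of comparison per query value instead of applying one fixed guess.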

In short, helping the optimizer choose better plans through the statistics it provides, and thereby improving query performance, is one of the benefits of histograms described earlier.
