监控工作做得不好,出了问题就要背锅,这是我们运维人员多年血泪换来的教训。还记得好几个月前, XXX运营商出了一个重大故障,从故障发生到分析定位花费了几个小时,当时一群人扑向故障,最终在某个人的灵光一现下找到了问题的原因。
为什么需要分析那么长时间呢?大部分原因是由于高并发系统,一旦出现资源紧张,就会蹦出来各种混乱的问题,让你无从着手,不知道到底谁影响了谁?你可能查了半天的原因发现居然是其他程序被影响的。所以要想快速地进行故障分析,并将故障消灭在萌芽状态,建立完善的监控程序至关重要。
咱有了监控,就可以把锅甩给别人。真是个大实话(明明是为了分析问题做得更好)。
自定义SQL监控
以前安装的模板足够一些边缘系统日常使用,但对于一些稍微有点重要性的系统来说,它显然是不够的。例如:TOP SQL。所以我们今天要学习的就是如何定制监控。
参考github文档:
PG_EXPORTER_EXTEND_QUERY_PATH Path to a YAML file containing custom queries to run. Check out queries.yaml for examples of the format.
我们需要配置PG_EXPORTER_EXTEND_QUERY_PATH环境变量指定YAML file文件,在文件中配置监控SQL。而queries.yaml的案例如下。
pg_statio_user_tables:
query: "SELECT current_database() datname, schemaname, relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit, toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit FROM pg_statio_user_tables"
metrics:
- datname:
usage: "LABEL"
description: "Name of current database"
- schemaname:
usage: "LABEL"
description: "Name of the schema that this table is in"
- relname:
usage: "LABEL"
description: "Name of this table"
.............
先要为这个定制查询配置名称,然后写上 sql语句,然后在每个查询的列上配置 usage,可以将 usage配置为 LABEL或 GAUGE、 COUNTER。刚学这些的人肯定不懂术语。因此,我决定还是在 github上搜一下,看看是否有这方面的配置案例。

在此感谢 Feng Ruohang,它提供了一些参考配置。这些脚本还支持pg13。更新节奏杠杠的啊。

我主要参考了一下460-pg_query.yaml的配置。
pg_query:
name: pg_query
desc: PostgreSQL statement metrics, require pg_stat_statements installed in schema monitor, 9.4 ~ 12
query: |
SELECT datname, query, calls, total_time, min_time, max_time, mean_time, stddev_time, rows, blk_io_time FROM
(SELECT dbid,
queryid AS query,
sum(calls) AS calls,
sum(total_time) AS total_time,
min(min_time) AS min_time,
max(max_time) AS max_time,
max(mean_time) AS mean_time,
max(stddev_time) AS stddev_time,
sum(rows) AS rows,
sum(blk_read_time) + sum(blk_write_time) AS blk_io_time
FROM pg_stat_statements(false) pg_stat_statements(userid, dbid, queryid, query, calls,
total_time, min_time, max_time, mean_time, stddev_time, rows,
shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written,
local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written,
blk_read_time, blk_write_time)
WHERE dbid != 1 AND userid != 10 AND calls > 4
GROUP BY dbid, queryid ORDER BY total_time DESC LIMIT 64
) q NATURAL JOIN (SELECT oid AS dbid, datname FROM pg_database WHERE datname NOT IN ('postgres','template0','template1')) d;
ttl: 10
timeout: 1
min_version: 090400
max_version: 130000
tags:
- cluster
- extension:pg_stat_statements
- schema:monitor
metrics:
- datname:
usage: LABEL
description: database name
- query:
usage: LABEL
description: query identifier, bigint
- calls:
usage: COUNTER
description: times been executed
- total_time:
usage: COUNTER
description: Total time spent in the statement, in µs
- min_time:
usage: GAUGE
description: Minimum time spent in the statement, in µs
- max_time:
usage: GAUGE
description: Maximum time spent in the statement, in µs
- mean_time:
usage: GAUGE
description: Mean time spent in the statement, in µs
- stddev_time:
usage: GAUGE
description: Population standard deviation of time spent in the statement, in µs
- rows:
usage: COUNTER
description: rows retrieved or affected by the statement
- blk_io_time:
usage: COUNTER
description: time spent reading/writing blocks in µs (if track_io_timing is enabled)
手动run一下查看执行结果。

配置好环境变量后,直接运行postgres_exporter把agent调起来。
export DATA_SOURCE_NAME="postgresql://postgres:Sqlite123@192.168.56.119:5432/postgres?sslmode=disable"
export PG_EXPORTER_EXTEND_QUERY_PATH="/root/postgres_exporter_v0.8.0_linux-amd64/queries.yaml"
./postgres_exporter &

然后就是在Grafana中显示top sql。先增加一个panel。然后选择图表为表格,Metrics填topk(10, pg_query_13_mean_time{})。topk可以简单理解取top 10的值。这里取的是pg_query_13_mean_time这项数据,代表的是平均执行时长。我们也可以选择其它的比如pg_query_13_calls(调用次数)。pg_query_13_blk_io_time(读写IO消耗的时间)。

Metrics配置好了,需要配置表格的字段和字段的类型,列出你想列出的字段值

右边可以在配置一个图形搭配着显示,经过最终配置,完成效果如下:

虽然配置成功,但是感觉总是差那么点意思。
后记
今天主要还在监控上折腾,说实话刚刚开始碰Prometheus,遇到了很多问题。这里还是要感谢一下Feng Ruohang提供的脚本,少走了一些弯路。
PostgreSQL中文社区欢迎广大技术人员投稿
投稿邮箱:press@postgres.cn
情侣网名霸气恩爱:https://www.nanss.com/mingcheng/316.html 好听的游戏id:https://www.nanss.com/mingcheng/311.html 工作经历证明:https://www.nanss.com/gongzuo/187.html 尉迟恭是怎么死的:https://www.nanss.com/wenti/119.html 游戏名字简单干净:https://www.nanss.com/mingcheng/272.html 诗意的网名:https://www.nanss.com/mingcheng/393.html 电台文本情感故事:https://www.nanss.com/gongzuo/143.html 家常排骨汤:https://www.nanss.com/yinshi/88.html 好词好句摘抄大全:https://www.nanss.com/xuexi/37.html 物业工作总结:https://www.nanss.com/gongzuo/48.html 群聊名称:https://www.nanss.com/mingcheng/406.html 白色象征什么:https://www.nanss.com/wenti/121.html 13个完美请假借口:https://www.nanss.com/shenghuo/365.html 仙女ins森系网名:https://www.nanss.com/mingcheng/349.html 网名男生霸气冷酷好听:https://www.nanss.com/mingcheng/145.html 诗意又撩人的游戏名字:https://www.nanss.com/mingcheng/273.html 抖音酷到爆的句子:https://www.nanss.com/wenan/11.html 毕业登记表自我鉴定:https://www.nanss.com/xuexi/405.html 昵称大全男:https://www.nanss.com/mingcheng/494.html 论文指导记录:https://www.nanss.com/gongzuo/386.html 俱乐部名字:https://www.nanss.com/mingcheng/50.html 实习自我鉴定:https://www.nanss.com/xuexi/336.html 阴宅风水:https://www.nanss.com/jiaju/77.html 细节决定成败的名言:https://www.nanss.com/xuexi/25.html 新年文案:https://www.nanss.com/wenan/10.html 仙气超甜的网名:https://www.nanss.com/mingcheng/361.html 微信昵称:https://www.nanss.com/mingcheng/297.html 芋头怎么吃:https://www.nanss.com/yinshi/94.html 男生微信网名:https://www.nanss.com/mingcheng/488.html 低调网名有男人味:https://www.nanss.com/mingcheng/375.html