我们的文章会在微信公众号IT民工的龙马人生和博客网站( www.htz.pw )同步更新 ,欢迎关注收藏,也欢迎大家转载,但是请在文章开始地方标注文章出处,谢谢!
由于博客中有大量代码,通过页面浏览效果更佳。
这个案例是PG环境中归档空间耗尽,导致PG数据库异常关闭的的一个案例,这个案例重点在于分享一下,怎么快速的去找到是什么SQL语句导致空间耗尽的。
1 故障现象
这里看到归档失败,No space left on device没有多余的空间。
2025-07-28 05:42:47.499 CST,,,1789,,6886966a.6fd,1,,2025-07-28 05:13:14 CST,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: test ! -f /postgresql/archive/1721/000000080000002600000006 && cp pg_wal/000000080000002600000006 /postgresql/archive/1721/000000080000002600000006",,,,,,,"pgarch_archiveXlog, pgarch.c:589","","archiver",,0
2025-07-28 05:42:47.519 CST,,,4291,,68869ced.10c3,1,,2025-07-28 05:41:01 CST,5/1481,0,PANIC,53100,"could not write to file ""pg_wal/xlogtemp.4291"": No space left on device",,,,,"writing block 279403 of relation base/13025/65649
while scanning block 279435 of relation ""public.big_table""",,,"XLogFileInit, xlog.c:3426","","autovacuum worker",,0
2025-07-28 05:42:47.696 CST,,,1782,,68869669.6f6,6,,2025-07-28 05:13:13 CST,,0,LOG,00000,"server process (PID 4291) was terminated by signal 6: Aborted","Failed process was running: autovacuum: VACUUM ANALYZE public.big_table",,,,,,,"LogChildExit, postmaster.c:3770","","postmaster",,0
2 分析过程
2.1 确认空间消耗
[postgres@pgc ~]$ du -m /|sort -n
4166 /postgresql/pgdata/14/15/1721/base/41025
34399 /postgresql/pgdata/14/15/1721/base/13025
38600 /postgresql/pgdata/14/15/1721/base
44545 /postgresql/pgdata/14/15/1721/pg_wal
83346 /postgresql/pgdata
83346 /postgresql/pgdata/14
83346 /postgresql/pgdata/14/15
83346 /postgresql/pgdata/14/15/1721
112771 /postgresql/archive
112771 /postgresql/archive/1721
196194 /postgresql
199240 /
这里看到归档目录耗了100G左右。
2.2 按小时分析归档文件个数
注意PG中归档日志的大小跟Oracle有一些区别,归档文件在操作系统中的大小并不代码着实际消耗的大小。比如连续的switch wal,生成的归档日志文件大小都是跟wal一样的大的。
[postgres@pgc pgsql]$ sh ./archive_number_count.sh /postgresql/archive/1721
Date Hour Count Size(MB)
2025-07-27 05 1 512.00
2025-07-28 05 83 42496.00
2025-07-28 21 51 26112.00
2025-07-28 23 2 1024.00
2.3 分析wal的内容
这里随便选择一个归档日志来分析:
[postgres@pgc pgsql]$ sh ./archive_count_sql.sh /postgresql/archive/1721/000000080000002600000001
Operation RelFileNode Count Total_Bytes
LOG 1663/13025/65648 76056 7529544
INSERT 1663/13025/65649 2509865 358910695
FPI_FOR_HINT 1663/13025/65649 9 441
INSERT_LEAF 1663/13025/65655 2503007 160192448
SPLIT_R 1663/13025/65655 6882 5206752
INSERT_UPPER 1663/13025/65655 6858 493776
这里看到在1663/13025/65649上存在大量的insert操作,生成的日志大概是358910695。
3.4 查询对象及定位SQL语句
SELECT
c.relfilenode,
c.relname,
n.nspname AS schema_name,
CASE c.relkind
WHEN 'r' THEN 'table'
WHEN 'i' THEN 'index'
WHEN 't' THEN 'TOAST'
ELSE c.relkind
END AS reltype
FROM
pg_class c
JOIN
pg_namespace n ON c.relnamespace = n.oid
WHERE
c.relfilenode IN (65649, 65655);
relfilenode | relname | schema_name | reltype
-------------+----------------+-------------+---------
65649 | big_table | public | t
65655 | big_table_pkey | public | i
可以定位到对象信息。
[postgres@pgc log]$ grep -Ei 'big_table' postgresql-2025-07-28_051314.csv
Query Text: INSERT INTO big_table (data1, data2, data3, data4)
Insert on public.big_table (cost=0.00..12500000.00 rows=0 width=0) (actual time=1345310.410..1345310.412 rows=0 loops=1)
Output: nextval('big_table_id_seq'::regclass), ""*SELECT*"".md5, ""*SELECT*"".md5_1, ""*SELECT*"".int4, ""*SELECT*"".""?column?""
定位到SQL语句
3 总结
上面只是简单的记录了如果快速的分析PG环境中归档日志耗尽,帮助我们快速的得到对应的对象和SQL语句,提升我们故障分析的效率。
——————作者介绍———————–
姓名:黄廷忠
现就职:Oracle中国高级服务团队
曾就职:OceanBase、云和恩墨、东方龙马等
电话、微信、QQ:18081072613
个人博客: (http://www.htz.pw)
CSDN地址: (https://blog.csdn.net/wwwhtzpw)
博客园地址: (https://www.cnblogs.com/www-htz-pw)
PG故障处理:PG归档空间耗尽案例分析:等您坐沙发呢!