PG系列：pg_probackup的时间点恢复失败案例分析

我们的文章会在微信公众号IT民工的龙马人生和博客网站( www.htz.pw )同步更新，欢迎关注收藏，也欢迎大家转载，但是请在文章开始地方标注文章出处，谢谢！
由于博客中有大量代码，通过页面浏览效果更佳。

今天在和朋友聊到PG环境中那个备份的工具最好用时，他提到pg_probackup在时间点恢复时存在bug，”无法跨时间线做时间点恢复“，不建议在生产环境中使用。其实想一想，基于时间线的恢复是一个非常基础功能，pg_probackup工具做为PG环境中非常专业的备份工具了，经过这么多年迭代和应用，应该不会出现这么低级的BUG，并且时间点恢复本身是PG自带功能，跟pg_probackup也没有关系。向朋友拿到对方的操作流程，在自己的环境中模拟了一遍。现象还真模拟出来了，结论就是：不是bug，但是工具的官方手册确实存在一些不严谨，容易出现一下错误的理解，误导大家的操作，下面是整个过程：

环境介绍

MacOS环境中的PG数据库，版本为14.15

yihan:1721 yihan$ psql  -c "SELECT version();"
                                                            version                                                            
-------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.15 (Homebrew) on aarch64-apple-darwin24.1.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.4), 64-bit
(1 row)

yihan:1721 yihan$  pg_probackup --version
pg_probackup 2.5.15 (PostgreSQL 14.15 (Homebrew))

对数据库进行备份

这里直接跳过pg_probackup工具的配置环境，如果需要了解，直接google一下有很多文章

pg_probackup backup -B /postgresql/probackup/1721 --instance=1721 -b full

这里备份的日志跳过。

还原测试

这里多模拟一次还原、打开数据库，还原、打开数据库，每一次还原后打开数据库，时间线都会增加一，这样就就可以模拟我们后面的跨时间线恢复的案例了。

yihan:15 yihan$ pg_probackup restore -B /postgresql/probackup/1721 --instance=1721  --recovery-target="latest"
INFO: Tablespace 24639 will be restored using old path "/postgresql/pgdata/14/15/tblspace"
INFO: Validating backup T0WNSK
INFO: Backup T0WNSK data files are valid
INFO: Backup T0WNSK WAL segments are valid
INFO: Backup T0WNSK is valid.
INFO: Restoring the database from backup T0WNSK
INFO: Start restoring backup files. PGDATA size: 16GB
INFO: Backup files are restored. Transfered bytes: 16GB, time elapsed: 27s
INFO: Restore incremental ratio (less is better): 100% (16GB/16GB)
INFO: Syncing restored files to disk
INFO: Restored backup files are synced, time elapsed: 0
INFO: Restore of backup T0WNSK completed.

打开数据库

yihan:1721 yihan$ pg_ctl start -D $PGDATA
yihan:1721 yihan$ psql -c "select pg_wal_replay_resume();"

模拟时间点恢复

获取还原点

这里删除了一张test表，手动做了一次wal的切换，wal的信息为：9E/D9D0F8D8

postgres=# select pg_wal_replay_resume();
postgres=# drop table test;
DROP TABLE
postgres=# select pg_switch_wal();
 pg_switch_wal
---------------
 9E/D9D0F8D8
(1 row)

postgres=# SELECT pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name
--------------------------
 000000050000009E00000007
(1 row)

这里注意wal的日志文件前面的timelineid为5，这个信息我们在后面会用到

基于wal位置来还原

数据库全库还原

yihan:15 yihan$ pg_probackup restore -B /postgresql/probackup/1721 --instance=1721  --recovery-target-lsn="9E/D9D0F8D8"
INFO: Tablespace 24639 will be restored using old path "/postgresql/pgdata/14/15/tblspace"
INFO: Validating backup T0WNSK
INFO: Backup T0WNSK data files are valid
INFO: Backup validation completed successfully on time 2025-08-13 08:01:07+08, xid 20560148 and LSN 9E/D9D11708
INFO: Backup T0WNSK is valid.
INFO: Restoring the database from backup T0WNSK
INFO: Start restoring backup files. PGDATA size: 16GB
INFO: Backup files are restored. Transfered bytes: 16GB, time elapsed: 28s
INFO: Restore incremental ratio (less is better): 100% (16GB/16GB)
INFO: Syncing restored files to disk
INFO: Restored backup files are synced, time elapsed: 0
INFO: Restore of backup T0WNSK completed.

前滚数据库

启动数据库时，PG会自动前滚数据库到还原时间点。

yihan:15 yihan$ pg_ctl start -D $PGDATA
waiting for server to start....2025-08-13 08:25:18 CST [57492]: [1-1] user=,db= WARNING:  42704: unrecognized configuration parameter "pg_store_plans.verbose"
2025-08-13 08:25:18 CST [57492]: [2-1] user=,db= LOCATION:  EmitWarningsOnPlaceholders, guc.c:9367
2025-08-13 08:25:19 CST [57492]: [3-1] user=,db= LOG:  00000: redirecting log output to logging collector process
2025-08-13 08:25:19 CST [57492]: [4-1] user=,db= HINT:  Future log output will appear in directory "log".
2025-08-13 08:25:19 CST [57492]: [5-1] user=,db= LOCATION:  SysLogger_Start, syslogger.c:674
. done
server started

还原失败

这里看到test表仍然还在，说明基于还原点的还原失败

postgres=# select count(*) from test;
 count
-------
   432
(1 row)

分析后台日志

2025-08-13 08:25:19 CST [57492]: [11-1] user=,db= LOCATION:  StreamServerPort, pqcomm.c:575
2025-08-13 08:25:19 CST [57494]: [1-1] user=,db= LOG:  00000: database system was interrupted; last known up at 2025-08-13 07:55:32 CST
2025-08-13 08:25:19 CST [57494]: [2-1] user=,db= LOCATION:  StartupXLOG, xlog.c:6590
2025-08-13 08:25:19 CST [57494]: [3-1] user=,db= LOG:  00000: creating missing WAL directory "pg_wal/archive_status"
2025-08-13 08:25:19 CST [57494]: [4-1] user=,db= LOCATION:  ValidateXLOGDirectoryStructure, xlog.c:4350
2025-08-13 08:25:19 CST [57494]: [5-1] user=,db= LOG:  00000: starting point-in-time recovery to WAL location (LSN) "9E/D9D0F8D8"
2025-08-13 08:25:19 CST [57494]: [6-1] user=,db= LOCATION:  StartupXLOG, xlog.c:6668
INFO: pg_probackup archive-get WAL file: 00000003.history, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 0ms
2025-08-13 08:25:19 CST [57494]: [7-1] user=,db= LOG:  00000: restored log file "00000003.history" from archive
2025-08-13 08:25:19 CST [57494]: [8-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
INFO: pg_probackup archive-get WAL file: 000000030000009E00000002, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 797ms
2025-08-13 08:25:20 CST [57494]: [9-1] user=,db= LOG:  00000: restored log file "000000030000009E00000002" from archive
2025-08-13 08:25:20 CST [57494]: [10-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
2025-08-13 08:25:20 CST [57494]: [11-1] user=,db= LOG:  00000: redo starts at 9E/40000060
2025-08-13 08:25:20 CST [57494]: [12-1] user=,db= LOCATION:  StartupXLOG, xlog.c:7393
INFO: pg_probackup archive-get WAL file: 000000030000009E00000003, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 705ms
2025-08-13 08:25:20 CST [57494]: [13-1] user=,db= LOG:  00000: restored log file "000000030000009E00000003" from archive
2025-08-13 08:25:20 CST [57494]: [14-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
2025-08-13 08:25:20 CST [57494]: [15-1] user=,db= LOG:  00000: consistent recovery state reached at 9E/60000128
2025-08-13 08:25:20 CST [57494]: [16-1] user=,db= LOCATION:  CheckRecoveryConsistency, xlog.c:8365
2025-08-13 08:25:20 CST [57492]: [12-1] user=,db= LOG:  00000: database system is ready to accept read-only connections
2025-08-13 08:25:20 CST [57492]: [13-1] user=,db= LOCATION:  sigusr1_handler, postmaster.c:5233
INFO: pg_probackup archive-get WAL file: 000000030000009E00000004, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 866ms
2025-08-13 08:25:21 CST [57494]: [17-1] user=,db= LOG:  00000: restored log file "000000030000009E00000004" from archive
2025-08-13 08:25:21 CST [57494]: [18-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
INFO: pg_probackup archive-get WAL file: 000000030000009E00000005, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 740ms
2025-08-13 08:25:23 CST [57494]: [19-1] user=,db= LOG:  00000: restored log file "000000030000009E00000005" from archive
2025-08-13 08:25:23 CST [57494]: [20-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
INFO: pg_probackup archive-get WAL file: 000000030000009E00000006, remote: none, threads: 1/1, batch: 1
INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 711ms
2025-08-13 08:25:24 CST [57494]: [21-1] user=,db= LOG:  00000: restored log file "000000030000009E00000006" from archive
2025-08-13 08:25:24 CST [57494]: [22-1] user=,db= LOCATION:  RestoreArchivedFile, xlogarchive.c:215
2025-08-13 08:25:26 CST [57494]: [23-1] user=,db= LOG:  00000: recovery stopping after WAL location (LSN) "9E/D9D11708"
2025-08-13 08:25:26 CST [57494]: [24-1] user=,db= LOCATION:  recoveryStopsAfter, xlog.c:5993
2025-08-13 08:25:26 CST [57494]: [25-1] user=,db= LOG:  00000: pausing at the end of recovery
2025-08-13 08:25:26 CST [57494]: [26-1] user=,db= HINT:  Execute pg_wal_replay_resume() to promote.
2025-08-13 08:25:26 CST [57494]: [27-1] user=,db= LOCATION:  recoveryPausesHere, xlog.c:6113

从日志中看到recovery stopping after WAL location (LSN) "9E/D9D11708"还原点，跟我们上面接的不一样。wal日志文件为：000000030000009E00000006，这里看到跟我们前面的000000050000009E00000007中的时间线完全不一样。
所以从这里看还原失败了。

原因分析

查看auto.conf的文件，这里看到recovery_target_timeline为current的值，说明只还原到当前时间线，跟我们期待的时间线为5完全不一样了。

recovery_target_lsn = '9E/D9D0F8D8'
recovery_target_timeline = 'current'
recovery_target_action = 'pause'

故障解决方案

手动指定–recovery-target-timeline=为latest的值

yihan:15 yihan$ pg_probackup restore -B /postgresql/probackup/1721 --instance=1721  --recovery-target-xid=20560167  --recovery-target-timeline='latest'
INFO: Tablespace 24639 will be restored using old path "/postgresql/pgdata/14/15/tblspace"
INFO: Validating backup T0WNSK
INFO: Backup T0WNSK data files are valid
INFO: Backup validation completed successfully on time 2025-08-13 08:01:07+08, xid 20560167 and LSN 9E/D9D13898
INFO: Backup T0WNSK is valid.
INFO: Restoring the database from backup T0WNSK
INFO: Start restoring backup files. PGDATA size: 16GB
INFO: Backup files are restored. Transfered bytes: 16GB, time elapsed: 26s
INFO: Restore incremental ratio (less is better): 100% (16GB/16GB)
INFO: Syncing restored files to disk
INFO: Restored backup files are synced, time elapsed: 0
INFO: Restore of backup T0WNSK completed.


# recovery settings added by pg_probackup restore of backup T0WNSK at '2025-08-13 21:35:28+08'
## recovery settings
recovery_target_xid = '20560167'
recovery_target_timeline = 'latest'
recovery_target_action = 'pause'



yihan:1721 yihan$ pg_ctl start -D $PGDATA
waiting for server to start....2025-08-13 21:38:23 CST [86937]: [1-1] user=,db= WARNING:  42704: unrecognized configuration parameter "pg_store_plans.verbose"
2025-08-13 21:38:23 CST [86937]: [2-1] user=,db= LOCATION:  EmitWarningsOnPlaceholders, guc.c:9367
2025-08-13 21:38:23 CST [86937]: [3-1] user=,db= LOG:  00000: redirecting log output to logging collector process
2025-08-13 21:38:23 CST [86937]: [4-1] user=,db= HINT:  Future log output will appear in directory "log".
2025-08-13 21:38:23 CST [86937]: [5-1] user=,db= LOCATION:  SysLogger_Start, syslogger.c:674
. done
server started


postgres=# select count(*) from test;
ERROR:  relation "test" does not exist
LINE 1: select count(*) from test;
                             ^

探索current的原因

在pg_probackup中的restore.c的源代码中有如下的代码，下面一段代码是关于restore命令的参数初始化的过程，代码如下：

static void
print_recovery_settings(InstanceState *instanceState, FILE *fp, pgBackup *backup,
							   pgRestoreParams *params, pgRecoveryTarget *rt)
{
	char restore_command_guc[16384];
	fio_fprintf(fp, "## recovery settings\n");

	/* If restore_command is provided, use it. Otherwise construct it from scratch. */
	if (instance_config.restore_command &&
		(pg_strcasecmp(instance_config.restore_command, "none") != 0))
		sprintf(restore_command_guc, "%s", instance_config.restore_command);
	else
	{
		/* default cmdline, ok for local restore */
		sprintf(restore_command_guc, "\"%s\" archive-get -B \"%s\" --instance \"%s\" "
				"--wal-file-path=%%p --wal-file-name=%%f",
				PROGRAM_FULL_PATH ? PROGRAM_FULL_PATH : PROGRAM_NAME,
				/* TODO What is going on here? Why do we use catalog path as wal-file-path? */
				instanceState->catalog_state->catalog_path, instanceState->instance_name);

		/* append --remote-* parameters provided via --archive-* settings */
		if (instance_config.archive.host)
		{
			strcat(restore_command_guc, " --remote-host=");
			strcat(restore_command_guc, instance_config.archive.host);
		}

		if (instance_config.archive.port)
		{
			strcat(restore_command_guc, " --remote-port=");
			strcat(restore_command_guc, instance_config.archive.port);
		}

		if (instance_config.archive.user)
		{
			strcat(restore_command_guc, " --remote-user=");
			strcat(restore_command_guc, instance_config.archive.user);
		}
	}

	/*
	 * We've already checked that only one of the four following mutually
	 * exclusive options is specified, so the order of calls is insignificant.
	 */
	if (rt->target_name)
		fio_fprintf(fp, "recovery_target_name = '%s'\n", rt->target_name);

	if (rt->time_string)
		fio_fprintf(fp, "recovery_target_time = '%s'\n", rt->time_string);

	if (rt->xid_string)
		fio_fprintf(fp, "recovery_target_xid = '%s'\n", rt->xid_string);

	if (rt->lsn_string)
		fio_fprintf(fp, "recovery_target_lsn = '%s'\n", rt->lsn_string);

	if (rt->target_stop && (strcmp(rt->target_stop, "immediate") == 0))
		fio_fprintf(fp, "recovery_target = '%s'\n", rt->target_stop);

	if (rt->inclusive_specified)
		fio_fprintf(fp, "recovery_target_inclusive = '%s'\n",
				rt->target_inclusive ? "true" : "false");

	if (rt->target_tli)
		fio_fprintf(fp, "recovery_target_timeline = '%u'\n", rt->target_tli);
	else
	{
		if (rt->target_tli_string)
			fio_fprintf(fp, "recovery_target_timeline = '%s'\n", rt->target_tli_string);
		else if (rt->target_stop && (strcmp(rt->target_stop, "latest") == 0))
			fio_fprintf(fp, "recovery_target_timeline = 'latest'\n");
#if PG_VERSION_NUM >= 120000
		else
		{
			/*
			 * In PG12 default recovery target timeline was changed to 'latest', which
			 * is extremely risky. Explicitly preserve old behavior of recovering to current
			 * timneline for PG12.
			 */
			fio_fprintf(fp, "recovery_target_timeline = 'current'\n");
		}
#endif
	}

	if (rt->target_action)
		fio_fprintf(fp, "recovery_target_action = '%s'\n", rt->target_action);
	else
		/* default recovery_target_action is 'pause' */
		fio_fprintf(fp, "recovery_target_action = '%s'\n", "pause");

	elog(LOG, "Setting restore_command to '%s'", restore_command_guc);
	fio_fprintf(fp, "restore_command = '%s'\n", restore_command_guc);
}

其中最关键的代码为fio_fprintf(fp, "recovery_target_timeline = ‘current’\n");前后几行代码，大概的意思就是如果时间线没有配置，那么recovery_target_timeline的默认值为current，这个就是为什么在做时间点恢复时出现BUG，未达到预期的目的。

思考的内容

在数据文件还原时能否提前对还原点根据timelineid.history做自动化的判断，如果能判断到唯一的时间线，那么自动修改参数，如果在多个时间线，那么给出提示和警告，有相关的人员手动确定时间线。

总结

这个案例是BUG又不是BUG，其实有很多类似的问题和故障，核心的原因还是我们自己对知识点不熟悉导致考虑不周全而出现的错误，又由于知识不全面而无法分析到原因。在开源数据库和开源工具中很好的弥补了前面的问题，可以通过对源码的分析来找到问题的原因。此时让我想起了一句话：”一白遮百丑“，其实在当前数据库运维行业中也是一样，如果能完全理解源代码，那么能很好的屏蔽在其它方面的不足。时代在变，数据库运维行业已经从原来的通过经验累积的模式过度到更有知识的深度、更有见识的广度的时代了。

——————作者介绍———————–
姓名：黄廷忠
现就职：Oracle中国高级服务团队
曾就职：OceanBase、云和恩墨、东方龙马等
电话、微信、QQ：18081072613
个人博客: (http://www.htz.pw)
CSDN地址: (https://blog.csdn.net/wwwhtzpw)
博客园地址: (https://www.cnblogs.com/www-htz-pw)