Add recovery mechanism for node failure scenarios. #277
This commit introduces a comprehensive recovery system that enables data synchronization when a node fails in a multi-node replication cluster. The implementation adds rescue subscriptions that recover missing transactions from peer nodes using recovery slots and forwarding mechanisms. This addresses the critical need to keep data consistent across all nodes in a distributed replication environment when one or more nodes become unavailable or fall behind in replication progress.
The recovery system tracks subscription state through additional fields that indicate rescue status, temporary subscription flags, and recovery boundaries defined by LSN positions and timestamps. These fields allow the system to distinguish between normal subscriptions and temporary rescue subscriptions created during recovery operations. Recovery slots preserve WAL history for rescue operations, allowing lagging nodes to catch up by replaying transactions from a more advanced peer node that has successfully received and applied the missing data.
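As a rough sketch of the state described above (the actual catalog column and struct names are not shown in this PR, so all names here are hypothetical), the extra per-subscription fields and the catch-up check might look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL position (LSN), as in PostgreSQL */
typedef int64_t  TimestampTz;       /* microseconds since epoch, as in PostgreSQL */

/* Hypothetical extra subscription state: rescue status, temporary flag,
 * and recovery boundaries given as LSN positions and a timestamp. */
typedef struct SubscriptionState
{
    bool        is_rescue;          /* true for a temporary rescue subscription */
    bool        is_temporary;       /* drop automatically once recovery finishes */
    XLogRecPtr  recovery_start_lsn; /* first LSN the rescue must replay */
    XLogRecPtr  recovery_end_lsn;   /* LSN at which recovery is complete */
    TimestampTz recovery_deadline;  /* optional time bound on the operation */
} SubscriptionState;

/* A rescue subscription has caught up once it has applied WAL at or
 * past its recovery boundary; normal subscriptions never report this. */
static bool
rescue_caught_up(const SubscriptionState *sub, XLogRecPtr applied_lsn)
{
    return sub->is_rescue && applied_lsn >= sub->recovery_end_lsn;
}
```

The boundary fields are what let the system tell a temporary rescue subscription apart from a normal one and decide when cleanup can begin.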
A forwarding-based recovery procedure configures subscriptions to forward transactions from failed node origins, enabling automatic recovery without requiring manual WAL replay. This approach leverages the existing replication infrastructure to cascade transactions from nodes that have the missing data to nodes that need to catch up. The forwarding mechanism works by updating subscription parameters to include all transaction origins, ensuring that transactions originally from the failed node are propagated through the replication topology to reach the lagging node.
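In stock PostgreSQL the subscription parameter that controls this is `origin`; setting it to `'any'` makes the subscription accept transactions from all origins, including those that originated on the failed node. A minimal sketch of composing that command (the subscription name and buffer handling are illustrative only):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the SQL that switches a subscription to forward transactions
 * from all origins, so changes that originated on the failed node can
 * cascade through a surviving peer to the lagging node. */
static int
build_forwarding_command(char *buf, size_t buflen, const char *subname)
{
    return snprintf(buf, buflen,
                    "ALTER SUBSCRIPTION %s SET (origin = 'any');", subname);
}
```

Because this reuses the existing subscription machinery, no manual WAL replay is needed; the replication topology itself carries the missing transactions to the node that needs them.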
The system includes helper functions for monitoring recovery progress and verifying data consistency across nodes. These functions allow administrators to track the status of recovery operations, verify that data has been successfully synchronized, and ensure that all nodes have reached a consistent state. The recovery process can be monitored through subscription status views and custom recovery status functions that report the current state of rescue operations.
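The consistency check those helpers perform can be sketched as follows, assuming a hypothetical per-node status record of the kind a recovery status function might report (field names are illustrative, not the actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL position (LSN) */

/* Hypothetical per-node summary for monitoring recovery progress. */
typedef struct NodeStatus
{
    uint64_t    row_count;          /* rows in the replicated table */
    XLogRecPtr  applied_lsn;        /* furthest remote LSN this node has applied */
    bool        rescue_active;      /* a rescue subscription is still running */
} NodeStatus;

/* The cluster is considered synchronized once no rescue subscription is
 * still active, every node reports the same row count, and each node has
 * applied WAL at least up to the recovery target. */
static bool
cluster_synchronized(const NodeStatus *nodes, int nnodes, XLogRecPtr target_lsn)
{
    for (int i = 0; i < nnodes; i++)
    {
        if (nodes[i].rescue_active ||
            nodes[i].row_count != nodes[0].row_count ||
            nodes[i].applied_lsn < target_lsn)
            return false;
    }
    return true;
}
```

An administrator polling subscription status views would apply essentially this predicate to decide when a recovery operation has finished.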
Recovery slots are managed through a dedicated shared memory context that tracks active recovery slots across the cluster. The recovery slot management system ensures that WAL is preserved for rescue operations by maintaining logical replication slots that can be cloned for use by rescue subscriptions. The slot management includes mechanisms to advance recovery slots to the minimum position across all peer subscriptions, ensuring that historical transactions remain available for recovery operations even as normal replication progresses.
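The advance rule for a recovery slot reduces to a minimum over peer positions with a monotonicity guard. A minimal sketch, with hypothetical names (the real slot management lives in shared memory and is not shown here):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL position (LSN) */

/* A recovery slot may only move forward, and never past the slowest peer
 * subscription, so WAL needed by any lagging peer stays available for
 * rescue operations while normal replication progresses. */
static XLogRecPtr
recovery_slot_target(XLogRecPtr current,
                     const XLogRecPtr *peer_confirmed, int npeers)
{
    XLogRecPtr min = peer_confirmed[0];

    for (int i = 1; i < npeers; i++)
        if (peer_confirmed[i] < min)
            min = peer_confirmed[i];

    /* never move the slot backwards */
    return (min > current) ? min : current;
}
```

Cloning such a slot for a rescue subscription then gives the rescue a starting point at which all historical transactions it might need are still retained.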
The implementation includes a cluster management script that facilitates testing and demonstration of recovery scenarios. This script automates the creation of multi-node replication clusters, simulates node failures, and verifies recovery operations. The script provides detailed output about the state of each node including row counts, LSN positions, and subscription statuses, making it easier to understand and debug recovery scenarios.
Recovery operations are designed to be transparent to applications running on the cluster. The system automatically handles the creation and cleanup of temporary rescue subscriptions, ensuring that recovery operations do not interfere with normal replication once recovery is complete. The recovery system integrates seamlessly with the existing subscription management infrastructure, allowing recovery to proceed without manual intervention once initiated.