Background
In the current Admin architecture, Discovery and Engine are the two components responsible for watching the registry and the runtime engine, respectively; all of Admin's data originates from these two components. When Admin is deployed with multiple replicas, each replica performs its own list-watch and stores the retrieved data in its local store. Due to factors such as network latency, this can lead to data inconsistency across replicas. These data-source components therefore need to evolve toward a master-slave architecture, ensuring that at any given time only one replica actively performs list-watch operations and writes data into the store.

Solutions
Kubernetes Lease
The first possible approach is to leverage Kubernetes Leases for leader election. Many Kubernetes-native applications use Leases for this purpose; however, this method requires that the Engine component be of Kubernetes type.
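For reference, the object that client-go's leader-election helpers manage is a plain `coordination.k8s.io/v1` Lease: the current leader keeps renewing it, and followers take over once it expires. A minimal Lease looks like the sketch below (the name, namespace, and holder identity are placeholders, not Admin's actual manifests):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: admin-leader        # hypothetical lock name
  namespace: admin-system   # hypothetical namespace
spec:
  holderIdentity: admin-replica-0   # identity of the current leader replica
  leaseDurationSeconds: 15          # followers may claim the lease after this expires
```

In practice Admin would not create this object by hand; the election library on each replica competes to create and renew it.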
Gorm + DB
The second possible approach is to implement leader election ourselves using GORM and a database, leveraging the atomicity of database operations (such as `UPDATE ... WHERE` combined with unique constraints or version numbers) to implement a lease mechanism and thereby elect a single leader.
Tips
Currently, Admin's component lifecycle methods include init and start. Leader election should be performed during the start phase. Only components that require leader election should execute the election logic; the corresponding components on follower nodes must not execute business logic. When the leader node fails, the remaining follower nodes should be able to re-elect a new leader to take over and execute the relevant business logic.
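The lifecycle rules in the Tips section can be sketched as follows. This is a minimal in-process model (the `elector`, `component`, and `campaign`/`resign` names are hypothetical, standing in for whichever backend, Kubernetes Lease or the database table, is configured): Init runs on every replica, Start campaigns during the start phase, only the winner runs business logic, and a follower takes over after the leader resigns or fails.

```go
package main

import (
	"fmt"
	"sync"
)

// elector is a hypothetical stand-in for the configured election backend.
type elector struct {
	mu     sync.Mutex
	leader string
}

// campaign returns true if id becomes (or already is) the leader.
func (e *elector) campaign(id string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.leader == "" || e.leader == id {
		e.leader = id
		return true
	}
	return false
}

// resign releases leadership so the followers can re-elect.
func (e *elector) resign(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.leader == id {
		e.leader = ""
	}
}

// component mirrors Admin's lifecycle: Init always runs; Start only
// launches the list-watch business logic on the elected leader.
type component struct {
	id      string
	e       *elector
	running bool
}

func (c *component) Init() { /* wiring that every replica performs */ }

func (c *component) Start() {
	if c.e.campaign(c.id) {
		c.running = true // leader: begin list-watch and store writes
	} // follower: stay idle; a real implementation re-campaigns periodically
}

func main() {
	e := &elector{}
	a := &component{id: "replica-a", e: e}
	b := &component{id: "replica-b", e: e}
	a.Init()
	b.Init()
	a.Start()
	b.Start()
	fmt.Println(a.running, b.running) // only replica-a runs business logic

	// Leader failure: replica-a resigns, replica-b re-campaigns and takes over.
	e.resign("replica-a")
	b.Start()
	fmt.Println(b.running)
}
```

A production version would run the campaign in a loop with a timeout tied to the lease TTL rather than relying on an explicit resign.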