
Conversation

the-glu (Contributor) commented Dec 3, 2025

The implementation of a single IP for the yugabyte master/tserver in #1259 had a hidden issue: using different ports on different services doesn't work well in GKE (and minikube), as the load balancer may target a tserver pod on a master port, or a master pod on a tserver port. This wasn't spotted before because connections are retried and eventually established after a few tries.

This PR fixes the issue by introducing an intermediate, internal proxy that redirects tserver ports only to the tserver services, and master ports only to the master services. Two proxies are used for redundancy; they simply forward TCP connections at layer 4.

[image]

The new proxy also forwards only the 'secure' ports and drops the non-encrypted ones, which are used only for yugabyte's web interface, contributing to #1214. yb-admin, used to manage the cluster, still works as it uses the secure ports.
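For illustration, here is a minimal sketch of the kind of layer-4 forwarding the proxy does (an haproxy.cfg excerpt; the port numbers and DNS names are assumptions for the example, the real values are in the manifests of this PR):

```
# Pure TCP (layer 4) forwarding: one frontend/backend pair per yugabyte
# component, so a master port can never reach a tserver pod and vice versa.
resolvers dns
    parse-resolv-conf

frontend yb-master-rpc
    mode tcp
    bind *:7100                    # assumed yb-master secure RPC port
    default_backend yb-masters

backend yb-masters
    mode tcp
    server yb-master-0 yb-master-0.yb-masters.default.svc.cluster.local:7100 check resolvers dns

frontend yb-tserver-rpc
    mode tcp
    bind *:9100                    # yb-tserver RPC port, as in the haproxy configuration in this PR
    default_backend yb-tservers

backend yb-tservers
    mode tcp
    server yb-tserver-0 yb-tserver-0.yb-tservers.default.svc.cluster.local:9100 check resolvers dns
```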

Documentation has also been updated.

Tested in an aws|helm <-> gke|tanka cluster, working as expected.

barroco (Contributor) left a comment

Have you investigated using Network load balancers targeting node ports directly? My expectation would be that the nodes will always respond on the two services, since both kinds of ports are expected on each node.
See https://docs.cloud.google.com/kubernetes-engine/docs/concepts/service-load-balancer#effect_of_externaltrafficpolicy, especially externalTrafficPolicy: Local.
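For reference, this is roughly what such a Service would look like (the name, selector and port below are illustrative, not taken from this PR):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: yb-master-lb              # hypothetical name for the example
spec:
  type: LoadBalancer
  # With Local, only nodes that actually run a matching pod pass the load
  # balancer health checks, so traffic is only sent to nodes hosting the
  # targeted component.
  externalTrafficPolicy: Local
  selector:
    app: yb-master
  ports:
    - name: master-rpc
      port: 7100
      targetPort: 7100
```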

barroco (Contributor) commented Dec 22, 2025

After further research and discussions, my previous comment does not take into account the fact that a cluster may have more nodes than the desired yugabyte replication.
One possible alternative would be to have the master and the tserver in the same pod. Since that is managed by the yugabyte helm chart at the moment, we would probably not want to push in that direction.
I am still investigating other options, to make sure we are confident there is no supported alternative.
A final alternative would be to accept that, for GKE, more load balancers are required.


In the meantime, could you please split the PR to isolate the public port adjustment?
Public Ports removed here: https://github.com/interuss/dss/pull/1310/changes#diff-fe7a89f92906834ad960b33ffd8e65731f8abaea4f10478c0ef90d2b0074ca71L205
Thanks

barroco (Contributor) commented Dec 22, 2025

Please split the documentation formatting improvement as well. Thank you.

the-glu (Contributor, Author) commented Dec 23, 2025

Yes, the ports and doc parts have been split into #1326 and #1327.

I'll keep this one as-is until further verification, or until the other ones are merged, for a proper rebase.

the-glu marked this pull request as draft on December 29, 2025 at 14:14.
barroco (Contributor) left a comment

I think we can move forward with this option for the moment; we may optimize later. In any case, we want a solution that avoids expensive limitations involving an unreasonable amount of cloud resources. Can you please make it optional for Google deployments, since AWS should not require it?

There's an upside to choosing this option: knowing that some yugabyte services are unsecured, we may use haproxy to secure them if we need to expose them to other participants.
In addition, it could be a solution to #1139.
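As a rough, purely hypothetical sketch of that follow-up (not part of this PR; the web UI port and certificate path are assumptions):

```
# Hypothetical TLS termination in front of an unsecured yugabyte web UI.
frontend yb-master-ui
    mode http
    bind *:443 ssl crt /etc/haproxy/certs/yb-ui.pem   # assumed certificate bundle
    default_backend yb-master-ui

backend yb-master-ui
    mode http
    server yb-master-0 yb-master-0.yb-masters.default.svc.cluster.local:7000   # assumed yb-master web UI port
```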


---
apiVersion: v1
kind: Secret

Would a ConfigMap not be more appropriate?
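A minimal sketch of what that could look like, assuming the proxy configuration contains nothing sensitive (the name follows the templated manifest in this PR):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: yugabyte-proxy-{{$i}}
data:
  haproxy.cfg: |
    defaults
      mode tcp
      timeout connect 5s
      timeout client  1h
      timeout server  1h
    # ... frontends/backends as defined in this PR ...
```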

name: yugabyte-proxy-{{$i}}
name: yugabyte-proxy-{{$i}}
spec:
replicas: 2

Can you please add a comment highlighting the fact that there are two replicas per DSS node, and the rationale behind it?
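For instance, something along these lines (suggested wording only; the redundancy rationale comes from the PR description):

```yaml
spec:
  # Two proxy replicas per DSS node: all yugabyte traffic goes through the
  # proxy, so a second replica keeps the datastore reachable while one proxy
  # pod is restarted or rescheduled.
  replicas: 2
```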

server yb-tserver-{{$i}} yb-tserver-{{$i}}.yb-tservers.default.svc.cluster.local:9100 check resolvers dns
---
apiVersion: apps/v1

This component should be documented somewhere, for instance here: https://github.com/interuss/dss/edit/master/docs/architecture.md
