-
Notifications
You must be signed in to change notification settings - Fork 135
feat: experimental location aware routing- Part 4 #4295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit adds the server abstraction interfaces for location-aware routing: - ChannelFinderServer: Interface representing a Spanner server endpoint with address, health check, and channel access - ChannelFinderServerFactory: Factory interface for creating and caching server connections - GrpcChannelFinderServerFactory: gRPC implementation that creates and manages gRPC channels for different server endpoints These interfaces enable the client to maintain connections to multiple Spanner servers and route requests directly to the appropriate server based on key location information. This is part of the experimental location-aware routing for improved latency.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Summary of ChangesHello @rahul2393, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a sophisticated client-side key range caching system for Google Cloud Spanner. The primary goal is to enhance the efficiency of request routing by allowing the client to intelligently determine the optimal Spanner backend server for a given key or key range. This is achieved through a hierarchical cache that stores information about key ranges, the groups of tablets responsible for those ranges, and the individual tablets and their associated gRPC channels, complete with health checks and leader preference. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a KeyRangeCache for routing information, which is a valuable performance enhancement. However, the current implementation has several critical issues that need to be addressed. There are significant concerns regarding thread safety and resource management, including memory and channel leaks stemming from an incomplete reference counting mechanism. Additionally, there are some medium-severity issues, such as the use of System.out.println for logging. I've provided specific comments and suggestions to resolve these problems.
| synchronized (channelBuilder) { | ||
| InstantiatingGrpcChannelProvider.Builder newBuilder = | ||
| channelBuilder.setEndpoint(addr); | ||
| return new GrpcChannelFinderServer(addr, newBuilder.build()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The create method has a concurrency issue. It modifies the shared channelBuilder by calling setEndpoint(), but it doesn't restore the original endpoint. This can lead to other threads creating channels for incorrect endpoints. You should save and restore the original endpoint within the synchronized block to ensure thread safety.
synchronized (channelBuilder) {
String originalEndpoint = channelBuilder.getEndpoint();
try {
InstantiatingGrpcChannelProvider.Builder newBuilder =
channelBuilder.setEndpoint(addr);
return new GrpcChannelFinderServer(addr, newBuilder.build());
} finally {
channelBuilder.setEndpoint(originalEndpoint);
}
}| private final NavigableMap<ByteString, CachedRange> ranges = | ||
| new TreeMap<>(ByteString.unsignedLexicographicalComparator()); | ||
|
|
||
| // Groups indexed by group_uid | ||
| private final Map<Long, CachedGroup> groups = new HashMap<>(); | ||
|
|
||
| // Servers indexed by address - shared across all tablets | ||
| private final Map<String, ServerEntry> servers = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is not thread-safe. It uses non-concurrent collections like TreeMap and HashMap for its internal state (ranges, groups, servers), and the public methods that access this state (addRanges, fillRoutingInfo, clear, debugString) are not synchronized. If this cache is used by multiple threads, this will lead to race conditions and ConcurrentModificationException. Consider making the public methods synchronized or using concurrent collections.
| if (!serverAddress.equals(tabletIn.getServerAddress())) { | ||
| serverAddress = tabletIn.getServerAddress(); | ||
| server = null; // Will be lazily initialized | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a tablet's server address changes, the old server reference is nulled out, but its ServerEntry is not unreferenced. This leads to a resource leak. You should unref the old server entry before assigning the new address.
| if (!serverAddress.equals(tabletIn.getServerAddress())) { | |
| serverAddress = tabletIn.getServerAddress(); | |
| server = null; // Will be lazily initialized | |
| } | |
| if (!serverAddress.equals(tabletIn.getServerAddress())) { | |
| if (server != null) { | |
| ServerEntry entry = servers.get(serverAddress); | |
| unref(entry); | |
| } | |
| serverAddress = tabletIn.getServerAddress(); | |
| server = null; // Will be lazily initialized | |
| } |
| return; | ||
| } | ||
| } | ||
|
|
||
| // Tablets changed - rebuild the list, reusing existing tablets where possible | ||
| System.out.println("DEBUG [BYPASS]: Rebuilding tablet list"); | ||
| Map<Long, CachedTablet> tabletsByUid = new HashMap<>(); | ||
| for (CachedTablet tablet : tablets) { | ||
| tabletsByUid.put(tablet.tabletUid, tablet); | ||
| } | ||
|
|
||
| List<CachedTablet> newTablets = new ArrayList<>(groupIn.getTabletsCount()); | ||
| for (int t = 0; t < groupIn.getTabletsCount(); t++) { | ||
| Tablet tabletIn = groupIn.getTablets(t); | ||
| CachedTablet tablet = tabletsByUid.get(tabletIn.getTabletUid()); | ||
| if (tablet == null) { | ||
| tablet = new CachedTablet(); | ||
| System.out.println( | ||
| "DEBUG [BYPASS]: Created new tablet for uid " + tabletIn.getTabletUid()); | ||
| } | ||
| tablet.update(tabletIn); | ||
| System.out.println( | ||
| "DEBUG [BYPASS]: Tablet[" | ||
| + t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When rebuilding the tablet list in a group, CachedTablet objects for tablets that are no longer part of the group are discarded, but their associated ServerEntry objects are not unreferenced. This causes a resource leak. You should track which tablets are removed and call unref on their ServerEntry. A common pattern is to put all old tablets in a map, remove them as they are reused, and then iterate over the remaining tablets in the map to unreference their resources.
| long splitId = 0; | ||
| ByteString generation; | ||
|
|
||
| CachedRange(ByteString startKey, CachedGroup group, long splitId, ByteString generation) { | ||
| this.startKey = startKey; | ||
| this.group = group; | ||
| this.splitId = splitId; | ||
| this.generation = generation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unref(ServerEntry) method is never called, which is the root cause of multiple resource leaks. Additionally, when a ServerEntry's reference count reaches zero, the underlying ManagedChannel should be shut down to release its resources. This is not currently done.
| long splitId = 0; | |
| ByteString generation; | |
| CachedRange(ByteString startKey, CachedGroup group, long splitId, ByteString generation) { | |
| this.startKey = startKey; | |
| this.group = group; | |
| this.splitId = splitId; | |
| this.generation = generation; | |
| private void unref(ServerEntry serverEntry) { | |
| if (serverEntry == null) { | |
| return; | |
| } | |
| if (--serverEntry.refs == 0) { | |
| servers.remove(serverEntry.server.getAddress()); | |
| serverEntry.server.getChannel().shutdown(); | |
| } | |
| } |
| private void unref(ServerEntry serverEntry) { | ||
| if (serverEntry == null) { | ||
| return; | ||
| } | ||
| if (--serverEntry.refs == 0) { | ||
| servers.remove(serverEntry.server.getAddress()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a CachedGroup's reference count drops to zero, it is removed from the cache, but the ServerEntrys for its tablets are not unreferenced. This will cause a resource leak. Before removing the group, you should iterate through its tablets and unref their servers.
private void unref(CachedGroup group) {
if (group == null) {
return;
}
if (--group.refs == 0) {
for (CachedTablet tablet : group.tablets) {
if (tablet.server != null) {
ServerEntry entry = servers.get(tablet.serverAddress);
unref(entry);
}
}
groups.remove(group.groupUid);
}
}| System.out.println( | ||
| "DEBUG [BYPASS]: Group.update for group " | ||
| + groupUid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit adds the KeyRangeCache class that maps key ranges to specific server locations for routing decisions. Key features: - TabletEntry class for tablet metadata (UID, server address, incarnation) - ServerEntry class for server connection management - Key range to tablet mapping with efficient lookup - Lazy server initialization for on-demand connections - Integration with ChannelFinderServer interfaces This is part of the experimental location-aware routing for improved latency.
f4b90d4 to
34a715d
Compare
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> ☕️
If you write sample code, please follow the samples format.