Skip to content

Conversation

@zane-neo
Copy link
Collaborator

@zane-neo zane-neo commented Dec 12, 2025

Description

Current connector client configuration are not optimal, the value are too big for a httpclient, if the remote endpoint performance is not good and holding the request for long time without response, then the connections will be exhausted in the connection pool. We need to set a better value for them. There isn't a silver bullet configuration for all scenarios, but since we're using pooling connections, we don't want the connection waiting too long(connection usually completed in milliseconds if a server is up), so we set the default connection timeout to 1s, we can't guarantee all LLM can respond quickly so we set the read timeout to 10s.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes

    • Adjusted connector timeout defaults (connection now interpreted in milliseconds; read in seconds).
  • New Features

    • Timeout settings clarified with explicit units for clearer configuration.
    • Executors and remote connectors support a standardized close lifecycle for proper resource cleanup.
  • Breaking Changes

    • Public timeout APIs renamed to reflect explicit units.
  • Chores

    • Added an "**/out" ignore pattern.
  • Tests

    • Connector tests and examples updated to include client read_timeout (60s).

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 12, 2025

Walkthrough

Timeout fields were renamed to explicit units (connectionTimeoutMillis, readTimeoutSeconds) with updated serialization and tests. AbstractConnectorExecutor now lazily creates, caches, and closes an HTTP client; several concrete executors removed their own client caching. RemoteConnectorExecutor extends SdkAutoCloseable. .gitignore ignores **/out.

Changes

Cohort / File(s) Summary
Repo config
\.gitignore
Adds **/out to ignore directories named out at any depth.
Connector model & tests
**/common/src/main/java/**/ConnectorClientConfig.java, **/common/src/test/java/**/ConnectorClientConfigTest.java, **/common/src/test/java/**/transport/connector/MLCreateConnectorInputTests.java
Renamed timeout fields/getters to connectionTimeoutMillis and readTimeoutSeconds; updated builder setters, default constants (primitive int), StreamInput/StreamOutput, toXContent/parse, and tests to reflect unit/name changes.
Executor base & lifecycle
**/ml-algorithms/src/main/java/**/AbstractConnectorExecutor.java
Added connectorClientConfig init, connectorPrivateIpEnabled flag, AtomicReference<SdkAsyncHttpClient> cache, a synchronized protected getHttpClient() that builds via MLHttpClientFactory, and a public close() to shut down/clear the cached client.
Concrete executors (removed client caching)
**/ml-algorithms/src/main/java/**/AwsConnectorExecutor.java, **/ml-algorithms/src/main/java/**/HttpJsonConnectorExecutor.java
Removed per-class httpClientRef, getHttpClient(), connectorPrivateIpEnabled, and MLHttpClientFactory-related imports/usages.
Timeout usages updated
**/ml-algorithms/src/main/java/**/McpConnectorExecutor.java, **/ml-algorithms/src/main/java/**/McpStreamableHttpConnectorExecutor.java, **/ml-algorithms/src/main/java/**/streaming/HttpStreamingHandler.java
Switched connection-timeout construction to use getConnectionTimeoutMillis() with Duration.ofMillis(...); read-timeout now uses getReadTimeoutSeconds() with Duration.ofSeconds(...).
Tests & interface lifecycle
**/ml-algorithms/src/test/java/**/AbstractConnectorExecutorTest.java, **/ml-algorithms/src/main/java/**/RemoteConnectorExecutor.java, **/ml-algorithms/src/main/java/**/RemoteModel.java
Tests updated for renamed getters; RemoteConnectorExecutor now extends SdkAutoCloseable; RemoteModel.close() calls connectorExecutor.close() if non-null before nulling.
Integration test updates
**/plugin/src/test/java/**/RestMLRAGSearchProcessorIT.java, **/plugin/src/test/java/**/RestMLInferenceSearchRequestProcessorIT.java
Added client_config blocks with read_timeout: 60 entries in connector blueprint test strings.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Exec as AbstractConnectorExecutor
    participant Config as ConnectorClientConfig
    participant Factory as MLHttpClientFactory
    participant Http as SdkAsyncHttpClient

    Exec->>Config: getConnectorClientConfig()
    Exec->>Exec: read maxConnections, connectionTimeoutMillis, readTimeoutSeconds
    alt no cached client
        Exec->>Factory: buildClient(maxConnections, connectionTimeoutMillis, readTimeoutSeconds, connectorPrivateIpEnabled)
        Factory-->>Http: returns SdkAsyncHttpClient
        Exec->>Exec: store Http in AtomicReference
    end
    Exec-->>Http: return cached SdkAsyncHttpClient
    Note right of Exec: close() stops and clears cached Http
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Focus areas:
    • ConnectorClientConfig serialization/parsing and XContent keys vs units (ms vs s).
    • All call sites and tests updated to new builder/getter names and default values.
    • Thread-safety, lazy initialization, caching, and close lifecycle semantics in AbstractConnectorExecutor.
    • Consistency where concrete executors removed local caching and rely on base implementation.
    • Impact of RemoteConnectorExecutor extending SdkAutoCloseable on implementors.

Poem

🐇
I hopped from seconds down to mills,
I build a client only when it thrills,
Tests learned new names and adjusted art,
Executors close gently — a tidy start,
Carrot-cache cleared, I bound and part. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Optimize connector client configuration' directly aligns with the main change: optimizing HTTP client timeout default values in ConnectorClientConfig to prevent connection pool exhaustion.
Description check ✅ Passed The description explains the problem, solution, and rationale, but the 'Related Issues' field contains only a placeholder and no issue number is actually referenced.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a75a9b1 and 094b7e4.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1 hunks)
  • common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: spotless
🔇 Additional comments (1)
common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (1)

162-168: Test default value assertions match the current implementation constants but appear inconsistent with unit expectations.

The test assertions expect connectionTimeout=1 and readTimeout=10, which match the implementation's CONNECTION_TIMEOUT_DEFAULT_VALUE and READ_TIMEOUT_DEFAULT_VALUE constants. However, all other test cases in this file explicitly use 5000 and 3000 for these fields, strongly suggesting millisecond units. If these fields are indeed millisecond-based (as the explicit test values and MLHttpClientFactory's Duration parameters indicate), the default constants should be 1000 and 10000 respectively, and these assertions should be updated accordingly:

-        Assert.assertEquals(Integer.valueOf(1), config.getConnectionTimeout());
-        Assert.assertEquals(Integer.valueOf(10), config.getReadTimeout());
+        Assert.assertEquals(Integer.valueOf(1000), config.getConnectionTimeout());
+        Assert.assertEquals(Integer.valueOf(10000), config.getReadTimeout());

Verify whether the implementation defaults are being corrected as part of this change.

xinyual
xinyual previously approved these changes Dec 12, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java (1)

92-92: Verify test data reflects intended timeout values.

The test constructs ConnectorClientConfig(20, 10000, 10000, 10, 10, -1, RetryBackoffPolicy.CONSTANT), which translates to:

  • connectionTimeoutMillis = 10000 (10 seconds)
  • readTimeoutSeconds = 10000 (10,000 seconds ≈ 2.78 hours)

A read timeout of 2.78 hours seems unrealistic for testing. This may have been copy-pasted from previous test data where both values were in milliseconds.

Consider updating to more realistic test values:

-        ConnectorClientConfig connectorClientConfig = new ConnectorClientConfig(20, 10000, 10000, 10, 10, -1, RetryBackoffPolicy.CONSTANT);
+        ConnectorClientConfig connectorClientConfig = new ConnectorClientConfig(20, 10000, 30, 10, 10, -1, RetryBackoffPolicy.CONSTANT);

This would set:

  • Connection timeout: 10000ms (10 seconds)
  • Read timeout: 30 seconds (more reasonable for testing)
🧹 Nitpick comments (1)
common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1)

31-32: Consider adding unit suffixes to JSON field name constants for clarity.

The Java field names now include unit suffixes (connectionTimeoutMillis, readTimeoutSeconds), but the JSON field name constants (CONNECTION_TIMEOUT_FIELD, READ_TIMEOUT_FIELD) do not. This creates potential confusion for API consumers who need to know that connection_timeout expects milliseconds while read_timeout expects seconds.

Consider renaming to make units explicit:

-    public static final String CONNECTION_TIMEOUT_FIELD = "connection_timeout";
-    public static final String READ_TIMEOUT_FIELD = "read_timeout";
+    public static final String CONNECTION_TIMEOUT_FIELD = "connection_timeout_millis";
+    public static final String READ_TIMEOUT_FIELD = "read_timeout_seconds";

Alternatively, document the expected units in comments if changing field names would break API compatibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 094b7e4 and 86c4a29.

📒 Files selected for processing (10)
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (7 hunks)
  • common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (9 hunks)
  • common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (2 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java (0 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java (0 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/streaming/HttpStreamingHandler.java (1 hunks)
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java (1 hunks)
💤 Files with no reviewable changes (2)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: spotless
🔇 Additional comments (8)
common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (2)

38-40: Verify timeout defaults align with intended behavior.

The new defaults set connection timeout to 1000ms (1 second) and read timeout to 10 seconds. While this aligns with the PR objectives to prevent connection pool exhaustion, a 1-second connection timeout may be too aggressive for some network conditions.

Please verify these values have been tested against:

  1. Various network latency conditions (high-latency networks, VPCs across regions)
  2. Cold-start scenarios where remote endpoints may take longer to establish connections
  3. Existing deployments to ensure backward compatibility

Based on the PR objectives and past review comments, this addresses the unit ambiguity issue raised previously.


47-48: LGTM! Field renaming consistently applied.

The renaming from connectionTimeout/readTimeout to connectionTimeoutMillis/readTimeoutSeconds is consistently applied across:

  • Field declarations
  • Constructor parameters
  • Builder methods
  • Serialization/deserialization
  • JSON output (toXContent)
  • Parsing logic

This resolves the unit ambiguity identified in previous reviews by making units explicit in the field names.

Also applies to: 57-58, 65-66, 76-77, 90-91, 102-103, 123-127, 193-194

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java (1)

77-78: LGTM! Timeout configuration correctly updated.

The timeout configuration properly uses:

  • Duration.ofMillis() for connectionTimeoutMillis (milliseconds)
  • Duration.ofSeconds() for readTimeoutSeconds (seconds)

This correctly implements the new API from ConnectorClientConfig.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java (1)

74-75: LGTM! Timeout configuration correctly updated.

The timeout configuration properly uses:

  • Duration.ofMillis() for connectionTimeoutMillis (milliseconds)
  • Duration.ofSeconds() for readTimeoutSeconds (seconds)

This correctly implements the new API from ConnectorClientConfig.

ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java (1)

33-34: LGTM! Tests updated to use new API.

Test assertions correctly updated to validate:

  • getConnectionTimeoutMillis() returns CONNECTION_TIMEOUT_DEFAULT_VALUE
  • getReadTimeoutSeconds() returns READ_TIMEOUT_DEFAULT_VALUE

This ensures the new API is properly tested.

Also applies to: 42-43

common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java (1)

327-328: LGTM! Assertions updated to use new API.

Test assertions correctly updated to use getReadTimeoutSeconds() and getConnectionTimeoutMillis().

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

38-50: LGTM! Thread-safe lazy HTTP client initialization.

The new getHttpClient() method properly implements lazy initialization with:

  • Thread-safe AtomicReference.compareAndSet() for concurrent access
  • Correct timeout conversion: Duration.ofMillis() for connection timeout, Duration.ofSeconds() for read timeout
  • Integration with MLHttpClientFactory for standardized client creation

The addition of connectorPrivateIpEnabled provides useful configuration flexibility.

common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (1)

27-28: LGTM! Comprehensive test coverage for refactored API.

Tests thoroughly cover:

  • Builder API with new method names (connectionTimeoutMillis(), readTimeoutSeconds())
  • Serialization/deserialization with new getters
  • JSON output validation
  • Parsing with correct unit semantics
  • Default value initialization

All assertions correctly validate the new timeout field semantics.

Also applies to: 58-59, 74-75, 87-88, 120-121, 150-151, 163-164

@zane-neo zane-neo force-pushed the optimize-client-configuration branch from 86c4a29 to 9427c33 Compare December 12, 2025 04:58
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

50-55: Use atomic get-and-set to prevent double-close.

The current implementation allows concurrent close() calls to retrieve the same client reference and attempt to close it multiple times.

 public void close() {
-    SdkAsyncHttpClient httpClient = httpClientRef.get();
+    SdkAsyncHttpClient httpClient = httpClientRef.getAndSet(null);
     if (httpClient != null) {
         httpClient.close();
     }
 }
common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1)

117-141: Consider documenting the unit semantics for JSON field values.

The JSON field names (connection_timeout, read_timeout) don't indicate units, but they now have different expected units (milliseconds vs seconds). While this maintains backward compatibility, users configuring these values may expect consistent units.

Consider adding Javadoc or updating user-facing documentation to clarify that connection_timeout expects milliseconds and read_timeout expects seconds.

OpenSearch ml-commons ConnectorClientConfig connection_timeout read_timeout documentation
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8f29f8 and a4ef7cc.

📒 Files selected for processing (13)
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (7 hunks)
  • common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (9 hunks)
  • common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java (0 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java (0 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteConnectorExecutor.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteModel.java (1 hunks)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/streaming/HttpStreamingHandler.java (1 hunks)
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java (1 hunks)
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLRAGSearchProcessorIT.java (1 hunks)
💤 Files with no reviewable changes (2)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java
🚧 Files skipped from review as they are similar to previous changes (5)
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLRAGSearchProcessorIT.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/streaming/HttpStreamingHandler.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java
  • common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java
  • common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and Test MLCommons Plugin on linux (25)
  • GitHub Check: Build and Test MLCommons Plugin on linux (21)
🔇 Additional comments (4)
ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java (1)

28-44: LGTM!

Test assertions correctly updated to use the renamed getters (getConnectionTimeoutMillis() and getReadTimeoutSeconds()), and they properly validate that default values are applied when the connector client config is null or has null field values.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java (1)

74-75: LGTM!

Timeout durations correctly use the new getter methods with appropriate units: Duration.ofMillis() for connection timeout and Duration.ofSeconds() for read timeout, aligning with the renamed field semantics.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteConnectorExecutor.java (1)

64-71: LGTM!

Extending SdkAutoCloseable is appropriate for enforcing resource lifecycle management. The Javadoc clearly documents the responsibility for closing the internal HTTP client when the executor is closed.

common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1)

38-40: LGTM! Default values align with PR objectives.

Connection timeout defaults to 1000ms (1 second) and read timeout to 10 seconds, which addresses the pool exhaustion concern when remote endpoints are slow.

Comment on lines +47 to 49
private Integer connectionTimeoutMillis;
private Integer readTimeoutSeconds;
private Integer retryBackoffMillis;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use milliseconds for both?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, sounds very confusing to use different way to configure these timeouts

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In documentation, we specified that connection timeout is in millisecond and read timeout in second: https://docs.opensearch.org/latest/ml-commons-plugin/remote-models/blueprints/#configuration-parameters, changing this could break existing customer configurations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opensearch-project/documentation-website#9460

based on this issue, the documentation was changed. Let's make the values consistent. I don't have preference about seconds vs miliseconds. But keep this consistent please. Thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can change this to milliseconds but this could be a breaking change for existing customer connector configurations, e.g. user configuring to 3s now become 3 milliseconds which basically means all remote model calls fails, @ylwu-amzn , please also confirm on this.

@dhrubo-os dhrubo-os added the bug Something isn't working label Dec 22, 2025
Comment on lines 39 to 40
public static final Integer CONNECTION_TIMEOUT_DEFAULT_VALUE = 1000;
public static final Integer READ_TIMEOUT_DEFAULT_VALUE = 10;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally we had 30 seconds for both of these cases. Can we get back to the original state? Thanks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide some evidence that connection/read timeout both setting to 30s is a good option?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have evidence handy. But my point is this was 30s before until the bug we introduced. And to fix the bug if we change the value drastically that seems a breaking change to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not targeting for reverting code back, its purpose is to address some issues.

Screenshot 2025-12-23 at 10 53 50 This is a general timeout setting given by Claude, basically we shouldn't set connection timeout 30s that big in our case.

A good user experience of the E2E latency better less than 10s based on this article: https://www.nngroup.com/articles/response-times-3-important-limits. Note that this is only an internal http call instead of E2E latency.

From retry perspective, if retry mechanism is on and if a connection can't established and we wait for 30s, let's say we have 3 retries and that's 90s, I don't think user would like wait that long.

From serverless case, SGW's read timeout is 15s so if this number is bigger than 15s then SGW timed out first.

So all above I prefer to keep the current settings, if you still have concern please give some proof that we have to do that way, thanks!

Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
…e IT

Signed-off-by: zane-neo <zaniu@amazon.com>
@zane-neo zane-neo force-pushed the optimize-client-configuration branch from a4ef7cc to 3046388 Compare December 22, 2025 05:45
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env December 22, 2025 05:47 — with GitHub Actions Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java (1)

70-71: Test data inconsistency: read_timeout value of 10000 is interpreted as seconds, not milliseconds.

This issue was flagged in a previous review and appears unresolved. With the new API where readTimeoutSeconds is in seconds:

  1. Line 70: "read_timeout":10000 means 10,000 seconds (~2.7 hours), not the intended 10 seconds
  2. Line 92: Constructor param 10000 for readTimeoutSeconds should be 10
  3. Line 327: Assertion expects 10000 for getReadTimeoutSeconds() which contradicts the PR goal of 10-second read timeout
🔎 Proposed fix
 private final String expectedInputStr = "{\"name\":\"test_connector_name\","
     ...
     + "\"access_mode\":\"PUBLIC\",\"client_config\":{\"max_connection\":20,"
-    + "\"connection_timeout\":10000,\"read_timeout\":10000,"
+    + "\"connection_timeout\":10000,\"read_timeout\":10,"
     + "\"retry_backoff_millis\":10,\"retry_timeout_seconds\":10,\"max_retry_times\":-1,\"retry_backoff_policy\":\"constant\"}}";
-ConnectorClientConfig connectorClientConfig = new ConnectorClientConfig(20, 10000, 10000, 10, 10, -1, RetryBackoffPolicy.CONSTANT);
+ConnectorClientConfig connectorClientConfig = new ConnectorClientConfig(20, 10000, 10, 10, 10, -1, RetryBackoffPolicy.CONSTANT);
-    assertEquals(10000, parsedInput.getConnectorClientConfig().getReadTimeoutSeconds().intValue());
+    assertEquals(10, parsedInput.getConnectorClientConfig().getReadTimeoutSeconds().intValue());

Also applies to: 92-92, 327-328

🧹 Nitpick comments (2)
plugin/src/test/java/org/opensearch/ml/rest/RestMLInferenceSearchRequestProcessorIT.java (1)

54-56: The read_timeout field name is correct and 60 seconds is appropriate for the OpenAI connector.

The field name aligns with the ConnectorClientConfig schema where "read_timeout" is the correct JSON serialization name. The 60-second timeout is appropriate for external API integration tests, allowing adequate time for OpenAI API responses.

Optionally, consider adding explicit client_config timeouts to the Bedrock connectors (bedrockEmbeddingModelConnectorEntity and bedrockMultiModalEmbeddingModelConnectorEntity) for consistency, as they also invoke external services. However, this is not required if the PR scope is limited to the OpenAI connector.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

26-26: Redundant volatile modifier on AtomicReference.

AtomicReference already provides thread-safe visibility guarantees through its internal volatile field. The volatile modifier on the reference variable itself is unnecessary since the reference is never reassigned after initialization.

🔎 Suggested fix
-    private volatile AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
+    private final AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4ef7cc and 3046388.

📒 Files selected for processing (15)
  • .gitignore
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java
  • common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java
  • common/src/test/java/org/opensearch/ml/common/transport/connector/MLCreateConnectorInputTests.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteModel.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/streaming/HttpStreamingHandler.java
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLInferenceSearchRequestProcessorIT.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLRAGSearchProcessorIT.java
💤 Files with no reviewable changes (2)
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AwsConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/HttpJsonConnectorExecutor.java
🚧 Files skipped from review as they are similar to previous changes (7)
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/streaming/HttpStreamingHandler.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/McpStreamableHttpConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteModel.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLRAGSearchProcessorIT.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteConnectorExecutor.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-18T16:02:36.322Z
Learnt from: akolarkunnu
Repo: opensearch-project/ml-commons PR: 3919
File: plugin/src/test/java/org/opensearch/ml/cluster/MLSyncUpCronTests.java:184-191
Timestamp: 2025-12-18T16:02:36.322Z
Learning: In tests that exercise initialization logic (e.g., initMLConfig), verify idempotence by calling the init method twice and asserting that the master key is initialized once and cached for subsequent calls. The test should confirm that repeated initializations yield the same result and no unnecessary recomputation. Apply this pattern to similar test files under plugin/src/test/java/org/opensearch/ml/cluster and other modules that initialize shared state.

Applied to files:

  • plugin/src/test/java/org/opensearch/ml/rest/RestMLInferenceSearchRequestProcessorIT.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test MLCommons Plugin on linux (21)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (21)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (25)
  • GitHub Check: Build and Test MLCommons Plugin on linux (25)
🔇 Additional comments (4)
.gitignore (1)

14-14: Already addressed in prior review.

This line was flagged in a previous review and marked as addressed in commits 1eb22e6 to 3046388. No new concerns identified.

common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1)

38-48: Field renaming improves internal clarity; JSON API unit ambiguity remains a documentation concern.

The renaming of connectionTimeoutconnectionTimeoutMillis and readTimeoutreadTimeoutSeconds with corresponding default values (1000ms and 10s) makes the internal API explicit about units. This aligns with the PR objective of reducing long-held connections.

Note that the JSON serialization still uses connection_timeout and read_timeout without unit suffixes (lines 31-32, 123-127), which relies on documentation to clarify the mixed units. Per the author's earlier comment, this is intentional for backward compatibility with existing configurations.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

36-56: Double-checked locking pattern correctly addresses the previous race condition.

The implementation properly uses synchronization to prevent multiple HTTP client instances from being created, with a fast path for already-initialized clients. This addresses the race condition flagged in previous reviews.

common/src/test/java/org/opensearch/ml/common/connector/ConnectorClientConfigTest.java (1)

27-28: Test expectations correctly aligned with new timeout unit semantics.

The tests properly use:

  • connectionTimeoutMillis() with millisecond values (1000, 5000)
  • readTimeoutSeconds() with second values (3, 10)
  • Assertions via getConnectionTimeoutMillis() and getReadTimeoutSeconds()

This correctly validates the renamed API and default values (1000ms connection timeout, 10s read timeout).

Also applies to: 87-88, 120-121, 163-164

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

27-27: Remove redundant volatile modifier on AtomicReference.

The volatile keyword on AtomicReference is redundant because AtomicReference already provides volatile semantics for the reference it holds. The atomic operations (get, set, compareAndSet) ensure proper visibility across threads.

🔎 Proposed fix
-    private volatile AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
+    private AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3046388 and bd31362.

📒 Files selected for processing (2)
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (21)
  • GitHub Check: Build and Test MLCommons Plugin on linux (25)
  • GitHub Check: Build and Test MLCommons Plugin on linux (21)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (25)
🔇 Additional comments (12)
ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (4)

8-18: LGTM! Appropriate imports for HTTP client management.

The new imports support the lazy HTTP client initialization pattern and AWS SDK v2 integration.


31-35: LGTM! Simplified initialization logic.

The method now relies on the default ConnectorClientConfig instance initialized at declaration, which is cleaner than the previous else-branch approach.


37-69: LGTM! Thread-safe lazy initialization implemented correctly.

The double-checked locking pattern is properly implemented:

  • Fast path avoids synchronization overhead after initialization
  • Synchronized block ensures only one thread creates the HTTP client
  • Optional.ofNullable with defaults prevents NPE on null configuration values
  • Atomic reference setting ensures visibility

Previous concerns about race conditions and NPE have been effectively addressed.


71-76: LGTM! Proper resource cleanup with use-after-close prevention.

The getAndSet(null) approach atomically clears the reference while retrieving the client for cleanup, preventing any thread from obtaining a closed client after close() is invoked.

common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (8)

38-40: LGTM! Default values optimized per PR objectives.

The updated defaults align with the PR's goal to prevent connection pool exhaustion:

  • Connection timeout: 1000ms (1 second) - down from previous unclear default
  • Read timeout: 10 seconds - maintains reasonable duration for remote calls
  • Type change to int primitives is appropriate for constants

These shorter timeouts reduce the risk of long-held pooled connections when remote endpoints perform poorly.


47-48: LGTM! Explicit field naming eliminates unit ambiguity.

Renaming the fields to include unit suffixes (connectionTimeoutMillis, readTimeoutSeconds) clarifies the expected units and aligns with the existing pattern used by retryBackoffMillis and retryTimeoutSeconds. This addresses the unit confusion flagged in previous reviews while maintaining backward compatibility through unchanged JSON field names.


54-71: LGTM! Builder updated consistently with field renaming.

The builder parameters and assignments reflect the new field names, providing clear unit semantics at the API level.


73-86: LGTM! Deserialization updated consistently.

The StreamInput constructor reads timeout values into the renamed fields, maintaining proper binary serialization compatibility.


88-96: LGTM! Default constructor initializes with optimized values.

The no-arg constructor ensures all fields have sensible defaults, preventing null values and providing the optimized timeout configuration by default.


98-115: LGTM! Serialization updated consistently.

The writeTo method serializes the renamed fields in the same order, maintaining binary protocol compatibility with the deserialization path.


117-142: LGTM! JSON serialization maintains backward compatibility.

The toXContent method outputs using the original JSON field names (connection_timeout, read_timeout), ensuring existing configurations and API clients continue to work. The internal field renaming doesn't break the external contract.


149-200: LGTM! Parser maintains backward compatibility while mapping to renamed fields.

The parse method preserves backward compatibility by:

  • Reading from unchanged JSON field names (connection_timeout, read_timeout)
  • Mapping parsed values to the renamed builder methods (connectionTimeoutMillis, readTimeoutSeconds)
  • Relying on documented unit semantics (milliseconds for connection timeout, seconds for read timeout)

Existing configurations will continue to parse correctly, as the external JSON contract remains stable while internal naming improves clarity.

Signed-off-by: zane-neo <zaniu@amazon.com>
@zane-neo zane-neo force-pushed the optimize-client-configuration branch from bd31362 to 1712802 Compare December 22, 2025 07:06
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env December 22, 2025 07:07 — with GitHub Actions Inactive
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env December 22, 2025 07:07 — with GitHub Actions Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (1)

27-27: Optional: Remove redundant volatile modifier.

The volatile keyword on httpClientRef is unnecessary—AtomicReference already provides the required memory visibility and happens-before guarantees through its atomic operations. The field reference itself is never reassigned (it's effectively final), so volatile has no effect.

🔎 Suggested simplification
-    private volatile AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
+    private final AtomicReference<SdkAsyncHttpClient> httpClientRef = new AtomicReference<>();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd31362 and 1712802.

📒 Files selected for processing (5)
  • common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java
  • ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteModel.java
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLInferenceSearchRequestProcessorIT.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • ml-algorithms/src/test/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutorTest.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLInferenceSearchRequestProcessorIT.java
🧰 Additional context used
🧬 Code graph analysis (1)
ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (2)
common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (1)
  • Getter (26-201)
common/src/main/java/org/opensearch/ml/common/connector/AbstractConnector.java (1)
  • Getter (30-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (21)
  • GitHub Check: Build and Test MLCommons Plugin on linux (21)
  • GitHub Check: Build and Test MLCommons Plugin on linux (25)
  • GitHub Check: Build and Test MLCommons Plugin on Windows (25)
🔇 Additional comments (5)
common/src/main/java/org/opensearch/ml/common/connector/ConnectorClientConfig.java (2)

38-43: LGTM! Timeout field renaming and defaults clarify units.

The renaming from connectionTimeout/readTimeout to connectionTimeoutMillis/readTimeoutSeconds explicitly communicates units, resolving the critical ambiguity flagged in past reviews. The updated defaults (1000ms for connection establishment, 10s for read) are sensible for pooled HTTP clients and align with the PR's goal of preventing connection pool exhaustion when endpoints are slow.

Also applies to: 47-48


31-32: Backwards-compatible serialization maintained.

The serialization field names (CONNECTION_TIMEOUT_FIELD, READ_TIMEOUT_FIELD) remain unchanged while internal field names and getters have been updated with unit suffixes. This preserves wire compatibility with existing configurations while improving code clarity.

Also applies to: 123-127, 167-171, 193-194

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/RemoteModel.java (1)

99-104: LGTM! Null-check prevents NPE in close().

The guard ensures close() is safe to call even if the executor was never initialized, addressing the issue flagged in previous reviews.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/remote/AbstractConnectorExecutor.java (2)

37-69: LGTM! Thread-safe lazy initialization with proper null-safety.

The double-checked locking pattern is correctly implemented:

  • Fast path (lines 39-42) avoids synchronization after initialization
  • Synchronized slow path (lines 44-68) ensures only one thread creates the client
  • Optional.ofNullable().orElse() pattern prevents NPEs when config values are null, addressing the concern from previous reviews

The Duration conversions correctly interpret connectionTimeoutMillis as milliseconds and readTimeoutSeconds as seconds, aligning with the renamed getters in ConnectorClientConfig.


71-76: LGTM! Atomic close prevents use-after-close.

Using getAndSet(null) atomically clears the reference while retrieving the old client for cleanup, preventing subsequent getHttpClient() calls from returning a closed client. This addresses the use-after-close issue flagged in previous reviews.

@zane-neo zane-neo deployed to ml-commons-cicd-env December 22, 2025 08:38 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants