feat(flagd): introduce fatalStatusCodes option #1624
Conversation
Force-pushed from 6f89ff0 to 057751b
Signed-off-by: lea konvalinka <lea.konvalinka@dynatrace.com>
Force-pushed from f7f1d97 to f0a1db2
Signed-off-by: lea konvalinka <lea.konvalinka@dynatrace.com>
chrfwow left a comment:
Since we do not want to introduce breaking changes into the API by adding a PROVIDER_FATAL type to ProviderEvent, I have two suggestions for how we might work around the "misuse" of the stale event:
1. We could add an isFatal flag to the FlagdProviderEvent to track the type of error. I don't really like it, because this flag could also be set when the event is not an error event, and it splits information that should live in one place across two places.
2. Or, we create an enum class ExtendedProviderEvent, which is a copy of ProviderEvent (enums cannot be extended in Java) plus the additional PROVIDER_FATAL field. We would then have to map between the two types where needed (not 100% sure this will work). I don't like this either, because we would duplicate the ProviderEvent enum.
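A minimal sketch of the second suggestion, just to make the trade-off concrete. The enum constants mirror the SDK's ProviderEvent, but everything here is illustrative, not part of the actual SDK or this PR; the mapping collapses PROVIDER_FATAL back to a plain error event name.

```java
// Hypothetical ExtendedProviderEvent: a copy of the SDK enum values plus
// PROVIDER_FATAL, with a mapper back to an SDK-compatible event name.
enum ExtendedProviderEvent {
    PROVIDER_READY,
    PROVIDER_CONFIGURATION_CHANGED,
    PROVIDER_STALE,
    PROVIDER_ERROR,
    PROVIDER_FATAL;

    /** Maps to the closest SDK event name; FATAL collapses to ERROR. */
    String toSdkEventName() {
        return this == PROVIDER_FATAL ? "PROVIDER_ERROR" : name();
    }
}
```

The duplication concern is visible immediately: any new SDK event value would have to be copied here by hand.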
private final BlockingQueue<QueuePayload> outgoingQueue = new LinkedBlockingQueue<>(QUEUE_SIZE);
private final FlagSyncServiceStub flagSyncStub;
private final FlagSyncServiceBlockingStub metadataStub;
private final List<String> fatalStatusCodes;
Since we do lots of .contains operations on this data structure, a HashSet might be more performant. How many entries do we expect in this list?
That's hard for me to estimate; what do the others think? The currently defined default is an empty list.
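For illustration, this is roughly what the HashSet suggestion would look like; the class and method names are made up for the sketch, not the actual provider code. With a handful of gRPC status codes either structure is fine, but the set makes the intent (membership testing) explicit.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: copy the configured codes into a HashSet once at
// construction, so contains() on the hot path is O(1) instead of O(n).
class FatalStatusCodes {
    private final Set<String> fatalStatusCodes;

    FatalStatusCodes(List<String> configured) {
        this.fatalStatusCodes = new HashSet<>(configured);
    }

    boolean isFatal(String statusCode) {
        return fatalStatusCodes.contains(statusCode);
    }
}
```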
        .map(String::trim)
        .collect(Collectors.toList()) : defaultValue;
} catch (Exception e) {
    return defaultValue;
We should log an info/warn that the env vars are invalid.
Just for this method? Or the other ones too? I'd either leave it or add it in all cases to be consistent
Then we should add it everywhere, but in a different PR
Alright, sounds good. Should we create a new issue for this or is that overkill?
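A hedged sketch of what the suggested warning could look like, based on the snippet in the diff above. The class name, method name, and logger choice are illustrative assumptions, not the actual FlagdOptions code, which would presumably use the project's existing logging facade.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
import java.util.stream.Collectors;

// Illustrative: fall back to the default on a parse failure,
// but warn instead of failing silently.
class EnvListParser {
    private static final Logger log = Logger.getLogger("EnvListParser");

    static List<String> parseList(String raw, List<String> defaultValue) {
        try {
            return raw != null
                    ? Arrays.stream(raw.split(","))
                            .map(String::trim)
                            .collect(Collectors.toList())
                    : defaultValue;
        } catch (Exception e) {
            // the suggested info/warn, so invalid env vars are visible
            log.warning("Invalid list value '" + raw + "', falling back to default");
            return defaultValue;
        }
    }
}
```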
This is an implication of our provider design, and there is not really anything to do about that (in this PR).
    }
    break;
case ERROR:
    if (!stateBlockingQueue.offer(new StorageStateChange(StorageState.STALE))) {
Could we simply add a FATAL storage state to resolve this conceptual "STALE" overloading? This is an entirely private enum, so we can add to it without issue.
Yes, we could, but this would only solve the misuse issue in the communication step from FlagStore -> InProcessResolver (and not InProcessResolver -> FlagdProvider).
Also, nitpick: StorageState.ERROR is already documented as /** Storage is in an unrecoverable error stage. */, which models what FATAL means for us, I think.
Based on the existing StorageState docs, the states are now used as intended:
- STALE: Storage has gone stale (most recent sync failed); may return to OK status with the next sync. The QueueSource encountered an error but will try to recover.
- ERROR: Storage is in an unrecoverable error stage. The QueueSource encountered an error AND will not try to recover; it exited the sync loop.
We fixed this by propagating the provider details through the consumer as well, so we don't need to overload or abuse the enums anymore.
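The STALE/ERROR semantics quoted above can be summarized in a small enum sketch. The constants and doc comments follow the discussion; the isRecoverable() helper is an illustrative addition, not part of the actual provider code.

```java
// Sketch of the StorageState semantics discussed above (the real enum is
// private to the flagd provider).
enum StorageState {
    /** Storage is ready and serving the latest sync. */
    OK,
    /** Most recent sync failed; may return to OK with the next sync. */
    STALE,
    /** Storage is in an unrecoverable error stage; the sync loop exited. */
    ERROR;

    // Illustrative helper: only ERROR is terminal under these semantics.
    boolean isRecoverable() {
        return this != ERROR;
    }
}
```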
private final AtomicBoolean shutdown = new AtomicBoolean(false);
private final AtomicBoolean shouldThrottle = new AtomicBoolean(false);
private final AtomicBoolean successfulSync = new AtomicBoolean(false);
My biggest question with this whole concept (not your implementation) is whether or not we should care about whether this is the initial sync or not. I'm actually leaning towards "not"... here is my reasoning (anyone feel free to disagree):
- users already fully control what's considered FATAL; they can also control whether or not to treat FATAL at init differently than FATAL later, using event handlers and the details of the exception
- it's simpler (fewer conditions/states to handle in our code [this field would disappear] and for users to understand)
WDYT?
If it's easy to get the same behaviour through event handlers, I think that might be better, because it allows for more customization. I get both sides: one might not want to completely shut down if a valid flag config was previously received, but one also might not want to work with stale data after receiving a non-transient error.
I agree with @toddbaert that fatality does not depend on being the first sync or not.
However, fatality from the SyncStreamQueueSource's perspective depends on whether we break out of the sync loop and stop the sync thread (FATAL ERROR), or continue to reconnect (ERROR).
I could be missing something, but I don't think this is an issue. The "fatalness" (fatality?) of an event is not communicated by the event type, but by the error code associated with the event: https://github.com/open-feature/java-sdk/blob/main/src/main/java/dev/openfeature/sdk/ProviderEventDetails.java#L16. All error events are ERROR events; some of them additionally carry a fatal error code. This is basically what's in Go as well, right @alexandraoberaigner?
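To make the point concrete: the event type stays a plain error event, and the fatal signal travels in the error code attached to the event details. The types below are simplified stand-ins mirroring the shape of the SDK's ProviderEventDetails, not the actual SDK classes.

```java
// Simplified stand-ins: "fatalness" lives in the error code, not the event type.
enum EventType { PROVIDER_ERROR }

enum ErrorCode { GENERAL, PROVIDER_FATAL }

record EventDetails(EventType type, ErrorCode errorCode, String message) {
    // A consumer decides fatality by inspecting the error code.
    boolean isFatal() {
        return errorCode == ErrorCode.PROVIDER_FATAL;
    }
}
```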
Signed-off-by: Konvalinka <lea.konvalinka@dynatrace.com>
Force-pushed from 341d1e1 to 94c7691
Signed-off-by: Konvalinka <lea.konvalinka@dynatrace.com>
The issue is that the … Personally, I think I prefer adding a …
Signed-off-by: Konvalinka <lea.konvalinka@dynatrace.com>
Yes, in Go it's a provider event with an error code: see ProviderInitError with ErrorCode including PROVIDER_FATAL. I just found a test in the java-sdk that shows how the fatal error should look: see here.
Yup, I also implemented it like this when the FlagdProvider emits the fatal error here.
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
toddbaert left a comment:
I made a few small test fixes.
Looks good to me, and FATAL works as expected. One thing to note is that as discussed in various places, we DO NOT act differently on FATAL status codes depending on whether this is the initial connection or not; if the stream sends a FATAL code at any time, the provider transitions to FATAL.
Force-pushed from 60a282f to 9146dc5
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Force-pushed from 9146dc5 to a969488
@Then("the client should be in {} state")
public void the_client_should_be_in_fatal_state(String clientState) {
    assertThat(state.client.getProviderState()).isEqualTo(ProviderState.valueOf(clientState.toUpperCase()));
    await().pollDelay(100, TimeUnit.MILLISECONDS)
I had to make this test tolerate a small retry/timeout here (it worked most of the time but was flaky for both in-process and RPC)
I think that there's a small lag time between when we fire events, and when the SDK updates the client status. We should consider whether this is a small bug with the SDK, and if we should guarantee that the state is updated before the event handlers run (or not)
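The retry pattern behind the fix can be sketched with the standard library alone; the actual test uses Awaitility's await().pollDelay(...), and this helper is only an illustration of the idea (poll until the condition holds or a deadline passes, instead of asserting once).

```java
import java.util.function.BooleanSupplier;

// Illustrative polling helper: tolerates the small lag between event
// delivery and the SDK updating the client status.
class EventualAssert {
    static boolean eventually(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition held within the deadline
            }
            Thread.sleep(pollMs); // give the SDK time to propagate state
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```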
This PR
Related Issues
resulted from this issue
Notes
I'm not too happy with how the fatal error is communicated through the different components (received at SyncStreamQueueSource -> FlagStore -> InProcessResolver -> FlagdProvider, respectively RpcResolver -> FlagdProvider). It "misuses" the STALE state to differentiate between normal errors and fatal errors. I couldn't find a cleaner solution for this unfortunately, so feedback on this would be highly appreciated! I will work on the remaining failing tests once we agree on how to proceed.
Follow-up Tasks
How to test