[Phase 1] Fix Flaky Integration Tests + Migrate some tests to Test Containers#1741
[Phase 1] Fix Flaky Integration Tests + Migrate some tests to Test Containers#1741siri-varma wants to merge 7 commits into
Conversation
89fd3e0 to
c01299d
Compare
|
@javier-aliaga Would you know why this is failing ? did not touch any protos or xml files |
|
@siri-varma , the proto files are generated from master dapr/dapr, every time they change something there it break our master. I already fixed that and it is merged to master |
a9b4907 to
d499f64
Compare
Port the BookTrip compensation (Saga) workflow from the plain Java examples into the Spring Boot workflow patterns module, adding @Component-annotated activities and a /wfp/compensation REST endpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Siri Varma Vegiraju <svegiraju@Siris-MacBook-Pro.local>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
d499f64 to
8a7e0d0
Compare
90fa312 to
5c4bb55
Compare
2dea405 to
a341b06
Compare
Signed-off-by: Siri Varma Vegiraju <s_vegiraju@apple.com>
a341b06 to
6ba3643
Compare
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
|
@javier-aliaga need some reviews on this pr please. Mostly test changes.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1741 +/- ##
============================================
- Coverage 76.77% 76.73% -0.04%
- Complexity 2259 2260 +1
============================================
Files 241 241
Lines 7066 7076 +10
Branches 740 742 +2
============================================
+ Hits 5425 5430 +5
- Misses 1277 1282 +5
Partials 364 364 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| }); | ||
|
|
||
| this.receiver = new Thread(() -> { | ||
| int reconnectAttempts = 0; |
There was a problem hiding this comment.
When do we reset this counter?
| @Override | ||
| public void onError(Throwable throwable) { | ||
| listener.onError(DaprException.propagate(throwable)); | ||
| receiverStateChange.release(); |
There was a problem hiding this comment.
This is the right fix but it introduces a problem in line 121.
| this.onError(DaprException.propagate(e)); | ||
| } |
There was a problem hiding this comment.
This call is a sync call, no grpc stream call, this releases the lock but is not telling the stream to be closed.
2 possibilities here;
Do we want to kill the stream because one event fails? or we just notifiy the listener and we keep going?
} catch (Exception e) {
listener.onError(DaprException.propagate(e));
// optionally ack as RETRY so the broker can redeliver
// do NOT call this.onError and do NOT release the semaphore
}
If we want to close the stream then
} catch (Exception e) {
var s = streamRef.get();
if (s != null) {
try {
s.onError(DaprException.propagate(e)); // client-side cancel of the gRPC stream
} catch (Throwable ignore) {
// already terminated
}
}
// do NOT call this.onError(...) or release the semaphore here —
// gRPC will invoke the server→client onError callback on cancellation,
// which is the path at line 126, and that one releases the semaphore.
}
I prefer the listener option, wdyt?
| } | ||
|
|
||
| if (running.get()) { | ||
| long backoffMs = Math.min(1000L * (1L << reconnectAttempts), 30000L); |
There was a problem hiding this comment.
if the server is broken for a long period and you get the attemp 63 then it becomes negative causing this to throw an kill the receiver thread, so we may need to cap it to 5
int safeAttempts = Math.min(reconnectAttempts, 5);
long backoffMs = Math.min(1000L * (1L << safeAttempts), 30000L);
5 attempts = 32000 ms
There was a problem hiding this comment.
Pull request overview
This PR targets CI reliability by reducing flakiness in Java SDK integration tests and migrating selected suites to Testcontainers-based environments, while also adjusting runtime/test harness behavior to better tolerate startup/teardown race conditions.
Changes:
- Hardened Jobs and Pub/Sub integration tests (unique resource names + Awaitility/polling) to reduce timing-related flakes.
- Migrated/added several integration test suites to Testcontainers and removed older
dapr run-based equivalents. - Updated shared test harness + resiliency tests (WireMock dynamic port, health/stop behavior tweaks) and added subscription reconnect backoff.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spring-boot-4-sdk-tests/src/test/java/io/dapr/it/springboot4/testcontainers/jobs/DaprJobsIT.java | Makes job tests less flaky via unique job names, future dueTime, and Awaitility polling. |
| sdk/src/main/java/io/dapr/client/Subscription.java | Adds reconnect backoff and releases receiver semaphore on errors for streaming subscriptions. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/secrets/DaprSecretsIT.java | Adds a Testcontainers-based Secrets API IT (currently disabled pending investigation). |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/stream/DaprPubSubStreamIT.java | Adds Testcontainers-based streaming pubsub tests using the preview client and readiness probing. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/outbox/DaprPubSubOutboxIT.java | Adjusts container lifecycle and extends polling window to mitigate intermittent timing failures (test remains disabled). |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/SubscriberController.java | Updates typed handlers to use local DTOs instead of referencing removed legacy test classes. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/MyObject.java | Introduces local DTO used by Testcontainers Pub/Sub HTTP tests. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/DaprPubSubIT.java | Replaces legacy nested DTO usage with new local DTOs and keeps existing pubsub assertions. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/ConvertToLong.java | Introduces local DTO used by long-value Pub/Sub HTTP tests. |
| sdk-tests/src/test/java/io/dapr/it/testcontainers/jobs/DaprJobsIT.java | Mirrors the Spring Boot 4 jobs test flake fixes (unique names + Awaitility + dueTime adjustments). |
| sdk-tests/src/test/java/io/dapr/it/state/HelloWorldGrpcStateService.java | Removes legacy state “hello world” example service used by older integration tests. |
| sdk-tests/src/test/java/io/dapr/it/state/HelloWorldClientIT.java | Removes legacy state integration test reliant on the removed hello world service. |
| sdk-tests/src/test/java/io/dapr/it/resiliency/SdkResiliencyIT.java | Switches from fixed-port WireMockTest to WireMockExtension dynamic port + explicit DaprContainer lifecycle. |
| sdk-tests/src/test/java/io/dapr/it/pubsub/stream/PubSubStreamIT.java | Removes legacy non-Testcontainers streaming pubsub integration tests. |
| sdk-tests/src/test/java/io/dapr/it/pubsub/http/SubscriberService.java | Removes legacy subscriber Spring Boot service used by non-Testcontainers pubsub tests. |
| sdk-tests/src/test/java/io/dapr/it/pubsub/http/SubscriberController.java | Removes legacy subscriber controller used by non-Testcontainers pubsub tests. |
| sdk-tests/src/test/java/io/dapr/it/pubsub/http/PubSubIT.java | Removes legacy non-Testcontainers pubsub integration test suite. |
| sdk-tests/src/test/java/io/dapr/it/DaprRun.java | Tweaks dapr stop handling and replaces fixed sleeps with bounded health polling to reduce flakes. |
| sdk-tests/src/test/java/io/dapr/it/actors/ActorTimerRecoveryIT.java | Replaces fixed sleeps with longer retry windows for timer recovery validation. |
| sdk-tests/src/test/java/io/dapr/it/actors/ActorReminderRecoveryIT.java | Replaces fixed sleeps with longer retry windows for reminder recovery validation. |
Comments suppressed due to low confidence (1)
sdk-tests/src/test/java/io/dapr/it/DaprRun.java:246
- waitForAppHealth’s HTTP path now uses a fixed 5s connectTimeout and no per-request timeout; a single client.send() can therefore block longer than maxWaitMilliseconds, defeating the method’s timeout contract (especially when maxWaitMilliseconds < 5000). Consider setting HttpRequest timeout (or using sendAsync with an explicit timeout) and deriving connectTimeout from the remaining time budget.
long maxWait = System.currentTimeMillis() + maxWaitMilliseconds;
HttpClient client = HttpClient.newBuilder()
.version(HttpClient.Version.HTTP_1_1)
.connectTimeout(Duration.ofSeconds(5))
.build();
String url = "http://127.0.0.1:" + this.getAppPort() + "/health";
HttpRequest request = HttpRequest.newBuilder()
.GET()
.uri(URI.create(url))
.build();
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (running.get()) { | ||
| long backoffMs = Math.min(1000L * (1L << reconnectAttempts), 30000L); | ||
| try { | ||
| Thread.sleep(backoffMs); | ||
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); | ||
| running.set(false); | ||
| } | ||
| reconnectAttempts++; |
| } catch (RuntimeException e) { | ||
| System.out.println("Could not stop app " + this.appName + ": " + e.getMessage()); | ||
| if (e.getMessage() != null && e.getMessage().contains("Could not find success criteria")) { | ||
| System.out.println("App " + this.appName + " already stopped or not found (ignored)."); | ||
| } else { | ||
| System.out.println("Could not stop app " + this.appName + ": " + e.getMessage()); | ||
| } |
Description
Please explain the changes you've made
Issue reference
We strive to have all PR being opened based on an issue, where the problem or feature have been discussed prior to implementation.
Please reference the issue this PR will close:
#1732
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list: