Skip to content

[Phase 1] Fix Flaky Integration Tests + Migrate some tests to Test Containers#1741

Open
siri-varma wants to merge 7 commits into
dapr:masterfrom
siri-varma:users/svegiraju/fix-integ-tests
Open

[Phase 1] Fix Flaky Integration Tests + Migrate some tests to Test Containers#1741
siri-varma wants to merge 7 commits into
dapr:masterfrom
siri-varma:users/svegiraju/fix-integ-tests

Conversation

@siri-varma
Copy link
Copy Markdown
Contributor

@siri-varma siri-varma commented May 10, 2026

Description

Please explain the changes you've made

Issue reference

We strive to have all PR being opened based on an issue, where the problem or feature have been discussed prior to implementation.

Please reference the issue this PR will close:
#1732

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

  • Code compiles correctly
  • Created/updated tests
  • Extended the documentation

@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch 4 times, most recently from 89fd3e0 to c01299d Compare May 10, 2026 22:16
@siri-varma
Copy link
Copy Markdown
Contributor Author

@javier-aliaga Would you know why this is failing ? did not touch any protos or xml files

@siri-varma siri-varma marked this pull request as ready for review May 11, 2026 14:34
@siri-varma siri-varma requested review from a team as code owners May 11, 2026 14:34
@javier-aliaga
Copy link
Copy Markdown
Contributor

@siri-varma , the proto files are generated from master dapr/dapr, every time they change something there it break our master. I already fixed that and it is merged to master

@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch 8 times, most recently from a9b4907 to d499f64 Compare May 14, 2026 19:07
Siri Varma Vegiraju and others added 5 commits May 14, 2026 12:10
Port the BookTrip compensation (Saga) workflow from the plain Java
examples into the Spring Boot workflow patterns module, adding
@Component-annotated activities and a /wfp/compensation REST endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Siri Varma Vegiraju <svegiraju@Siris-MacBook-Pro.local>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch from d499f64 to 8a7e0d0 Compare May 14, 2026 19:10
@siri-varma siri-varma changed the title Fix Flaky Integration Tests [Phase 1] Fix Flaky Integration Tests + Migrate some tests to Test Containers May 14, 2026
@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch 6 times, most recently from 90fa312 to 5c4bb55 Compare May 15, 2026 14:35
@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch 2 times, most recently from 2dea405 to a341b06 Compare May 15, 2026 15:09
Signed-off-by: Siri Varma Vegiraju <s_vegiraju@apple.com>
@siri-varma siri-varma force-pushed the users/svegiraju/fix-integ-tests branch from a341b06 to 6ba3643 Compare May 15, 2026 15:33
Signed-off-by: Siri Varma Vegiraju <siri.varma@outlook.com>
@siri-varma
Copy link
Copy Markdown
Contributor Author

@javier-aliaga need some reviews on this pr please. Mostly test changes.

  • Reduce flakiness
  • Move to test containers

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.73%. Comparing base (89708ae) to head (fca2573).

Files with missing lines Patch % Lines
sdk/src/main/java/io/dapr/client/Subscription.java 60.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1741      +/-   ##
============================================
- Coverage     76.77%   76.73%   -0.04%     
- Complexity     2259     2260       +1     
============================================
  Files           241      241              
  Lines          7066     7076      +10     
  Branches        740      742       +2     
============================================
+ Hits           5425     5430       +5     
- Misses         1277     1282       +5     
  Partials        364      364              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@siri-varma siri-varma added this to the v1.18 milestone May 18, 2026
});

this.receiver = new Thread(() -> {
int reconnectAttempts = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we reset this counter?

@Override
public void onError(Throwable throwable) {
listener.onError(DaprException.propagate(throwable));
receiverStateChange.release();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right fix but it introduces a problem in line 121.

Comment on lines 121 to 122
this.onError(DaprException.propagate(e));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is a sync call, no grpc stream call, this releases the lock but is not telling the stream to be closed.

2 possibilities here;

Do we want to kill the stream because one event fails? or we just notifiy the listener and we keep going?

} catch (Exception e) {
    listener.onError(DaprException.propagate(e));
    // optionally ack as RETRY so the broker can redeliver
    // do NOT call this.onError and do NOT release the semaphore
  }

If we want to close the stream then

  } catch (Exception e) {
    var s = streamRef.get();
    if (s != null) {
      try {
        s.onError(DaprException.propagate(e));  // client-side cancel of the gRPC stream
      } catch (Throwable ignore) {
        // already terminated
      }
    }
    // do NOT call this.onError(...) or release the semaphore here —
    // gRPC will invoke the server→client onError callback on cancellation,
    // which is the path at line 126, and that one releases the semaphore.
  }

I prefer the listener option, wdyt?

}

if (running.get()) {
long backoffMs = Math.min(1000L * (1L << reconnectAttempts), 30000L);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the server is broken for a long period and you get the attemp 63 then it becomes negative causing this to throw an kill the receiver thread, so we may need to cap it to 5

  int safeAttempts = Math.min(reconnectAttempts, 5);
  long backoffMs = Math.min(1000L * (1L << safeAttempts), 30000L);

5 attempts = 32000 ms

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets CI reliability by reducing flakiness in Java SDK integration tests and migrating selected suites to Testcontainers-based environments, while also adjusting runtime/test harness behavior to better tolerate startup/teardown race conditions.

Changes:

  • Hardened Jobs and Pub/Sub integration tests (unique resource names + Awaitility/polling) to reduce timing-related flakes.
  • Migrated/added several integration test suites to Testcontainers and removed older dapr run-based equivalents.
  • Updated shared test harness + resiliency tests (WireMock dynamic port, health/stop behavior tweaks) and added subscription reconnect backoff.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
spring-boot-4-sdk-tests/src/test/java/io/dapr/it/springboot4/testcontainers/jobs/DaprJobsIT.java Makes job tests less flaky via unique job names, future dueTime, and Awaitility polling.
sdk/src/main/java/io/dapr/client/Subscription.java Adds reconnect backoff and releases receiver semaphore on errors for streaming subscriptions.
sdk-tests/src/test/java/io/dapr/it/testcontainers/secrets/DaprSecretsIT.java Adds a Testcontainers-based Secrets API IT (currently disabled pending investigation).
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/stream/DaprPubSubStreamIT.java Adds Testcontainers-based streaming pubsub tests using the preview client and readiness probing.
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/outbox/DaprPubSubOutboxIT.java Adjusts container lifecycle and extends polling window to mitigate intermittent timing failures (test remains disabled).
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/SubscriberController.java Updates typed handlers to use local DTOs instead of referencing removed legacy test classes.
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/MyObject.java Introduces local DTO used by Testcontainers Pub/Sub HTTP tests.
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/DaprPubSubIT.java Replaces legacy nested DTO usage with new local DTOs and keeps existing pubsub assertions.
sdk-tests/src/test/java/io/dapr/it/testcontainers/pubsub/http/ConvertToLong.java Introduces local DTO used by long-value Pub/Sub HTTP tests.
sdk-tests/src/test/java/io/dapr/it/testcontainers/jobs/DaprJobsIT.java Mirrors the Spring Boot 4 jobs test flake fixes (unique names + Awaitility + dueTime adjustments).
sdk-tests/src/test/java/io/dapr/it/state/HelloWorldGrpcStateService.java Removes legacy state “hello world” example service used by older integration tests.
sdk-tests/src/test/java/io/dapr/it/state/HelloWorldClientIT.java Removes legacy state integration test reliant on the removed hello world service.
sdk-tests/src/test/java/io/dapr/it/resiliency/SdkResiliencyIT.java Switches from fixed-port WireMockTest to WireMockExtension dynamic port + explicit DaprContainer lifecycle.
sdk-tests/src/test/java/io/dapr/it/pubsub/stream/PubSubStreamIT.java Removes legacy non-Testcontainers streaming pubsub integration tests.
sdk-tests/src/test/java/io/dapr/it/pubsub/http/SubscriberService.java Removes legacy subscriber Spring Boot service used by non-Testcontainers pubsub tests.
sdk-tests/src/test/java/io/dapr/it/pubsub/http/SubscriberController.java Removes legacy subscriber controller used by non-Testcontainers pubsub tests.
sdk-tests/src/test/java/io/dapr/it/pubsub/http/PubSubIT.java Removes legacy non-Testcontainers pubsub integration test suite.
sdk-tests/src/test/java/io/dapr/it/DaprRun.java Tweaks dapr stop handling and replaces fixed sleeps with bounded health polling to reduce flakes.
sdk-tests/src/test/java/io/dapr/it/actors/ActorTimerRecoveryIT.java Replaces fixed sleeps with longer retry windows for timer recovery validation.
sdk-tests/src/test/java/io/dapr/it/actors/ActorReminderRecoveryIT.java Replaces fixed sleeps with longer retry windows for reminder recovery validation.
Comments suppressed due to low confidence (1)

sdk-tests/src/test/java/io/dapr/it/DaprRun.java:246

  • waitForAppHealth’s HTTP path now uses a fixed 5s connectTimeout and no per-request timeout; a single client.send() can therefore block longer than maxWaitMilliseconds, defeating the method’s timeout contract (especially when maxWaitMilliseconds < 5000). Consider setting HttpRequest timeout (or using sendAsync with an explicit timeout) and deriving connectTimeout from the remaining time budget.
      long maxWait = System.currentTimeMillis() + maxWaitMilliseconds;
      HttpClient client = HttpClient.newBuilder()
          .version(HttpClient.Version.HTTP_1_1)
          .connectTimeout(Duration.ofSeconds(5))
          .build();
      String url = "http://127.0.0.1:" + this.getAppPort() + "/health";
      HttpRequest request = HttpRequest.newBuilder()
          .GET()
          .uri(URI.create(url))
          .build();

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +148 to +156
if (running.get()) {
long backoffMs = Math.min(1000L * (1L << reconnectAttempts), 30000L);
try {
Thread.sleep(backoffMs);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
running.set(false);
}
reconnectAttempts++;
Comment on lines 176 to +181
} catch (RuntimeException e) {
System.out.println("Could not stop app " + this.appName + ": " + e.getMessage());
if (e.getMessage() != null && e.getMessage().contains("Could not find success criteria")) {
System.out.println("App " + this.appName + " already stopped or not found (ignored).");
} else {
System.out.println("Could not stop app " + this.appName + ": " + e.getMessage());
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants