WebFlux is the reactive web framework for Spring. The programming model is easy to pick up, but things can go wrong in subtle ways if you don’t know the consequences of using it incorrectly. This post shows what those consequences are when the reactive stack is used the wrong way, and answers the question: why is my (health) endpoint sometimes so slow?

TL;DR: Don’t block the event loop.

It all started when our service was occasionally marked as unhealthy. An unhealthy service is removed from the gateway and no traffic is routed to it anymore. Some investigation showed that the health endpoint sometimes took longer than the configured maximum of 5 seconds. Why was our health endpoint so slow? Nothing much happens in the health endpoint, only a database check which is always fast. Even after disabling this database check the endpoint was still slow at times. This was a while ago and we used a normal JDBC driver, which is blocking. Switching to a non-blocking driver, if available, may seem the obvious fix, but that still doesn’t answer the question why the health endpoint is slow when we do blocking I/O on another endpoint.

How does processing web requests work? First back to good old Spring MVC: under the hood it’s Tomcat, which uses one thread per request. When the request is done the thread is returned to the pool for the next request. The default configuration is 200 threads, and the OS schedules those threads on the available CPU cores. The reactive stack instead runs an event loop on a single thread to schedule requests, and there are as many event loops as there are CPU cores. A request must be handled in a non-blocking way, with a callback on completion, so the event loop can move on to other requests. With less context switching this is more efficient and the server can process more requests per second. In WebFlux the event loop is implemented by Project Reactor, which under the hood uses Reactor Netty to handle the web requests. The number of event-loop threads is controlled by the Reactor Netty property reactor.netty.ioWorkerCount, which is documented as "Default worker thread count, fallback to available processor (but with a minimum value of 4)".
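To make experiments reproducible you can pin the number of event-loop threads yourself. Below is a minimal sketch, assuming the property is set before Reactor Netty starts; the class name DemoApplication is just an example, and the JVM flag -Dreactor.netty.ioWorkerCount=4 achieves the same.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        // Reactor Netty reads this system property when it creates its event loops,
        // so set it before the embedded server starts
        System.setProperty("reactor.netty.ioWorkerCount", "4");
        SpringApplication.run(DemoApplication.class, args);
    }
}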

Back to the slow responses of our health endpoint. With 4 threads to process requests and one of them busy with a long request, there are still 3 free threads, so I would expect the remaining requests to be handled by those 3 free threads. Let’s try to reproduce what happens when the event loop is blocked. In the code below I added a sleep to block the thread; in practice this could just as well be blocking I/O or a slow algorithm. The Mono wrapping the lambda function is returned and executed by WebFlux.

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
@RequestMapping("block")
public class BlockingEndpoint {

    @GetMapping(value = "time/{sleepMs}", produces = MediaType.TEXT_PLAIN_VALUE)
    public Mono<String> block(@PathVariable long sleepMs) {
        // the supplier sleeps, so it blocks whichever thread subscribes to the Mono
        return Mono.fromSupplier(() -> blockingFunction(sleepMs));
    }

    private String blockingFunction(long sleepMs) {
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return "OK, wake up from " + sleepMs + " ms sleep";
    }

    @GetMapping(value = "health", produces = MediaType.TEXT_PLAIN_VALUE)
    public String health() {
        return "OK";
    }
}

Here is an integration test that fires nine concurrent calls to the health endpoint while another test blocks the event loop with a slow request.

import java.time.Duration;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.reactive.server.WebTestClient;

@Execution(ExecutionMode.CONCURRENT)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.DEFINED_PORT)
class BlockingEndpointIT {

    @BeforeAll
    public static void beforeAll() {
        // force the minimum of 4 event-loop threads; my machine has 16 CPU cores
        System.setProperty("reactor.netty.ioWorkerCount", "4");
    }

    private final WebTestClient webClient = WebTestClient.bindToServer()
        .baseUrl("http://localhost:8080")
        .responseTimeout(Duration.ofMillis(30000))
        .build();

    @Test
    public void testBlocking() {
        this.webClient.get().uri("/block/time/11000")
            .exchange()
            .expectStatus().isOk();
    }

    @ParameterizedTest
    @MethodSource("numberOfTests")
    void health(int nr) {
        this.webClient.get().uri("/block/health")
            .exchange()
            .expectStatus().isOk();
    }

    private static Stream<Arguments> numberOfTests() {
        return IntStream.range(1, 10).mapToObj(Arguments::arguments);
    }
}

We need to configure JUnit to run the tests concurrently, so we add the file junit-platform.properties:

junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.config.strategy=fixed
junit.jupiter.execution.parallel.config.fixed.parallelism=8

When the tests are executed in IntelliJ IDEA we get something like the following result:

BlockingEndpointIT
health(int)             44s 827 ms
    [6] nr=6                411 ms
    [7] nr=7            10s 999 ms
    [5] nr=5                411 ms
    [2] nr=2                  1 ms
    [1] nr=1            10s 999 ms
    [4] nr=4            10s 993 ms
    [3] nr=3            10s 997 ms
    [8] nr=8                  8 ms
    [9] nr=9            10s 993 ms
testBlocking()          10s 993 ms

Even though 3 threads are free, some web requests are still scheduled on the blocked thread and have to wait until it is done. In the code a Mono wrapping the lambda function () -> blockingFunction(sleepMs) is returned; it is not scheduled on another thread, so it blocks the event loop.
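You can make this visible by logging the thread name inside the supplier. This is a small diagnostic sketch, not part of the original code:

return Mono.fromSupplier(() -> {
    // prints an event-loop thread such as reactor-http-nio-2 (or reactor-http-epoll-2 on Linux)
    System.out.println("handling on " + Thread.currentThread().getName());
    return blockingFunction(sleepMs);
});

Every request that lands on that same event-loop thread is stuck behind the sleep.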

How can we avoid blocking the event loop? One way is to schedule the blocking work on another thread, which for a Mono is done with subscribeOn. Let’s change the code in the block method and rerun the test.

return Mono.fromSupplier(() -> blockingFunction(sleepMs))
        .subscribeOn(Schedulers.boundedElastic());

Result of the test:

BlockingEndpointIT
health(int)
    [3] nr=3    1s 771 ms
    [6] nr=6        19 ms
    [1] nr=1       424 ms
    [5] nr=5        21 ms
    [4] nr=4       424 ms
    [2] nr=2         7 ms
    [8] nr=8         7 ms
    [9] nr=9        21 ms
    [7] nr=7    11s 38 ms
testBlocking()  11s 38 ms

What we see now is that the health endpoint no longer has to wait for the blocking function, because that function is not run on an event-loop thread. There are different Schedulers; see the Reactor documentation for the one most suitable for your use case, a rough guide is sketched below.
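As a rough guide (mono stands in for any Mono in your code):

mono.subscribeOn(Schedulers.boundedElastic()); // bounded, growing thread pool meant for blocking calls such as JDBC or file I/O
mono.subscribeOn(Schedulers.parallel());       // fixed pool with one thread per CPU core, meant for CPU-bound work
mono.subscribeOn(Schedulers.single());         // a single reusable thread for low-volume tasks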

Conclusion

Don’t use the reactive stack if you have a lot of blocking I/O; try to use the non-blocking counterpart instead. For databases there is R2DBC, the non-blocking counterpart of the normal blocking JDBC driver. When you use Spring WebFlux, know where the blocking or slow code is (which isn’t always that obvious) and schedule it on another thread.
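As an illustration, a Spring Data R2DBC repository returns Mono and Flux instead of blocking the calling thread; the Customer entity and repository name here are hypothetical:

import org.springframework.data.repository.reactive.ReactiveCrudRepository;

// all query methods return Mono/Flux, so a lookup never blocks the event loop
public interface CustomerRepository extends ReactiveCrudRepository<Customer, Long> {
}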

Project BlockHound can help to detect blocking code; also check out the excellent presentation Avoiding Reactor Meltdown by Phil Clay.
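A minimal sketch of wiring BlockHound into the integration test, assuming the io.projectreactor.tools:blockhound dependency is on the test classpath:

import reactor.blockhound.BlockHound;

@BeforeAll
public static void installBlockHound() {
    // BlockHound instruments the JVM and fails fast when a blocking call
    // (Thread.sleep, blocking I/O, ...) happens on a thread that is not allowed to block
    BlockHound.install();
}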

Written with Spring Boot 2.3.4
