Current behaviour
- Upon receiving any shutdown message from the BEAM,
ConnectionServer.terminate/2 immediately closes the connection.
- In-flight RabbitMQ messages are still processed successfully by the workers, but they cannot be acked because the connection to RabbitMQ has already been closed.
- After a timeout without a connection, RabbitMQ assumes the unacked messages are lost and requeues them. See Automatic requeueing.
- Once the connection is recovered, the messages are redelivered; if their handlers are not idempotent, this can cause errors and corrupted state.
Expected behaviour
This is one of the possible solutions:
- Coney stops processing new RabbitMQ messages upon receiving a shutdown message from the BEAM.
- Existing RabbitMQ messages are processed and acked.
- Once all RabbitMQ messages have been processed, Coney shuts down.
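A minimal sketch of that drain-then-stop shutdown, assuming the AMQP Elixir library's API (AMQP.Basic.cancel/2); the module name, state shape, and {:acked, tag} notification are illustrative, not Coney's actual code:

```elixir
defmodule Coney.DrainingConsumer do
  use GenServer

  @impl true
  def init(state) do
    # Trap exits so a supervisor shutdown calls terminate/2 instead of
    # killing the process outright.
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(_reason, %{channel: channel, consumer_tag: tag} = state) do
    # 1. Tell RabbitMQ to stop delivering new messages to this consumer.
    AMQP.Basic.cancel(channel, tag)
    # 2. Wait until every in-flight message has been processed and acked
    #    before letting the supervision tree close the connection.
    drain(state.in_flight)
  end

  # Workers are assumed to send {:acked, delivery_tag} once a message is
  # fully handled and acked.
  def drain(0), do: :ok

  def drain(in_flight) do
    receive do
      {:acked, _delivery_tag} -> drain(in_flight - 1)
    after
      30_000 -> {:error, :drain_timeout}
    end
  end
end
```

For this to work, the consumer's :shutdown timeout in its child spec must be long enough for the drain; the default for workers is only 5 seconds.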
Another solution could be to leave it up to the application code (the workers) to deal with redelivered messages. Increasing the heartbeat timeout is not a solution, since with rolling deployments the connection might be reopened from another host.
Technical information
- The child order in ApplicationSupervisor should be reversed: first
ConnectionServer should start, and then ConsumerSupervisor. Since children are terminated in reverse start order, ConsumerSupervisor first terminates every ConsumerServer, and only then does ConnectionServer close the connection.
- ConsumerSupervisor does not have to be a DynamicSupervisor, since the full list of consumers is known ahead of time. It could be a regular Supervisor.
- ConnectionServer can keep a map of {consumer, channel} so that ConsumerServer does not hold any connection (channel) state. That way, when ConnectionServer receives a {:DOWN, _, _, _} message it only has to update its {consumer, channel} map, and all the ConsumerServer processes are unaffected. ConsumerServer can be responsible only for processing messages and building the response, while ConnectionServer communicates with RabbitMQ.
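The supervision changes above might look like the following; the module names are taken from this issue, and the sketch assumes a plain Supervisor with a static child list is enough:

```elixir
defmodule Coney.ApplicationSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Started first, therefore terminated LAST: the connection stays
      # open while the consumers below finish and ack in-flight messages.
      Coney.ConnectionServer,
      # A regular Supervisor is enough, since the consumer list is known
      # ahead of time. Terminated first, it stops every ConsumerServer
      # before the connection is closed.
      Coney.ConsumerSupervisor
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Supervisors always terminate children in the reverse of their start order, so listing ConnectionServer first guarantees it outlives every consumer during shutdown.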
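The channel-owning ConnectionServer from the last point could be sketched as follows; only the AMQP.Channel / AMQP.Basic calls are the real AMQP library API, everything else is hypothetical:

```elixir
defmodule Coney.ConnectionServer do
  use GenServer

  # state: %{conn: AMQP.Connection.t(), channels: %{consumer_pid => channel}}

  # A consumer registers itself; ConnectionServer opens a channel for it
  # and keeps it in the {consumer, channel} map.
  def handle_call({:register, consumer_pid}, _from, state) do
    {:ok, channel} = AMQP.Channel.open(state.conn)
    # Monitor the consumer so a crash only cleans up its own entry.
    Process.monitor(consumer_pid)
    {:reply, :registered, put_in(state.channels[consumer_pid], channel)}
  end

  # Consumers ask ConnectionServer to talk to RabbitMQ on their behalf,
  # e.g. to ack a delivery; they never touch the channel themselves.
  def handle_call({:ack, delivery_tag}, {consumer_pid, _tag}, state) do
    channel = Map.fetch!(state.channels, consumer_pid)
    AMQP.Basic.ack(channel, delivery_tag)
    {:reply, :ok, state}
  end

  # A consumer died: drop its channel from the map. All other
  # ConsumerServer processes are unaffected.
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    {channel, channels} = Map.pop(state.channels, pid)
    if channel, do: AMQP.Channel.close(channel)
    {:noreply, %{state | channels: channels}}
  end
end
```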