RabbitMQ, Docker Swarm, NFS & Segment Corruption: What happened and how I fixed it

Recently, I ran into a critical issue in one of my environments involving RabbitMQ running on Docker Swarm with an NFS-mounted volume. This blog post is a breakdown of what went wrong, how I investigated the problem, and what I’m doing to prevent it from happening again. Hopefully, this helps someone out there facing similar issues.

My application suddenly couldn’t connect to RabbitMQ Streams, although regular AMQP connections were fine. I noticed this and restarted the RabbitMQ container using Portainer’s stack interface.

Shortly afterward, RabbitMQ crashed with the following error:

ra_log_segment_unexpected_eof "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq/coordination/rabbit@rabbitmq/.../00000001.segment"

This pointed to segment corruption in RabbitMQ’s Raft metadata log, which manages coordinated features like Streams.

Corrupted file

Root Cause Analysis

Here’s what I believe happened, step-by-step:

  1. The application couldn’t connect to RabbitMQ Streams; the root cause of that initial failure is still unknown.
  2. RabbitMQ was restarted from Portainer.
  3. When the container was stopped:
    • Docker sent a SIGTERM to the container.
    • No stop_grace_period was configured, so Docker waited only 10 seconds before sending a SIGKILL.
    • RabbitMQ didn’t shut down in time, and was killed while still flushing logs.
  4. A new RabbitMQ container started too early, while the old one was still being terminated.
  5. Both containers accessed the same NFS volume mounted under /mnt/messaging.
  6. This caused concurrent writes to the .segment file, leading to corruption and triggering Raft recovery failures.
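
One way to spot this kind of overlap is to list which containers currently mount the messaging path, as in the sketch below (assuming the NFS share shows up as a bind mount at /mnt/messaging; adjust the grep pattern if you use a named volume instead):

# Print each running container's name and mounts, then filter for the shared path
docker ps -q | while read -r id; do
  docker inspect --format '{{.Name}} {{range .Mounts}}{{.Source}}:{{.Destination}} {{end}}' "$id"
done | grep messaging

If two RabbitMQ containers show up here at the same time, they are sharing the data directory.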

Key Discovery: Docker’s Stop Behavior

To better understand how Docker handles shutdowns, I wrote a small script that logs shutdown signals such as SIGTERM. I used it as the container’s entrypoint, triggered shutdowns by clicking Stop Stack in Portainer, and watched the container logs in real time to see exactly how and when the container received signals and exited.

Here’s what I found:

  • With stop_grace_period: 90s, Docker waited the full 90 seconds after sending SIGTERM before stopping the container.
  • Without it, Docker force-killed the container after just 10 seconds, regardless of what the container was doing or logging.

#!/bin/bash
# Logs when SIGTERM/SIGINT arrives and keeps counting afterwards, so the
# container logs show how long Docker actually waits before sending SIGKILL.

WAITING_SEC=60
SIGTERM_RECEIVED_AT=-1
SECONDS_PASSED=0

exit_script() {
    echo "SIGTERM received"
    SIGTERM_RECEIVED_AT=$SECONDS_PASSED
}

trap exit_script SIGINT SIGTERM

echo "Counting down from $WAITING_SEC seconds"

for ((n = WAITING_SEC; n; n--)); do
    echo "- $n seconds"

    if [[ $SIGTERM_RECEIVED_AT -ge 0 ]]; then
        elapsed=$((SECONDS_PASSED - SIGTERM_RECEIVED_AT))
        echo "- SIGTERM received $elapsed seconds ago"
    fi

    sleep 1
    ((SECONDS_PASSED++))
done

echo "EXIT SUCCESSFULLY"

This reinforced the need to properly configure graceful shutdown windows for stateful services.

How I Fixed It

In Non-Production (Dev/Test)

If you’re okay with losing Streams metadata:

# Inside the container
cd /var/lib/rabbitmq/mnesia/rabbit@rabbitmq/coordination/rabbit@rabbitmq
mv RABBIT*/00000001.segment 00000001.segment.bak

Then restart the container.

This may wipe RabbitMQ Streams state!
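
If RabbitMQ runs as a Swarm service, one way to restart it is a forced redeploy; the service name rabbitmq_rabbitmq below is just a placeholder for your stack’s service name:

docker service update --force rabbitmq_rabbitmq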

In Production: Safe Recovery

  1. Stop all clients using RabbitMQ to avoid more damage.
  2. Take a full backup of the NFS volume.
  3. Run RabbitMQ diagnostics:
rabbitmq-diagnostics check_ra_log_integrity
rabbitmq-diagnostics repair_streams_metadata
  4. Only consider moving or removing segment files if the recovery tools fail and you accept the data loss.
  5. Otherwise, restore from backups (a minimal backup/restore sketch follows this list).
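
For steps 2 and 5, a minimal backup/restore sketch, assuming the NFS share is mounted at /mnt/messaging on the node where you run this and that a /backup directory exists; the archive names are placeholders:

# Backup: archive the whole RabbitMQ data directory while the broker is stopped
tar -czf /backup/rabbitmq-$(date +%F).tar.gz -C /mnt/messaging .

# Restore: unpack the archive back into the (empty) data directory, then start RabbitMQ
tar -xzf /backup/rabbitmq-YYYY-MM-DD.tar.gz -C /mnt/messaging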

How I’m Preventing This

1. Set stop_grace_period: 90s

This ensures RabbitMQ has enough time to shut down cleanly before Docker forces termination.

    stop_grace_period: 90s   # service-level key, not nested under deploy:
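
After redeploying, you can check that the setting actually reached the service spec; a sketch, with rabbitmq_rabbitmq again standing in for your service name (the value is reported in nanoseconds):

docker service inspect rabbitmq_rabbitmq \
  --format '{{.Spec.TaskTemplate.ContainerSpec.StopGracePeriod}}'
# 90000000000 corresponds to 90s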

2. Add a Healthcheck

So that Swarm only treats a restarted RabbitMQ container as up once it is actually healthy:

    healthcheck:
      test: ["CMD-SHELL", "rabbitmq-diagnostics -q status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
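
With the healthcheck in place, the container’s current health state can be read directly from docker inspect; replace <container-id> with the RabbitMQ task’s container:

docker inspect --format '{{.State.Health.Status}}' <container-id>
# prints: starting, healthy or unhealthy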

3. Manual Check Before Restarting

Before clicking “Start Stack” in Portainer, verify that the RabbitMQ container is fully stopped in the Containers view.
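
A quick way to double-check from the CLI before starting the stack again, assuming the container name contains rabbitmq:

# Should print nothing once the old RabbitMQ container is really gone
docker ps --filter name=rabbitmq --format '{{.Names}} {{.Status}}'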

Lessons Learned

  • Stateful services like RabbitMQ must be given enough time to shut down.
  • Never run multiple containers against the same volume unless it’s clustered and explicitly safe to do so.
  • Portainer’s UI may not reflect the full container state — always check in the Containers tab.

RabbitMQ is a powerful message broker, but like any stateful app, it needs careful orchestration when running in Docker Swarm. If you’re running with NFS, you must ensure only one container has access at a time, and that shutdowns are graceful.

Hopefully this post helps someone avoid the same mistake.

This post is licensed under CC BY 4.0 by the author.