Operations grimoire/Incidents/2025-12-14-Hypervisor

From Nasqueron Agora

Our main hypervisor for production VMs stopped responding entirely. A reboot at datacenter level was needed.

Impact was wide: websites (including the wiki), databases, web services, Docker containers (including DevCentral), the Drake router, and the primary DNS.

Incident timeline

All timestamps are UTC.

2025-12-14
  • 14:11 - Agora (web-001) and DevCentral (docker-002) up.
  • 14:02 - Routing restored between WindRiver and router-001 to help troubleshoot other GRE connections and routing.
  • 13:42 - On docker-002, nginx answers and serves 502.
  • 13:40 - Dorian started all machines.
  • 13:38 - Dorian confirms autostart is disabled on each machine.
  • 13:23 - Server rebooted by OVH, VMware console answers.
  • 13:17 - Dereckson confirmed IPMI wasn't reachable for either console or reboot.
  • 12:54 - OVH monitoring warns us hyper-001 server is down.

Timestamps are estimates, but 10:22 for the MariaDB restart is accurate (from Salt).
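
To put numbers on the timeline, the durations between its key events can be computed as in the sketch below. It only reuses the timestamps listed above. The event labels and the helper name `minutes_between` are illustrative, and durations are measured from the OVH alert because the actual start of the outage is not known.

```python
from datetime import datetime

# Timestamps copied from the timeline above (UTC, 2025-12-14).
# The keys are illustrative labels, not names used elsewhere.
EVENTS = {
    "alert": "12:54",        # OVH monitoring warns hyper-001 is down
    "rebooted": "13:23",     # server rebooted by OVH
    "vms_started": "13:40",  # Dorian started all machines
    "services_up": "14:11",  # Agora and DevCentral up
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    steps = [
        ("alert", "rebooted"),
        ("rebooted", "vms_started"),
        ("vms_started", "services_up"),
        ("alert", "services_up"),
    ]
    for start, end in steps:
        print(f"{start} -> {end}: {minutes_between(EVENTS[start], EVENTS[end])} min")
```

Measured this way, about 77 minutes passed between the alert and Agora and DevCentral being back up, 31 of them after the VMs were started manually.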

Analysis

Server didn't respond to ping. Root cause is still unknown.

Fix

At OVH level, soft reboot of the server.

We also enabled autostart of the VMs. It was previously disabled, so Dorian had to start them manually.

Actionables

  • T2197 Fix GRE tunnel between WindRiver and router-001
  • Enable autostart for VMs (done)
  • Enable autostart for Docker containers (to do)
    • Create a network for containers needing acquisitariat; this will be more stable than links for autostart
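
As a rough sketch of the Docker actionable, the snippet below builds the standard Docker CLI calls that create a user-defined network, attach containers to it, and set a restart policy so the daemon starts them again after a reboot. This is an assumption about one possible approach, not a description of how containers are actually deployed here. The network name `acquisitariat-net`, the container list, and the helper names are hypothetical.

```python
import subprocess


def autostart_commands(network, containers, policy="unless-stopped"):
    """Build the docker CLI invocations as argv lists, without running them."""
    # A user-defined network lets containers resolve each other by name,
    # which removes the need for legacy links.
    commands = [["docker", "network", "create", network]]
    for name in containers:
        commands.append(["docker", "network", "connect", network, name])
        # The restart policy asks the Docker daemon to start the container
        # again when the daemon itself starts, e.g. after a host reboot.
        commands.append(["docker", "update", "--restart", policy, name])
    return commands


def apply(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    apply(autostart_commands("acquisitariat-net", ["acquisitariat", "devcentral"]))
```

The dry run is the default so the commands can be reviewed before anything is changed on docker-002.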